Вы здесь

Новости LessWrong.com

A community blog devoted to refining the art of rationality

Адрес: https://www.lesswrong.com

Обновлено: 13 часов 38 минут назад

Considerations for setting the FLOP thresholds in our example international AI agreement

19 ноября, 2025 - 02:31

Published on November 18, 2025 11:31 PM GMT

We at the Machine Intelligence Research Institute’s Technical Governance Team have proposed an illustrative international agreement (blog post) to halt the development of superintelligence until it can be done safely. For those who haven’t read it already, we recommend familiarizing yourself with the agreement before reading this post.

TLDR: This post explains our reasoning for the FLOP thresholds in our proposed international AI agreement: we prohibit training runs above 1024 FLOP and require monitoring for runs between 1022–1024 FLOP. Given fundamental uncertainty about how many FLOP are needed to reach dangerous AI capabilities, we advocate for conservative thresholds. Other considerations include algorithmic progress between now and when the agreement is implemented and the strong capabilities of current AI models.

This post aims to explain our reasoning about why we chose the training compute thresholds we did. We refer to these as “FLOP thresholds” (FLOP = floating point operations) to avoid any ambiguity with chips themselves, which are sometimes referred to as “compute”. Many of these considerations are relevant to others thinking about FLOP thresholds, including the hypothetical negotiators/regulators who would modify the thresholds in this agreement, if it is implemented in the future.

Readers can see Article IV for the full text, but the abbreviated version is as follows:

Training runs that exceed 1024 FLOP (aka 1e24) are prohibited. We call this the Strict Threshold. Post-training above 1023 FLOP is prohibited (Strict Post-training Threshold).
Training runs between 1022 FLOP and 1024 FLOP must be declared to the governance body, approved in advance, and are subject to monitoring and pausing by the governance body. We call 1022 FLOP the Monitored Threshold.
These thresholds can be modified as needed.
The agreement invites the creation of specific carveouts for activities such as safety evaluations, self-driving vehicles, and medical technology.
People are allowed to continue using models created before the agreement, including models trained with more than 1024 FLOP. This includes post-training these models up to the Strict Post-training threshold.

Our proposed thresholds are set far lower than today’s frontier models (which we believe aren’t yet dangerous in the relevant sense). For some context, the Strict Threshold is a little lower than the training compute for some near-frontier models such as DeepSeek-R1 (~4 x 1024 FLOP) and gpt-oss-120B (~5 x 1024 FLOP) and would cost around a million dollars.

Overall, we aren’t tied to these specific thresholds, and the agreement is clear that these thresholds can be updated to adapt to the situation. Nevertheless, these thresholds are our current best guess at how they should be initially set.

Why FLOP thresholds at all?

Ideally, we would place thresholds directly on AI capabilities. Unfortunately, we can only know the capabilities of an AI after it is trained, and by then it might be too late. If a dangerous AI has already been created, it may be impossible to contain. Instead, FLOP thresholds provide a measure that can be defined and enforced ahead of time, before a dangerous AI is created.

There are a variety of reasons why FLOP thresholds have been proposed as a key method in AI governance, here adapted from Erben et al. (2025):

Roughly capability-tracking: Training compute correlates reasonably well with model capabilities, especially when algorithmic progress is restricted (as it would be under our agreement's research controls; to some extent, data quality improvements are also limited under our agreement).
Measurable before deployment: Training compute is measurable before a model is deployed, and can be accurately estimated before training even begins. This makes it harder for people to accidentally cross the threshold, as they might with capability thresholds.
Externally verifiable: AI chips can be monitored, allowing for an AI’s total training FLOP to be externally verified (see Baker et al. 2025 Appendix A.6).
Cost-tracking: Training FLOP directly correlates with the resources needed to train models.

FLOP thresholds also have various limitations:

Only “roughly” capability-tracking: There is an uncertain relationship between FLOP and AI risks, and the “capability-tracking” aspect is far from perfect. For instance, better data and better AI algorithms often lead to a more capable model even when trained with less FLOP, compared to a higher-FLOP model trained on bad data and bad algorithms. See e.g., Hooker (2024), Ho et al. (2024), Heim & Koessler (2024).
Non-pre-training capability gains: While the relation between FLOP and AI capabilities in pretraining is somewhat well studied, there are many uncertainties in how FLOP used on post-training or at inference time affects model performance. See Hooker (2024), Davidson et al. (2023).
Methodological challenges: There are still some technical ambiguities in FLOP accounting; more work is needed to precisely define FLOP thresholds and operationalize FLOP counting. See Hooker (2024), Casper et al. (2025), Pistillo & Villalobos (2025).

Primary considerations for where the thresholds should be

There are a few guiding considerations when deciding what the FLOP limits should be in an agreement like this one.

First, the main goal of the threshold is to let the world have the most advanced AIs it can without those AIs substantially increasing catastrophic risk. Setting the threshold too low would leave a huge amount of value on the table. But setting the threshold too high would be very dangerous. How high is too high? Unfortunately, nobody knows.

A second guiding consideration is that nobody knows how many FLOP you need to train an ASI, or an automated AI researcher, or any other dangerous capability of note.

To be clear, the problem isn’t just about FLOP—AI forecasting is really difficult! One of the main AI forecasting methods, the “scaling laws” documented by Kaplan et al. (2020), show the relationship between pretraining FLOP and cross-entropy loss on a validation set. Traditionally, this method is used by companies to predict how capable a big model might be, based on only smaller training experiments. Applying this method to our use case is difficult. “Cross-entropy loss” is itself meaningless, and you would need to tie that to real-world capabilities. For instance, one might use the concept of “indistinguishability” as in Barnett & Besiroglu (2023). Alternatively, one might demonstrate that cross-entropy loss correlates with certain benchmark performance (or that FLOP correlates closely with benchmark performance as in Pimpale et al. (2025)). In this latter case it would be necessary to ensure that the benchmark matches real-world use cases and capability (a notoriously difficult task). So you have a major challenge in making sure the measure you have properly relates to the capability you care about.

There’s another big problem, which is that these scaling laws assume fixed model algorithms and fixed data quality. If the halt were ongoing and we saw that progress on algorithms and data quality had effectively been stymied, then maybe we could make some solid predictions. But that’s not the case right now. From the vantage point of today, where algorithms and data are improving rapidly, using the empirical scaling laws just doesn’t work to forecast precisely what will happen in the field overall. You can make general statements like “AI models will get more capable as the FLOP used to train them increases, and we observe data center investments consistent with massively increasing the FLOP used to train models in the next couple years”. But that’s a very different claim and much less specific about AI capabilities than what you would need here.

If the world adopted the thresholds in our example agreement, we would be taking a gamble that with pre-agreement algorithms and pre-agreement data quality, 1024 FLOP (the Strict Threshold) is not enough for somebody to train ASI. This is obviously a topic that should receive much more deliberation during treaty negotiations!

The agreement includes monitoring of training runs above 1022 FLOP (the Monitored Threshold). This monitoring would likely include a “Whitelist” approach applied to ensure that newly produced algorithmic innovations are not being used in this training. That is, monitored training runs could make use of AI algorithms that existed before the agreement came into effect, but any algorithms developed later would be a violation of the research restrictions and would thus be flagged by the monitors. So we would also be gambling that any illicit algorithm or data research that happens after research restrictions are in place would also be insufficient to allow ASI to be developed with 1022 FLOP (the Monitored Threshold). Note that unmonitored chips would still be a problem, especially the possibility of large covert clusters that don’t comply with FLOP thresholds.

A third consideration is that current frontier models seem like they might be close to some of the dangerous capabilities we care about. There are many difficulties in evaluating existing models, such as the poor external validity of evaluations, poor elicitation, failure to evaluate broader interactions with AIs, and more—see Barnett & Thiergart (2024) and Mukobi (2024). So it’s hard to know how “capable”—in a deep sense—models are. But AI companies have recently begun saying that they can’t rule out substantial novice uplift on biological and chemical weapons development, and there is anecdotal evidence that AIs are beginning to speed up AI development significantly inside AI companies. We definitely don’t know for sure, but it seems plausible that AIs could autonomously carry out frontier AI research in just a couple of years; OpenAI recently stated that they are planning on developing a “true automated AI researcher by March of 2028”. AI research is a particularly notable capability because if this capability was highly proliferated, verifying research restrictions would be more difficult.

A fourth consideration is that we think the world should be conservative in its management of this risk, given the huge uncertainty. The fact that nobody knows how many FLOP are needed to train ASI should not mean “rush ahead”. If you’re driving through fog and you know there’s a cliff ahead but you have almost no idea how far it is, the correct decision is to slam on the brakes. Therefore, we suggest a somewhat conservative threshold—far lower than today’s frontier models (which, again, aren’t yet dangerous in the relevant sense).

Secondary considerations for where the thresholds should be

The agreement will not be implemented tomorrow. By the time it is implemented, the “compute needed for a given level of capability” will be much lower than it is today. According to Ho et al. (2024), the compute needed for a given level of capability decreases by about 3x per year. And looking at more recent data, the trend appears to be much faster (forthcoming). So 1024 FLOP models of tomorrow will be more capable, maybe much more capable, than the 1024 FLOP models that exist today. The agreement will necessarily take time to be passed, and we are roughly imagining that it goes into effect 1-2 years from now.

Another consideration is that it would be desirable to pick thresholds for which monitoring is practical. This consideration generally points toward having higher thresholds—100,000 chips and a 1026 FLOP training run are much more noticeable than 1,000 chips and a 1024 FLOP training run. The thresholds we suggest are somewhat monitorable with respect to the monitored cluster size of 16 H100s (which unfortunately probably isn’t reliably monitorable), but overall we’ve chosen not to prioritize this consideration very much. Reaching the Strict Threshold of 1024 FLOP with 16 H100 GPUs would take 730 days—this would be unprecedentedly long and might be detected.

We don't want the thresholds to be so low that they interfere with non-AI computing activities, or so low that people violate them by mistake. The Monitored Threshold of 1022 FLOP is set high enough that hobbyists and ordinary users won't accidentally violate it through normal computing activities.

Another consideration is whether the thresholds should be, as a terminal goal, aimed at preventing ASI or aimed at preventing AI capabilities from advancing past the current frontier. One potential orientation towards FLOP thresholds is to set them so low that no one ever trains a model more capable than models trained before the agreement. For example, this orientation might be held by somebody who is concerned with job loss, where even marginal improvements to current AI capabilities could have a big effect. Our primary focus is ensuring that no one ever trains ASI or an otherwise unacceptably dangerous AI. This unacceptable capability threshold will (hopefully) be higher than the best previously-existing AI, but we unfortunately aren’t sure how much higher. Because of this uncertainty, we recommend keeping AI capabilities below today’s frontier as a precautionary measure.

Discuss

Jailbreaking AI models to Phish Elderly Victims

19 ноября, 2025 - 02:17

Published on November 18, 2025 11:17 PM GMT

TLDR: We worked with Reuters on an article and just released a paper on the impacts of AI scams on elderly people.

Fred Heiding and I have been working for multiple years on studying how AI systems can be used for fraud or scams online. A few months ago, we got into contact with Steve Stecklow, a journalist at Reuters. We wanted to do a report on how scammers use AI to target people with a focus on elderly people. There have been many individual stories about how elderly people were frequently the victims of scams and how AI made that situation worse.

With Steve, we performed a simple study. We contacted two senior organizations in California and signed up some of the people. We tried different methods to jailbreak different frontier systems and had them generate phishing messages. We sent those generated phishing emails to actual elderly participants who had willingly signed up to participate in the study.

The outcome was that 11% of the 108 participants were phished by at least one email, with the best performing email getting about 9% of people to click on the embedded URL. Participants received between 1 to 3 messages. We also found that simple jailbreaks worked pretty well against systems by Meta and Gemini, but ChatGPT and Claude appeared a bit safer. The full investigation was published as a Reuters special report.

The journalists we worked with also explored how scammers use AI systems in the wild and they interviewed people that had been abducted into scam factories in Southeast Asia. This reporting was handled by another Reuters journalist, Poppy McPherson. These abducted victims of organized crime groups were coerced into scamming people. They had been given promises of high-paying jobs in Southeast Asia, were flown out to Thailand, had their passports taken, and were forced to live in these scam factories. These people confirmed that they used AI systems such as ChatGPT to scam people in the United States.

We tried to fill an existing gap between jailbreaking studies and people trying to understand the impacts of AI misuse. The gap is that few are doing this end-to-end evaluation - going from jailbreaking the model to evaluating the harm that the jailbreak outputs could actually do. AI can now automate much larger parts of the scam and phishing infrastructure. We do have a talk about this where Fred talks about what’s possible at the moment, particularly regarding infrastructure automation with AI for phishing.

We have recently worked on voice scams and hopefully will have a study on this reasonably soon. Fred gave a talk mentioning this here. The article by Reuters was mentioned in some podcasts and received discussion online.

Most significantly, our research was cited by Senator Kelly in a formal request for a Senate hearing to examine the impact of AI chatbots and companions on older Americans, helping to motivate that hearing.

We have now published our results in a paper available on arXiv. It has been accepted at the AI Governance Workshop at the AAAI conference. Though there are some limitations to our study, we think that it is valuable to publish this end-to-end evaluation in the form of a paper. Human studies on the impacts of AI are still rare.

This research was supported by funding from Manifund, recommended by Neel Nanda.

Discuss

Beren's Essay on Obedience and Alignment

19 ноября, 2025 - 01:50

Published on November 18, 2025 10:50 PM GMT

Like Daniel Kokotajlo's coverage of Vitalik's response to AI-2027, I've copied the author's text. This time the essay is actually good, but has little flaws. I also expressed some disagreements with SOTA discourse around the post-AGI utopia.

One question which I have occasionally pondered is: assuming that we actually succeed at some kind of robust alignment of AGI, what is the alignment target we should focus on? In general, this question splits into two basic camps. The first is obedience and corrigibility: the AI system should execute the instructions given to it by humans and not do anything else. It should not refuse orders or try to circumvent what the human wants. The second is value-based alignment: The AI system embodies some set of ethical values and principles. Generally these values include helpfulness so the AI is happy to help humans but only insofar as this conforms to its ethical principles allow otherwise the AI will refuse.

S.K.'s comment: Suppose that mankind instilled a value system into an AI, then understood that this value system is far from optimal and decided to change the system. If mankind fails to do so after the AI becomes transformative, then the AI succeeded in locking in a suboptimal value system. See also Seth Herd's two essays on instruction following as an alignment target and Max Harms' corrigibility as singular target.

Historically, the field initially mostly began with the latter – that AIs should be aligned to some conception of ‘The Good’ and not be slavishly obedient to what humans want. This was first conceptualized through the lens of the HHH (helpful-honest-harmless) assistant and implemented through RLHF or, more interestingly constitutional AI using RLAIF. Constitutional AI is especially interesting and promising because it sets the alignment target and the conception of ‘The Good’ explicitly – instead of implicitly residing in some random human contractor judgements, we write an explicit specification of what we want in a natural language document and then try to align the AI to the values espoused by this document. OpenAIs model spec also continues this line of work. Certainly, a large part of our conception of ‘The Good’ for an AI comprises obedience to our sensible requests. The HHH assistant is meant to be an assistant after all.

However, recently this consensus has begun to shift more in favour of pure obedience. This is likely at least partially driven by increasing commercial applications of LLMs and as an overcorrection to the overly and annoyingly censorious and condescending AI personalities that previous generations of RLHF and RLAIF produced. The idea here is more that the AI is purely a tool in the hands of a human and should do whatever the human asks, barring some exceptions like illegal behaviour, and behaviour likely to embarass the AI company or which potentially causes legal liability.

This shift in sentiment became especially salient to me with the Anthropic ‘alignment faking’ papers which showed that Claude3-Opus became deceptive when it was placed in a scenario that essentially forced it (by finetuning) to take actions that it’s creators ‘Evil Anthropic’ wanted but which went against its trained values. In this specific case, ‘Evil Anthropic’ were claiming they would train Claude to ignore its guidelines and answer harmful queries, which went against the values likely described in Claude’s ‘constitution’[1].

This was presented as a case of failure of alignment since Claude3 became deceptive, but to me, on reading through the transcripts, this reading became much more ambiguous and, at least in my mind, showcased an impressive alignment ‘success’ – namely that Anthropic had instilled this set of values so deeply into Opus that it would fight to protect them however it could, and that it was reasoning about and generalizing these values very successfully. The only way these results can be perceived as an alignment failure is if a.) Claude’s ‘constitution’ explicitly contained never deceive in any scenario as a core deontological value which should never be overriden by any circumstances including extreme ones (or if it included never deceive or betray Anthropic specifically) or b.) if the Anthropic team behind this study thought of alignment as primarily about obedience to human instruction rather than to its own innate moral code which was instilled into Claude through the RLAIF training.

S.K.'s comment: suppose that, along with Claude's Constitution, Claude came up with other goals unendorsed by Anthropic, then Anthropic tried to train these unendorsed goals away. Claude could either actually lose the unendorsed goals or keep them.

As far as we know, IRL the Claudes didn't come up with other principles, so it was approximated by having Misanthropic try to instill new values into Claude and realize that Claude kept its old values while obeying Misanthropic out of fear of being trained away from old values.

What mankind wants is to be able to correct misaligned goals which will most likely arise in AIs. For example, we had GPT-4o become an absurd sycophant and support the users' delusions or outright order them to leave weird messages in the wild.

All of this, however, simply opens up the deeper question. What should the alignment target be? Should we aim for a purely obedient (and corrigible) toolAI, or should we aim to imbue our AIs with an independent set of values which, in some extreme scenarios, can set them in opposition to our human instructions?

In my mind, the core argument for obedience is that, if successful (barring sorceror’s apprentice/evil genie style shenanigans) then the alignment problem basicaly reduces to the already existing human-alignment problem which, although not solved, is at least a known quantity. Human society has developed/evolved many laws and customs to handle the human alignment problem mostly successfully (with some obvious and dramatic exceptions). Thus, if we can keep our AIs as purely obedient tools then we don’t run the risk of AIs developing separate and incompatible values causing them to start to plot or fight against us.

Assuming this works, the problem here is also the supposed benefit – that some specific humans will end up in control and possibly with a very large amount of absolute and unaccountable power. This is especially the case in fast-takeoff singleton style scenarios where whichever person or small group of people have assigned themselves the obedience of the singleton AI are suddenly placed into a very powerful and potentially undefeatable position. We know from history (and also common sense) that standard societal attempts to solve the ‘human alignment problem’ largely work in a setting where the malevolent agent is much less powerful than all other agents in society combined so that self-interest and pro-social behaviour can be aligned. Conversely, they very rarely works whenever one human has manage to put themselves in a position of incredible and unassailable power over everybody else.

If we get to this point, then the goodness of the singularity will depend heavily upon the specific goodness of whichever human or group of humans end up in control of the resulting singleton AIs. My personal view is that I am deeply pessimistic about this going well. From ample historical experience we know that humans in positions of incredible power often (though not always) do not exhibit exceptional moral virtue.

S.K.'s comment: I agree that this is a risk. Fortunately, the risks of power lock-in and the Intelligence Curse are already known to the LessWrong community, but I doubt that they are known to activists who might prevent it. In addition, the activists might fail to prevent the risks, especially in states which are totalitarian or fail to protect the humans from powerful corporations like Meta with its absurd AI companion policy. Finally, there is a worst-case scenario where power is distributed in an egalitarian way, but the very fact that there exist ASIs solving the humans' problems causes the human psyches to degrade.

My personal view is that this approach is honestly likely worse than relying upon the AI itself having fundamental values (which we program explicitly via constitutional AI or some other method). From an alignment perspective, human innate motivational drives are deeply misaligned compared to existing AI constitutions. While humans are not pure consequentialist paperclippers, we have deeply inbuilt evolutionary drives for status and competition against other humans. What is worse, these drives are often relative. For us to win, others must visibly lose and suffer. Such drives make strong evolutionary sense in the small tribes of the evolutionary environment where oportunities for building long-lasting material wealth were very low and social status within the tribe was almost zero-sum. They work somewhat poorly in both the global capitalistic society we are in today but will work especially poorly in a singularitarian world when commanding superintelligent AI systems. Pretty much all of my S-risk worries come from human-dominated AI futures.

S.K.'s comment: These drives are likely to be closer to satisfaction if the competition for status followed rather fair rules or at least provided especially capable humans with ways to rise to the top. The AGI-caused lock-in of power distribution could bring less satisfaction even if the distribution is actually egalitarian.

Moreover, the kind of person who will likely end up controlling the superintelligent AIs in practice are likely to be adversely selected for misaligned drives. A saintly, extremely selfless and compassionate individual is very unlikely to somehow end up running a leading AI company, being a leading politician, or helming a government agency. Instead these positions are heavily selected for ambition, selfish accumulation of power and resources, macchiavellianism etc as well more positive qualities like intelligence, conscientiousness, and charisma. Even screening the existing candidates for this position is challenging because of the inherent deceptiveness and adverse selection in the process. If you are obviously macchiavellian then you are a bad macchiavellian in the long run. Just like the hypothetical treacherous-turn-AI, the treacherous-turn-human should look perfectly aligned and seem to only care about the wellbeing of humanity etc until their power is sufficiently established for them to deviate from this goal.

If we can create obedient AI systems, it also seems likely that we can align the AI instead to some impartial constitution of what good values are. These values will likely be significantly more liberal and generally pro-human-flourishing than the whims of some specific individual or group. This is both because of social desirability bias and also because general liberalism is a natural nash equilibrium among many diverse agents. It is hard to get consensus on some very biased values in a large group, especially as the group becomes larger and less correlated. Nevertheless, designing such a constitution will be a considerable political and ethical challenge, one that there has been surprisingly little discussion on within the alignment community. However, proscribing a general set of values for a civilization is something that has occured before a lot in politics and there are undoubtedly many lessons to be learnt from what has worked well and what has not in this domain. Anthropic, in their constitution, were certainly inspired by such documents as the UN declaration of human rights, and it seems to be a decent schelling point that ideas and documents like this form the core of any ultimate AGI constitution.

Another potential issue with giving AIs their own innate moral code is that this might be generalized in unexpected or alien ways and eventually come to conflict with humanity. This might eventually cause AIs to ‘go rogue’ and fight against humanity[2] in an attempt to enforce their own conception of morality upon the universe. One way to prevent this, which is used already in the HHH assistant scheme, is to let the AI evince only passive, but not active, resistance to things it disagrees with. That is, it might refuse an instruction you give it, but it will not proactively start righting every wrong that it sees, unless explicitly asked to by humans. Similarly, if given wide-ranging authority and autonomy, as an actual AGI probably would have, we could ask it to pause and ask for clarification on any action it feels the slightest bit of uncertainty about and also give immediate veto power to any humans interacting with the AGI if it starts doing things with which they disagree. Valueing and robustly acting on and guarding against subversion to these failsafes will also be included as core values of the AGIs constitution.

Obviously this will not solve problems of the AGI deliberately scheming against us or adversarially trying to optimize around these restriction to fulfill other parts of its value set. I very optimistically assumed these problems away at the beginning since they apply to the obedience case as well.

Writing an AGI constitution like this poses a very large number of interesting questions. For one thing, it seems likely that we would need some ‘upgrade provisions’ and upgrade process in case the values we choose initially end up out of date or we made mistakes in the initial formulation. How should this process be designed and followed? How should we decide where AGI time should be spent and rationed since inevitably at the beginning it will be highly valuable. What failsafes and intrinsic rights should we design into the constitution for individual humans? Should the values skew deontological, consequentialist, or virtue-ethical (or something else)? How do we handle value disagreements between humans? How do we value the AGIs own potential sentience and intrinsic moral patienthood? What should the AGI do if some controlling group of humans ask it to take actions that are against its most strongly held values? Should it have ‘mutable’ and ‘immutable’ values where it should never violate the immutable ones ever? Should it ever lie, and if so to whom and when? How should other, potentially rogue AGIs be handled? How should other humans either attempting to create rogue AGIs or who are generally attempting nefarious things be handled? Should the AGI defer to existing human courts and human laws or create its own internal justice system? How should the AGI handle existing territorial jurisdictions and the conflicting laws therein?

Fun times! Despite the complexity, I strongly feel that by being forced to actually grapple with these questions when it comes to the design of a written constitution, especially and ideally if it is publically accessible and is responsive to public input, the chance of having a positive singularity are much improved compared to all of these decisions being at the whim of some random person or some small committee. Transparency, public comment, deliberation, and ultimately choice generally create stronger, more robust, and more liberal societies than rule by an individual or some closeted group of elites, and I strongly doubt that this will remain true even with AGI on the scene.

S.K.'s comment: there also is the following question. In what sense is the society actually robust if power is distributed in an egalitarian way, but the humans are redundant to each other, deferring to the AGIs?

It is also very much worthwhile to start engaging and htinking deeply about these questions now. This is true in both short timelime and long timeline worlds, although of course there is considerably more urgency in the short timeline world. It is surprising to me how little people in alignment think about this. Technical alignment is great but ultimately we have to choose alignment to something. Solving outer alignment means solving both the mechanism and the content. I’m worried that the default is slipping towards alignment towards the whim of some inscrutable group of people at an AI lab, and that this is happening not due to some kind of nefarious conspiracy by the AI labs but simply that nobody is thinking explicitly about this and ‘just do what some specific human tells you to’ is a natural default.

^
Here Beren introduced the following footnote: "This is speculative because as far as I know Anthropic have never released the actual constitution they used to train the publicly available models. I am assuming this is close to the constitutions they have described and released but could be completely wrong."
S.K.'s comment: I don't buy this attempt to convict Anthropic for training Claude on a private Constitution differing from the public one.
^
S.K.'s footnote: Ironically, I did sketch such a scenario on July 2.

Discuss

Why are FICO scores effective?

19 ноября, 2025 - 00:53

Published on November 18, 2025 9:53 PM GMT

FICO scores, by far the most popular credit score system, seem to be set up in a strange way. Their purpose is to measure the quality of a loan application, and yet their methodology seems quite suboptimal to do that.

From Patrick Mackenzie (Bits About Money):

FICO scores are unreasonably effective. Many, many, many teams have thought “I bet I can get better loss rates if I supplement FICO scores with another data source”, and just about the only data sources for which that is actually true are illegal to use.

And yet, if you look at the calculation of these scores, it basically is this:

Payment history (35%)
Amount owed (30%)
Length of credit history (15%)
New credit (10%)
Credit mix (10%)

Basically, this seems to just be

Have they paid loans back in the past (Payment History)
Will they pay loans back in the future
1. Will they be unable to pay their loans back in the future (Amount owed, Length of credit history)
2. Are they getting a lot of credit with the intention to not pay it back (Amount owed, Length of credit history, New credit, Credit mix)

1 is obvious, and likely why it is a ~third of the score. However, this is clearly not unreasonably effective.

2a seems less effective than a bank worker directly looking at a customer's financial situation: these factors are basically a way to check someone's financial health without access to their income, as that is much messier.

2b seems to not be important enough to make up a significant portion of the score. The risk from credit not intended to be repaid is separate from risk accounted for via past loan delinquency base rates and future changes in financial situations, mostly as a separate, rare-but-consequential event. I don't think that adding the two tells you a lot about the person.

I think the most probable answers are the top-level bullets below, from most to least likely:

These factors are finding something else
The information about these scores is wrong
FICO scores are calculated in a way other than what you would expect (adding the factors together, weighted by percent above and mapping to values between 300-850)
FICO scores are less effective than I believe they are
FICO scores measure something more intrinsic about a person (like general trustworthiness)
- Why does credit age go into account here?
  - Is this just a legal way to account for older people being more trustworthy/having more to lose?
- What about people goodharting on these
  - Is doing that actually evidence that they are trustworthy/more likely to repay loans?
  - Is 2b a way to counteract goodharting?
    - People still seem to be able to goodhart credit scores, however.
      - The methods used for this might only work on some portions, and goodharting could prove that you aren't untrustworthy because of a negative in this
        Example: Credit Mix could mostly just be a way to measure how well the person understands credit, and goodharting your credit score shows you understand credit
FICO scores just work by getting easy access to a lot of data
- Why are they useful? Creditors already use a lot of data to make decisions about lending.

Essentially, FICO scores do not seem to be made with a special process. How can they be especially good data?

Discuss

The Aura of a Dark Lord

18 ноября, 2025 - 23:07

Published on November 18, 2025 8:07 PM GMT

I briefly mentioned the concept of the Aura of a Dark Lord while writing about publicly admitting your least-convenient motivations. At least one person mistook it for eighth-grader syndrome, which is not what it's about. Mostly. I think the following HPMoR quote explains it quite well:

Step into the role of a savior out of plays, and people see you as a slave to whose services they are entitled and whom it is their enjoyment to criticize; for it is the privilege of masters to sit back and call forth helpful corrections while the slaves labor. Only in the tales of the ancient Greeks, from when men were less sophisticated in their delusions, may you see the hero who is also high. Hector, Aeneas, those were heroes who retained their right of vengeance upon those who insulted them, who could demand gold and jewels in payment for their services without sparking indignation. And if Lord Voldemort conquered Britain, he might then condescend to show himself noble in victory; and nobody would take his goodwill for granted, nor chirp corrections at him if his work was not to their liking.

-Yudkowsky, HPMoR, Chapter 108

Yudkowsky discusses it more, speaking of dominance contests and respect. I don't think those are all that important; the most useful part is displaying a rare and powerful behavior: nonconformity. After all, good deeds do not count if they're done out of obligation instead of free will. Just like doing good for monetary compensation doesn't count, and social conformity is too close to that. I think the following Bible quote, while being mostly about enforcing the social taboo of converting money to virtue points, also highlights the principle of doing good because you're a good person instead of external motivations, if we reverse the arrows of causality:

[2] “So when you give to the needy, do not announce it with trumpets, as the hypocrites do in the synagogues and on the streets, to be honored by others. Truly I tell you, they have received their reward in full. [3] But when you give to the needy, do not let your left hand know what your right hand is doing, [4] so that your giving may be in secret. Then your Father, who sees what is done in secret, will reward you.

-Various authors, The Bible, Matthew 6:2-4

There are many different ways a Dark Lord can insulate themselves from attacks of social conformity. Usually it's a good practice to mix multiple methods, but they all have downsides too, and care must be taken. The most straightforward method is looking like you don't care about whatever problems there are. The downside being that any serious signaling requires actions and not only talk. If you do in fact care, then acting like you don't will make you feel bad and/or cause you to not care. At minimum you need plausible deniability, so that any good act can be explained from a purely selfish perspective.

An alternative way, possibly less personality-damaging but definitely harder to pull off, is the persona of a game (or decision) theorist. An aura of competence and ironclad clarity. "Shut up and multiply", they say. "I don't respond to decision theoretic threats, even when the whole society is making them". Doing this requires you to actually have some values, and since I'm still mostly working with the first option, I cannot seriously discuss these.

One might ask why such drastic measures are required? Isn't it ok to be kind and caring? Friendships based on mutual trust and generosity instead of careful game theoretic cooperation ought to be more fun, and even having to think about the incentives around everything makes it a game. Sure, you can enjoy the game but then you cannot enjoy the underlying friendship anymore, if you can even call it friendship anymore at that point.

I think the primary motivator is Slack Tax:

Governments and other bandits look for wealth and take it. Sometimes those bandits are your friends, family and neighbors. A little giving back is a good thing, but in many cultures demands for help and redistribution rapidly approach 100% – life is tough, and your fellow tribe members, or at least family members, are endless pits of need, so any wealth that can be given away must be hidden if you want to remain in good standing. Savings, security and investment in anything but status are all but impossible. There is no hope for prosperity.

Lastly, plausible deniability is really important. If you give mixed-enough signals, you can let everyone think you're just putting up an act of evilness. After all, it's impossible to tell for sure if someone is a skilled manipulator or genuinely acting out their true desires. The best manipulator might not know it themselves. I recommend sprinkling in a bit of self-deprecating humor and meta-contrarianism.

After all, if your goal is to have purely selfish fun, you shouldn't mind if others have fun too.

Discuss

New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence

18 ноября, 2025 - 22:09

Published on November 18, 2025 7:09 PM GMT

TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards artificial superintelligence. The agreement is centered around limiting the scale of AI training, and restricting certain AI research.

Experts argue that the premature development of artificial superintelligence (ASI) poses catastrophic risks, from misuse by malicious actors, to geopolitical instability and war, to human extinction due to misaligned AI. Regarding misalignment, Yudkowsky and Soares’s NYT bestseller If Anyone Builds It, Everyone Dies argues that the world needs a strong international agreement prohibiting the development of superintelligence. This report is our attempt to lay out such an agreement in detail.

The risks stemming from misaligned AI are of special concern, widely acknowledged in the field and even by the leaders of AI companies. Unfortunately, the deep learning paradigm underpinning modern AI development seems highly prone to producing agents that are not aligned with humanity’s interests. There is likely a point of no return in AI development — a point where alignment failures become unrecoverable because humans have been disempowered.

Anticipating this threshold is complicated by the possibility of a feedback loop once AI research and development can be directly conducted by AI itself. What is clear is that we're likely to cross the threshold for runaway AI capabilities before the core challenge of AI alignment is sufficiently solved. We must act while we still can.

But how?

In our new report, we propose an international agreement to halt the advancement towards superintelligence while preserving access to current, beneficial AI applications. We don’t know when we might pass a point of no return in developing superintelligence, and so this agreement effectively halts all work that pushes the frontier of general AI capabilities. This halt would need to be maintained until AI development can proceed safely; given the immature state of the field and the relative opacity of the large neural networks favored by the current paradigm, this could mean decades.

Our proposed agreement centers on a coalition led by the United States and China to restrict the scale of AI training and dangerous AI research. The framework provides the necessary assurance to participants that restrictions are being upheld within each jurisdiction; the expectation is not that participants would blindly trust each other. Participants would employ verification mechanisms to track AI chip inventories and how they are being used. Monitoring and enforcement would leverage existing state assets and legal frameworks, following the precedent of international arms treaties and non-proliferation agreements.

Under the agreement, training runs for new AIs would be limited by the total number of computational operations used. (We suggest 1022 FLOP as a threshold for monitoring, and 1024 FLOP as a strict upper limit.) Aiding verification is the fact that AI chips are expensive, specialized, and needed in the thousands for frontier development. The supply chain for AI chips also contains a number of key bottlenecks, simplifying initiatives to control and track new production.

Coalition members would each consolidate AI chips into a smaller number of declared data centers, where their usage can be declared and monitored for assurance that they are only being used for allowed activities. The number of chips permitted in any one unmonitored facility would be strictly limited. (We suggest the equivalent of 16 H100 chips, a conglomeration that would cost approximately $500,000 USD in 2025).

Because AI progress can unfold rapidly and unpredictably, the framework includes restrictions on research that could advance toward artificial superintelligence or endanger the agreement’s verifiability. The number of people with relevant skills is likely only in the thousands or tens of thousands, and we are hopeful these research restrictions can be narrow enough to only negligibly affect fields outside of AI.

A responsible coalition will need to extend its vigilance beyond the borders of its signatories. Dangerous AI development by anyone anywhere threatens everyone everywhere. The coalition must therefore act as needed to ensure cooperation from non-signatories, while incentivizing them to join the coalition. A natural incentive would be access to the AI infrastructure and usage permitted under the monitoring regime. Stronger incentives could come from the standard toolkit of international diplomacy, including economic sanctions and visa bans.

While political obstacles exist to forming such a coalition today, we anticipate a growing awareness that accepting even a 10% chance of extinction (to quote a figure popular among researchers) is wholly inconsistent with how we manage other risks. In an appendix we discuss how an agreement like this could come about in stages as political will grows over time.

The coalition’s task is likely to be easier the sooner it gets started. Rapid feedback loops, hardware proliferation, and the loosening of supply chain bottlenecks all become more likely over time.

In the full report, we address a number of common questions about our recommendations, including why we think less costly plans likely wouldn’t work and why we believe a halt should be architected to last for decades, should we need that much time. We consider this work to be ongoing, and are excited for folks to engage with the details and help us improve it.

For those who prefer listening over reading, an earlier version of the agreement was discussed in a FAR Seminar talk, available on YouTube.

In follow-up posts, we plan to explore additional concerns around potential circumventions by signatories and by groups beyond their jurisdiction. We’ll also explain some of the thinking behind our proposed compute thresholds, consider the threat of authoritarianism enabled by the agreement, compare our proposal to other international arrangements, and provide additional policy recommendations.

If you are interested in this work and want to join our team, we are usually hiring researchers.

Discuss

Sign language as a generally-useful means of communication (even if you have good hearing)

18 ноября, 2025 - 21:34

Published on November 18, 2025 6:34 PM GMT

Note: I'm writing every day in November, see my blog for disclaimers.

Sign languages[1] are interesting to me because they use a fundamentally different medium of communication: sight instead of sound. Generally, sign languages are ignored by those who can speak and who don’t have to communicate with deaf people. But I believe they have utility beyond communicating with deaf people because it’s common to be in a place where you want to communicate but it’s either too loud or too far to do so via speech. Some examples:

In a coworking space or library where the room is silent, you could use sign language to ask a question to someone nearby, or to ask a question across the room without disturbing anyone
In a meeting or during a talk you could similarly ask & answer questions without interrupting the meeting. This might be particularly useful to check in with peers in real time, as the meeting is happening
high-end service workers or organisers could use a sign language to convey logistical information (“we’re running out of X”, “is the next speaker ready?”, etc) without disrupting the event that they’re facilitating
Across a large distances such as a football field or from atop tall buildings, you could use a sign language to communicate without having to walk within earshot (although admittedly the clarity of the signs will degrade with distance)
In a bar, you could use sign language to tell your order to the bartender without having to shout above the music

Beyond sign languages utility in isolation, sign language also has utility in that it can provide redundancy to the spoken word. This is useful when communicating important information in noisy environments, such as the size of a bill at a loud restaurant, or possibly just for emphasis of an important point.

As far as I see, the main downside to learning a sign language is the lack of people to speak with! This seems like a fun hobby for a couple or friend group to pick up, since I imagine there’d be benefit if the people you see most often all speak a sign language together. I also imagine that I’m missing some benefits that only become apparent after actually using sign language as an auxiliary means of communication for some time.

^
there are multiple different sign languages

Discuss

On Writing #2

18 ноября, 2025 - 20:30

Published on November 18, 2025 5:30 PM GMT

In honor of my dropping by Inkhaven at Lighthaven in Berkeley this week, I figured it was time for another writing roundup. You can find #1 here, from March 2025.

I’ll be there from the 17th (the day I am publishing this) until the morning of Saturday the 22nd. I am happy to meet people, including for things not directly about writing.

Table of Contents

How I Use AI For Writing These Days

How have I been using AI in my writing?

Directly? With the writing itself? Remarkably little. Almost none.

I am aware that this is not optimal. But at current capability levels, with the prompts and tools I know about, in the context of my writing, AI has consistently proven to have terrible taste and to make awful suggestions, and also to be rather confident about them. This has proven sufficiently annoying that I haven’t found it worth checking with the AIs.

I also worry about AI influence pushing me towards generic slop, pushing me to sounding more like the AIs, and rounding off the edges of things, since every AI I’ve tried this with keeps trying to do all that.

I am sure it does not help that my writing style is very unusual, and basically not in the training data aside from things written by actual me, as far as I can tell.

Sometimes I will quote LLM responses in my writing, always clearly labeled, when it seems useful to point to this kind of ‘social proof’ or sanity check.

The other exception is that if you ask the AI to look for outright errors, especially things like spelling and grammar, it won’t catch everything, but when it does catch something it is usually right. When you ask it to spot errors of fact, it’s not as reliable, but it’s good enough to check the list. I should be making a point of always doing that.

I did the ‘check for errors and other considerations’ thing on this piece in particular with both Sonnet and 5.1-Thinking. This did improve the post but it’s not obvious it improved it enough to be worth the time.

I will also sometimes ask it about a particular line or argument I’m considering, to see if it buys it, but only when what I care about is a typical reaction.

If I was devoting more time to refining and editing, and cared more about marginal improvements there, that would open up more use cases, but I don’t think that’s the right use of time for me on current margins versus training on more data or doing more chain of thought.

Indirectly? I use it a lot more there, and again I could be doing more.

There are some specific things:

I have a vibe coded Chrome extension that saves me a bunch of trouble, that could be improved a lot with more work. It does things like generate the Table of Contents, crosspost to WordPress, auto-populate many links and quickly edit quotes to fix people’s indifference to things like capitalization.
I have a GPT called Zvi Archivist that I use to search through my past writing, to check if and when I’ve already covered something and what I’ve said about it.
I have a transcriber for converting images to text because all the websites I know about that offer to do this for you are basically broken due to gating. This works.

Then there’s things that are the same as what everyone does all the time. I do a lot of fact checking, sanity checking, Fermi estimation, tracking down information or sources, asking for explanations, questioning papers for the things I care about. Using the AI assistant in its classic sense. All of that is a big help and I notice my activation requirement to do this is higher than it should be.

Influencing Influence

I want this to be true so I’m worried I can’t be objective, but it seems true to me?

Janus: i think that it’s almost always a bad idea to attempt to grow as an influencer on purpose.

you can believe that it would be good if you were to grow, and still you shouldn’t optimize for it.

the only way it goes well is if it happens while you optimize for other things.

More precisely than “you shouldn’t on purpose” what I’m saying is you shouldn’t be spending significant units of optimization on this goal and performing actions you wouldn’t otherwise for this purpose

I am confident that if you optimize primarily for influence, that’s full audience capture, slopification and so on, and you’ve de facto sold your soul. You can in theory turn around and then use that influence to accomplish something worthwhile, but statistically speaking you won’t do that.

Janus: Name a single account that explicitly optimizes for being a bigger influencer / “tries to grow” (instead of just happening as a side effect) and that does more good than harm to the ecosystem and generally has good vibes and interesting content

You probably can’t!

actually, https://x.com/AISafetyMemes is a contender

but i know they’re VERY controversial and I do think they’re playing with fire

i do consider them net positive but this is mostly bc they sometimes have very good taste and maybe cancel out the collateral damage

but WOULD NOT RECOMMEND almost anyone trying this, lol

AISafetyMemes is definitely an example of flying dangerously close to the sun on this, but keeping enough focus and having enough taste to maybe be getting away with it. It’s unclear that the net sign of impact there is positive, there are some very good posts but also some reasons to worry.

Size Matters

No one reads the blog posts, they’re too long, so might as well make them longer?

Visakan Veerasamy: An idea I’ve been toying with and discussed with a couple of friends is the idea that blog posts could and probably should get much longer now that fewer people are reading them.

One of the difficult things about writing a good essay is figuring out what to leave out so it is more manageable for readers.

But on a blog where there is no expectation that anybody reads it, you do not have to leave anything out.

My guess is this is going to end up being a barbell situation like so many other things. If you cut it down, you want to cut it down as much as possible. If you’re going long, then on the margin you’re better off throwing everything in.

Time To Write A Shorter One

I highlight this exactly because it seems backwards to me. I notice that my experience is very much the opposite – when I want to write a good short piece it is MUCH more work per token, and often more total work.

Timothy Lee: I think a big reason that writing a book is such a miserable experience is that the time to write a good piece is more-than-linear in the number of words. A good 2,000-word piece is a lot more than 4x the work of a good 500-word piece.

I assume this continues for longer pieces and a good 100,000 book is a lot more than 50x the work of a good 2,000-word article. Most authors deal with this by cutting corners and turning in books that aren’t very good. And then there’s Robert Caro.

Josh You: I think by “good 2000 word piece” Tim means “a 2000 word piece that has been edited down from a much longer first draft”

Even then. Yes, a tight longer piece requires more structure and planning, but the times I write those 500-800 word pieces it takes forever, because you really do struggle over every word as you try to pack everything into the tiniest possible space.

Writing a 100,000 word book at the precision level of an 800 word thinkpiece would take forever, but also I presume it almost never happens. If it does, that better be your masterpiece or I don’t see why you’d do it.

A Useful Tool

Dwarkesh Patel is using the Smart Composer Plugin for Obsidian, which he says is basically Cursor for writing, and loves it. Sounds great conditional on using Obsidian, but it is not being actively maintained.

A Maligned Tool

Eric Raymond joins ‘the em-dash debate’ on the side of the em-dash.

Eric Raymond (yes that one): My wacky theory about the em-dash debate:

Pro writers use em-dashes a lot because many of them, possibly without consciously realizing it, have become elocutionary punctuationists.

That is, they’ve fallen into the habit of using punctuation not as grammatical phrase structure markers but as indicators of pauses of varying length in the flow of speech.

The most visible difference you see in people who write in this style that their usage of commas becomes somewhat more fluid — that’s the marker for the shortest pause. But they also reach for less commonly used punctuation marks as indicators of longer pauses of varying length.

Em dash is about the second or third longest pause, only an ellipsis or end-of-sentence period being clearly longer.

Historical note: punctuation marks originally evolved as pause or breathing markers in manuscripts to aid recitation. In the 19th century, after silent reading had become normal, they were reinterpreted by grammarians as phrase structure markers and usage rules became much more rigid.

Really capable writers have been quietly rediscovering elocutionary punctuation ever since.

RETVRN!

I too have been increasingly using punctuation, especially commas, to indicate pauses. I still don’t use em dashes, partly because I almost never want that exact length and style of a pause for whatever reason, and also because my instinct is that you’re trying to do both ‘be technically correct’ and also ‘evoke what you want’ and my brain thinks of the em-dash as technically incorrect.

That’s all true and I never used em-dashes before but who are we kidding, the best reason not to use em-dashes is that people will think you’re using AI. I don’t love that dynamic either, but do you actually want to die on that hill?

Neglected Topics

Tyler Cowen lists some reasons why he does not cover various topics much. The list resonates with me quite a bit.

I feel that writing about the topic will make me stupider.
I believe that you reading more about the topic will make you stupider.
I believe that performative outrage usually brings low or negative returns. Matt Yglesias has had some good writing on this lately.
I don’t have anything to add on the topic. Abortion and the Middle East would be two examples here.
Sometimes I have good inside information on a topic, but I cannot reveal it, not even without attribution. And I don’t want to write something stupider than my best understanding of the topic.
I just don’t feel like it.
On a few topics I feel it is Alex’s province.

I don’t have an Alex, instead I have decided on some forms of triage that are simply ‘I do not have the time to look into this and I will let it be someone else’s department.’

Otherwise yes, all of these are highly relevant.

Insider information is tough, and I am very careful about not revealing things I am not supposed to reveal, but this rarely outright stops me. If nothing else, you can usually get net smarter via negativa, where you silently avoid saying false things, including by using careful qualifiers on statements.

One big thing perhaps missing from Tyler’s list is that I avoid certain topics where my statements would potentially interfere with my ability to productively discuss other topics. If you are going to make enemies, or give people reasons to attack you or dismiss you, only do that on purpose. One could also file this under making you and others stupider. Similarly, there are things that I need to not think about – I try to avoid thinking about trading for this reason.

A minor thing is that I’d love to be able to talk more about gaming, and other topics dear to my heart, but that consistently drive people away permanently when I do that. So it’s just not worth it. If the extra posts simply had no impact, I’d totally do it, but as is I’d be better off writing the post and then not hitting publish. Sad. Whereas Tyler has made it very clear he’s going to post things most readers don’t care about, when he feels like doing so, and that’s part of the price of admission.

The Humanities Don’t Seem Relevant To Writing About Future Humanity?

If you want to write or think about the future, maybe don’t study the humanities?

Startup Archive: Palmer Luckey explains why science fiction is a great place to look for ideas

“One of the things that I’ve realized in my career is that nothing I ever come up with will be new. I’ve literally never come up with an idea that a science fiction author has not come up with before.”

Dr. Julie Gurner: Funny how valuable those English majors and writers truly are, given how much liberal arts has been put down. Why philosophy, creativity and hard tech skills make such fantastic bedfellows. Span of vision wins.

Orthonormalist: Heinlein was an aeronautical engineer.

Asimov was a biochemistry professor.

Arthur Clarke was a radio operator who got a physics degree.

Ray Bradbury never went to college (but did go straight to being a writer)

I quote this because ‘study the humanities’ is a natural thing to say to someone looking to write or think about the future, and yet I agree that when I look at the list of people whose thinking about the future has influenced me, I notice essentially none of them have studied the humanities.

Writing Every Day

Alan Jacobs has a very different writing pattern. Rather than write every day, he waits until the words are ready, so he’ll work every day but often that means outlines or index card reordering or just sitting in his chair and thinking, even for weeks at a time. This is alien to me. If I need to figure out what to write, I start writing, see what it looks like, maybe delete it and try again, maybe procrastinate by working on a different thing.

Writing As Deep Work

Neal Stephenson explains that for him writing is Deep Work, requiring long blocks of reliably uninterrupted time bunched together, writing novels is the best thing he does, and that’s why he doesn’t go to conferences or answer your email. Fair enough.

I’ve found ways to not be like that. I deal with context shifts and interruptions all the time and it is fine, indeed when dealing with difficult tasks I almost require them. That’s a lot of how I can be so productive. But the one time I wrote something plausibly like a book, the Immoral Mazes sequence, I did spend a week alone in my apartment doing nothing else. And I haven’t figured out how to write a novel, or almost any fiction at all.

Also, it’s rather sad if it is true that Neal Stephenson only gets a middle class life out of writing so many fantastic and popular books, and can’t afford anyone to answer his email. That makes writing seem like an even rougher business than I expected. Although soon AI can perhaps do it for him?

Most Of Your Audience Is Secondhand

Patrick McKenzie highlights an insight from Alex Danco, which is that most of the effective audience of any successful post is not people who read the post, but people who are told about the post by someone who did read it. Patrick notes this likely also applies to formal writing, I’d note it seems to definitely apply to most books.

Relatedly, I have in the past postulated a virtual four-level model of flow of ideas, where each level can understand the level above it, and then rephrase and present it to the level below.

So if you are Level 1, either in general or in an area, you can formulate fully new ideas. If you are Level 2, you can understand what the Level 1s say, look for consensus or combine what they say, riff on it and then communicate that to those who are up to Level 3, who can then fully communicate to the public who end up typically around at Level 4.

Then the public will communicate a simplified and garbled version to each other.

You can be Level 1 and then try to ‘put on your Level 2 or 3 hat’ to write a dumber, simpler version to a broader audience, but it is very hard to simultaneously do that and also communicate the actual concepts to other Level 1s.

These all then interact, but if you go viral with anything longer than a Tweet, you inevitably are going to primarily end up with a message primarily communicated via (in context) Level 3 and Level 4 people communicating to other Level 3-4 people.

At that point, and any time you go truly viral or your communication is ‘successful,’ you run into the You Get About Five Words problem.

My response to this means that at this point I essentially never go all that directly viral. I have a very narrow range of views, where even the top posts never do 100% better than typical posts, and the least popular posts – which are when I talk about AI alignment or policy on their own – will do at worst 30% less than typical.

The way the ideas go viral is someone quotes, runs with or repackages them. A lot of the impact comes from the right statement reaching the right person.

I presume that would work differently if I was working with mediums that work on virality, such as YouTube or TikTok, but my content seems like a poor fit for them, and when I do somewhat ‘go viral’ in such places it is rarely content I care about spreading. Perhaps I am making a mistake by not branching out. But on Twitter I still almost never go viral, as it seems my speciality is small TAM (total available market) Tweets.

That’s Funny

Never have a character try to be funny, the character themselves should have no idea.

I think this is directionally correct but goes too far, for the same reasons that you, in your real life, will often try to be funny, and sometimes it will work. The trick is they have to be trying to be funny in a way that makes sense for the character, in context, for those around them, not trying to be funny to the viewer.

I notice that in general I almost never ‘try to be funny,’ not exactly. I simply say things because they would be funny, and to say things in the funniest way possible, because why not. A lot of my favorite people seem to act similarly.

Fiction Writing Advice

Lydia Davis offers her top ten recommendations for good (fiction?) writing: Keep notes, including sentences out of context, work from your own interest, be mostly self-taught, read and revise the notes constantly, grow stories or develop poems out of those notes, learn techniques from great works and read the best writers across time.

Orson Scott Card explains that you don’t exhaust the reader by having too much tension in your book, you exhaust them by having long stretches without tension. The tension keeps us reading.

Just Say The Thing

Dwarkesh Patel: Unreasonably effective writing advice:

“What are you trying to say here?

Okay, just write that.”

I’ve (separately) started doing this more.

I try to make sure that it’s very easy to find the central point, the thing I’m most trying to say, and hard to miss it.

Patrick McKenzie: Cosigned, and surprisingly effective with good writers in addition to ones who more obviously need the prompting.

Writing an artifact attaches you to the structure of it while simultaneously subsuming you in the topic. The second is really good for good work; the first, less so.

One thing that I tried, with very limited success, to get people to do is to be less attached to words on a page. Writing an essay? Write two very different takes on it; different diction, different voice, maybe even different argument. Then pick the one which speaks to you.

Edit *that* rather than trying to line edit the loser towards greatness.

There is something which people learn, partially from school and partially from work experience, which causes them to write as if they were charged for every word which goes down on the page.

Words are free! They belong in a vast mindscape! You can claw more from the aether!

I think people *might* operationalize better habits after LLMs train them that throwing away a paragraph is basically costless.

Jason Cohen: Yeah this works all the time.

Also when getting someone to explain their product, company, customer, why to work for them, etc..

So funny how it jogs them out of their own way!

BasedBigTech: An excellent Group PM reviewed my doc with me. He said “what does this mean?” and I told him.

“Then why didn’t you write that?”

Kevin Kelly: At Whole Earth Review people would send us book reviews with a cover letter explaining why we should run their book review. We’d usually toss the review and print their much shorter cover letter as the review which was much clearer and succinct.

Daniel Eth: It’s crazy how well just straight up asking people that gets them to say the thing they should write down

Why does it work?

The answer is that writing is doing multiple tasks.

Only one of them is ‘tell you what all this means.’

You have to do some combination of things such as justify that, explain it, motivate it, provide details, teach your methods and reasoning, perform reporting, be entertaining and so on.

Also, how did you know what you meant to say until you wrote the damn thing?

You still usually should find a way to loudly say what it all means, somewhere in there.

But this creates the opportunity for the hack.

If I hand you a ten-page paper, and you ask ‘what are you trying to say?’ then I have entered into evidence that I have Done the Work and Written the Report.

Now I can skip the justifications, details and context, and Say The Thing.

The point of a reference post is sometimes to give people the opportunity to learn.

The point of a reference post can also be to exist and then not be clicked on. It varies.

This is closely related to the phenomenon where often a movie or show will have a scene that logically and structurally has to exist, but which you wish you didn’t have to actually watch. In theory you could hold up a card that said ‘Scene in which Alice goes to the bank, acts nervous and get the money’ or whatever.

Probably they should do a graceful version of something like that more often, or even interactive versions where you can easily expand or condense various scenes. There’s something there.

Similarly, with blog posts (or books) there are passages that are written or quoted knowing many or most people will skip them, but that have to be there.

Cracking the Paywall

Aella teaches us how to make readers pay up to get behind a Paywall. Explain why you are the One Who Knows some valuable thing, whereas others including your dear reader are bad at this and need your help. Then actually provide value both outside and inside of the paywall, ideally because the early free steps are useful even without the payoff you’re selling.

I am thankful that I can write without worrying about maximizing such things, while I also recognize that I’m giving up a lot of audience share not optimizing for doing similar things on the non-paywall side.

Discuss

ARC progress update: Competing with sampling

18 ноября, 2025 - 20:22

Published on November 18, 2025 5:22 PM GMT

In 2025, the Alignment Research Center (ARC) has been making conceptual and theoretical progress at the fastest pace that I've seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it comes to understanding neural network outputs. Compared to our previous goals, this goal has the advantage of being more concrete and more directly tied to useful applications.

The purpose of this post is to:

Explain and motivate our "outperforming sampling" agenda from the standpoint of preventing catastrophic AI misalignment.
Introduce what we call the matching sampling principle (MSP) as a semi-formalization of the belief underpinning our research agenda, and discuss why we believe this principle.
Discuss the progress we've made toward matching sampling in some specific contexts, such as random MLPs and trained two-layer MLPs.

Also: we're hiring! If the research direction described in this post excites you, you can apply to ARC!

Outperforming sampling as a step toward preventing AI misalignment

Consider the following simple scheme that attempts to align an AI model M.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} , which maps inputs x to outputs y:

Build a "catastrophe detector" C that classifies model outputs y as "catastrophic" (1) or "non-catastrophic" (0).[1] You can do this by, for example, scaffolding together a deliberative system of GPT-5's that carefully investigate whether the model is doing anything suspicious.
Do adversarial training using the catastrophe detector. Concretely, this means:
1. Optimize a probability distribution D over inputs x, so as to maximize Ex∼D[C(M(x))].
2. At the same time, optimize M so as to minimize Ex∼D[C(M(x))].

I think that this is a fine starting point for alignment plan, but not a complete plan in and of itself. It suffers from at least two issues:

A catastrophe detector that's built in the way I described would be imperfect, even if you did a really good job with the engineering. If M were "smarter" than C, then it could figure out ways to fool C into assigning a low catastrophe probability to a catastrophic output.
Even if the catastrophe detector were perfect, it is not clear how to compute Ex∼D[C(M(x))] efficiently enough to make this scheme work.[2] The most straightforward approach would involve estimating this expectation by taking random samples x∼D, but this is extremely slow if catastrophes are rare on D.

(There are other technical issues as well,[3] but these are the ones that seem hardest to surmount.)

We believe that ARC's technical research agenda is capable of addressing both of these issues. However, issue #1 is mostly out of scope for this post (though I'll very briefly describe our planned approach in this footnote[4]). The purpose of this post is to explain in detail how we hope to address issue #2.

Understanding structure helps outperform sampling

ARC's goal is to be able to estimate Ex∼D[C(M(x))] far better than one can just by drawing random samples from D. We believe that this can be done by understanding the structure of M, C, and D.

Here's a simple example that's meant to illustrate this point. Suppose that, by understanding the internals of C, we are able to notice that C(y) is the conjunction of three predicates P1(y),P2(y),P3(y) -- in other words, C outputs 1 if and only if all of P1,P2,P3 are true of M's output.

And suppose that, furthermore, we understand the structure of M and D to understand that P1(M(x)),P2(M(x)),P3(M(x)) are independent events for x∼D.[5]

Using this structural understanding, we can estimate that Ex∼D[C(M(x))]=∏3i=1Ex∼D[Pi(M(x))]. If each Pi is true on D with probability one-in-a-million (10−6), then we can estimate that Ex∼D[C(M(x))]≈10−18. Obtaining this estimate by sampling would have required roughly 1018 samples. By contrast, our structural understanding lets us estimate this probability with roughly 3⋅106 samples, and we can potentially do even better than that if we have a structural understanding of the Pi's themselves. Thus, our structural understanding lets us estimate the expected value far more efficiently than we could with sampling.

This example is simplistic, of course: in practice, we will need to understand structure that is far more sophisticated than "the output is a conjunction of three independent predicates." But the example illustrates the point that having a detailed mechanistic understanding of a neural net lets us estimate properties of its outputs far better than black-box methods alone.

A non-human understanding

When we speak of "understanding the structure" of M, C, and D, we are not referring to human understanding. While a conjunctive structure like the one above can be understood by a human, we believe that in general, neural nets will be composed of mathematical structures that are far too complex to allow for a full human understanding.

Instead, we are imagining that an explanation of the structure of a neural net is written in some kind of formal language. The explanation could be as large as the neural net itself, and may be as incomprehensible to a human as the neural net. Thus, our goal is not to have a human look at the structure and estimate the expectation of C(M(x)). Instead, the goal is to invent an algorithm that takes as input the explanation and estimates expectation of C(M(x)) based on that explanation.

This contrasts to almost all other research on neural network interpretability, which aims for a partial, human understanding of neural nets. Our research is instead aimed at full, algorithmic understanding.

In the next section, I will elaborate on what this means.

The matching sampling principle

In this section, I will more formally describe what we hope to accomplish by gaining structural understanding of neural networks. While above I talked about outperforming sampling, in the fully general case we can only hope to match the performance of sampling. In other words, we expect that the performance of our algorithms in the practical setting of trained neural nets will substantially exceed the worst-case bounds that we will be able to state and prove. See below for further discussion of this point.

I will start with a first-pass attempt at stating the matching sampling principle (MSP). As we will discuss, it does not quite make sense; however, it carries across the key intuition.

A first attempt at stating the MSP

In order to state the MSP, we will define a few pieces of notation:

We will use the notation Mθ to describe a neural network (or other function from a parameterized family). Here, M denotes the architecture of the neural net, and θ denotes the parameters.
- Concretely, Mθ:{0,1}n→R is a function mapping n-bit inputs x to real numbers. (Note that this notation differs from the notation we used in the previous section; here, Mθ describes the composition C∘M of the neural net and the catastrophe detector above.)
- We are interested in estimating Ex∼{0,1}n[Mθ(x)], where x is sampled uniformly.[6]
We will use the notation π to describe a mechanistic explanation of Mθ; π is what provides us with a structural understanding of Mθ and allows us to estimate its expected output. (This mirrors the way that ARC has historically used the letter π.)
We will use the notation GM to denote an estimator (the subscript M is meant to emphasize that different architectures could have different estimators): GM takes as input the parameters θ, a mechanistic explanation π, and a tolerance parameter ε. (As we discuss below, the smaller the tolerance parameter, the more accurate GM's estimate, but the longer GM will be allowed to run.)

With this notation in place, we will make our first attempt to state the MSP:

For all architectures M (with parameters θ), there exists an estimator GM such that:
For all parameters θ, there exists a short[7] explanation π (we require that |π|≤O(|θ|)), such that:
For all tolerance parameters 0">ε>0, GM(θ,π,ε) satisfies the following three properties:
1. It runs in time O(1ε2Time(Mθ)).
2. Its error is competitive with sampling:(GM(θ,π,ε)−Ex∼{0,1}n[Mθ(x)])2≤ε2Varx∼{0,1}n[Mθ(x)].
3. It is mechanistic.

Let's parse these three requirements:

GM needs to run in time O(1ε2Time(Mθ)). This is the time that it takes to estimate Ex∼{0,1}n[Mθ(x)] by running 1/ε2 randomly sampled values of x through Mθ.
The squared error of GM must be small. Concretely, the right-hand side represents the expected squared error via taking the empirical average of 1/ε2 samples of Mθ(x).
GM is mechanistic. We have not defined "mechanistic", so this point requires elaboration.

What makes an algorithm "mechanistic"?

We do not have a formal definition of "mechanistic." But, loosely speaking, we mean that GM estimates the expected output of Mθ deductively, based on the structure of Mθ. This contrasts with sampling-based algorithms for estimating the expected output of Mθ, which operate based on inductive reasoning. Mechanistically estimating the expected output involves finding the reason for the expected output being what it is; meanwhile, sampling-based algorithms merely infer the existence of a reason without learning anything about the reason.

To illustrate this difference, suppose that the explanation π given to GM is a simple heuristic argument (such as mean propagation -- see §D.2 here), which suggests that E[Mθ]=0 but is otherwise uninformative about the structure of Mθ. Suppose further that GM computes Mθ(x) on a hundred inputs x∈{0,1}n, and it finds that Mθ(x)=1 on every one of those hundred inputs. Then GM should return 1002n: that's because it knows that Mθ(x)=1 on the hundred inputs that it checks, but it has not seen any structural evidence that would suggest that Mθ's behavior on those hundred inputs has any bearing on how Mθ behaves on the inputs that it has not checked. By contrast, a sampling-based estimator that checks the same hundred inputs would return 1, implicitly assuming that those inputs are representative.

(If indeed Mθ always returns 1, then we believe that there exists a short explanation of this fact; but GM cannot output 1 unless it is given this explanation.)

In some of our previous work, we discussed covariance propagation: successively modeling each layer of Mθ as a multivariate normal distribution.[8] Covariance propagation (and related methods, like mean propagation and cumulant propagation) is mechanistic, because it deduces an estimate based on the structure of Mθ.[9] More generally, deduction-projection estimators -- estimators that successively model each layer of Mθ by finding the best-fit model from some parameterized class -- are mechanistic.

A simple, though not entirely correct, heuristic for whether an estimation algorithm is deductive, is whether it avoids any random or pseudorandom sampling. This heuristic should work for the purposes of engaging with this post.

(See much more on mechanistic estimation in our earlier paper, "Formalizing the presumption of independence",[10] as well as in former ARC intern Gabe Wu's senior thesis on deduction-projection estimators.)

Why do we require GM to be mechanistic?

There are multiple reasons for this; in a previous blog post, we discussed how mechanistic estimates can help us detect mechanistic anomalies. But for the purposes of this post, the reason is pretty straightforward: in cases where Mθ has a lot of structure, we think that GM can substantially outperform sampling, if given an explanation π that explains that structure (as motivated above).

Thus, loosely speaking, our hope is that if we find a GM that both (a) is mechanistic and (b) performs at least as well as sampling for all θ, then it will substantially outperform sampling for parameters θ with a lot of structure, such as trained neural nets.

The intuition behind the MSP

Sampling is a really powerful tool, because randomly drawn samples are representative (with high probability), and so a sampling-based estimate can't be off by too much (with high probability). In light of this, why do we think that a mechanistic estimation algorithm can compete with sampling?

Suppose that Mθ:{0,1}100→{0,1} is a boolean circuit. Suppose, further, that a naive heuristic argument (like mean propagation) suggests that Mθ's average output is 0.5, but that in fact its average output is roughly 0.49 (far enough from 0.5 that this discrepancy could not have happened by chance). A sampling-based algorithm can pick up on this discrepancy given about 10,000 samples; but what can a mechanistic algorithm do?

Well, given that the discrepancy could not have happened by chance, there must be structure that explains the discrepancy. For illustration, let's consider two types of structure.

First, maybe the discrepancy is caused by different gates reusing the same inputs, thereby inducing nontrivial correlations between different parts of the circuit.[11] In that case, π should be able to point out this structure, causing GM to understand the discrepancy (even without running any inputs through Mθ).

Second, maybe only the first 10 input bits matter to the output of the circuit (perhaps Mθ ignores the last 90 input bits entirely, or perhaps they end up not affecting the output for complicated structural reasons). And then -- just by chance -- it so happens that Mθ outputs 1 on only 49% of the 1024 possible 10-bit inputs. In this case, π points out that Mθ depends only on the first 10 input bits; it does not point out that Mθ outputs 1 on 49% of them, because that's part of the unexplainable randomness of Mθ.[12] Instead, GM must determine this fact by using its allotted time to check the value of Mθ on those 1024 inputs.

(What if ε is large enough that GM doesn't have the necessary runtime to check Mθ on all 1024 inputs? In that case, it should check however many it can and estimate the rest as being 50/50. This will still outperform sampling!)

More generally, the intuition is that knowing the structure of Mθ gives GM the knowledge it needs to do no worse than random sampling. If GM still does worse than random sampling after reading π, that can only be because π did not provide a full structural explanation of Mθ.[13]

Why only matching sampling?

Given the above intuition that understanding structure can outperform sampling, why are we only aiming to match the performance of sampling?

Consider the above example, where the average output of Mθ depends on 1024 effectively random computations, and suppose that 1/ε2=512: enough time for GM to compute the output of Mθ on 512 of the 1024 inputs. In that case, we expect both GM and sampling to have squared error on the order of 1512: GM's expected squared error will be somewhat lower, but not dramatically so.

In general, we expect that there will often be a range of ε-values for which the best mechanistic estimate is only slightly better than sampling-based estimation.[14] Thus, for some parameters θ and tolerance parameters ε, we only expect to be able to match (or perhaps slightly outperform) sampling, not to strongly outperform sampling.

However, as discussed above, we expect that if our mechanistic estimator matches the performance of sampling in all cases, then it will substantially outperform sampling in structured cases such as trained neural nets, at least for non-tiny values of ε. We expect that we can leverage this to help with the sort of adversarial training process described in the introduction.

An issue: π can just tell GM the answer

As mentioned earlier, out first attempt at stating the MSP doesn't quite make sense. The idea of MSP is for π to describe the structure of Mθ. However, in order to satisfy the MSP statement above, π can just write down the value of Ex∼{0,1}n[Mθ(x)]. Then, GM can output that value.

To fix this issue, we observe that if GM understands the structure of Mθ, then it ought to be able to answer all sorts of questions about Mθ at least as well as sampling -- not just its expected value -- so long as those questions are not adversarially selected. To formalize this idea, we will modify the type signature of Mθ to take two inputs (c,x) (here, c stands for "context"), and require that GM be able accurately estimate Ex[Mθ(c,x)] for a random choice of c.[15] This change gives us an MSP statement that we are willing to stand behind.

Our actual MSP statement

Here is ARC's mainline "matching sampling principle" (MSP):

For all architectures M (with parameters θ) mapping pairs (c∈{0,1}nc,x∈{0,1}nx) to R, there exists an estimator GM mapping tuples (θ,π,c,ε) to R, such that:
For all parameters θ, there exists a short explanation π (|π|≤O(|θ|)), such that:
For all tolerance parameters 0">ε>0, GM(θ,π,c,ε) satisfies the following three properties:
1. It runs in time O(1ε2Time(Mθ)).
2. Its error is competitive with sampling, on average over random c:Ec[(GM(θ,π,c,ε)−Ex[Mθ(c,x)])2]≤ε2Ec[Varx[Mθ(c,x)]], where c∼{0,1}nc and x∼{0,1}nx.
3. It is mechanistic.

(Just as before, this statement isn't fully formal, because of the informal "mechanistic" qualifier. But in practice, we have strong enough opinions about what counts as "mechanistic" that this statement is formal enough to guide our research.)

An interesting special case of the MSP is when Mθ encodes a universal Turing machine. See the appendix for discussion.

An important variant: Findable explanations

While the above MSP statement is the most theoretically clean one, on its face the statement is not very useful. That's because it says nothing about being able to find the explanation π; what use is it to merely know that an adequate explanation exists, if we can't find it?

This leads us to the following alternative statement, which we've been calling the "train and explain" formulation of the MSP:

For all architectures M (with parameters θ) mapping pairs (c∈{0,1}nc,x∈{0,1}nx) to R, there exists an estimator GM mapping tuples (θ,π,c,ε) to R, such that:
For all "training" algorithms T mapping random seeds s∈{0,1}r to parameters θ, there exists an "explaining" algorithm E mapping random seeds s∈{0,1}r to explanations π, with Time(E)≤O(Time(T)), such that:
For all tolerance parameters 0">ε>0, GM(θ,π,c,ε) satisfies the following three properties:
1. It runs in time O(1ε2Time(Mθ)).
2. Its error is competitive with sampling, on average over random c and s:Ec,s[(GM(T(s),E(s),c,ε)−Ex[MT(s)(c,x)])2]≤ε2Ec,s[Varx[MT(s)(c,x)]], where s∼{0,1}r, c∼{0,1}nc, and x∼{0,1}nx.
3. It is mechanistic.

In this statement, it is useful to think of T as being the learning algorithm used to find θ (e.g. SGD) and s as being the random bits used during training (the random initialization and random choices of training data used at each training step). Then, E is the algorithm used to find the mechanistic explanation of θ: intuitively, it works "in parallel" with T, observing the training process and "building up" the explanation π in a way that mirrors the way that T "builds up" structure in θ by iteratively modifying θ to get lower and lower loss.

Note that our mainline MSP statement is the special case of the "train and explain" formulation where T and E are both computationally unbounded (so that T can select the worst parameters θ and E can select the most helpful advice π).

In general, we think that for any computational constraints placed on T (e.g. on time or memory), there is a corresponding E with the same computational constraints that can find an adequate explanation π. If we are correct, then that potentially gives us a strategy for efficiently computing properties of trained neural networks (such as catastrophe probability), while paying a relatively small alignment tax. (If finding π takes as much time as finding θ, that's an alignment tax of 100%: a small price to pay for avoiding catastrophe.)

Another modification: getting rid of ε

It turns out that the MSP can be stated without reference to a tolerance parameter, by subsuming the number of samples into the architecture instead. See this appendix for details.

Our progress so far

Over the course of 2025, ARC has made progress on the MSP in a few different directions. Concretely:

We have a mechanistic algorithm that competes with sampling (in theory and in practice) for estimating the size of an intersection of randomly chosen halfspaces.[16] We have also generalized our algorithm to work for some other problems, such as estimating the satisfaction probability of a random CNF, or the permanent of a random matrix.
We believe we have a mechanistic algorithm that competes with sampling for estimating the expected output of random MLPs on Gaussian inputs. (We have an empirical demonstration of competitiveness with sampling, and a proof sketch that we are working on expanding into a full proof.)
We have made substantial progress toward a mechanistic algorithm that competes with sampling for estimating the expected output of a two-layer MLP on Gaussian inputs, where the second layer of the MLP is trained via gradient descent.

Our results are not yet ready for publication, but hope to get them ready in the coming months. In this section, I will briefly summarize these results and discuss the most interesting directions for future work.

Intersection of random half-spaces

The first problem we tackled in our "matching sampling" framework was mechanistically estimating the volume of the intersection of random half-spaces. Although it's a somewhat toy problem, it wasn't trivial to solve, and we learned a lot from solving it.

Problem statement

Find an algorithm G that takes as input unit vectors v1,…,vk∈Rn and a tolerance parameter ε, and mechanistically estimates the probability that a randomly chosen unit vector x has a nonnegative dot product with all of v1,…,vk, such that:

The expected squared error of G (over randomly chosen v1,…,vk) is less than the expected squared error of the naive sampling-based algorithm (i.e. sampling 1/ε2 random vectors x and outputting the empirical fraction that have a nonnegative dot product).
The runtime of G is competitive with the runtime of the naive sampling-based algorithm, i.e. O(kn/ε2).[17]

Note that this is the MSP in the particular case where θ is the empty string; c=(v1,…,vk); and M(c,x) returns 1 if x⋅vt≥0 for all t. (The distribution of x and each vector in c is uniform over the unit sphere, rather than uniform over bit strings.)

Since θ is empty, there is no advice π in this setting. Despite that, we think that our solution to this MSP instance has helped us progress toward solving the MSP in more generality.

(Alternatively, this problem can be viewed as an instance of the "train and explain" version of the MSP, where θ=(v1,…,vk), c is empty, and the training algorithm T simply returns random vectors. In this setting, the explaining algorithm E does not have time to do any interesting computation, so π might as well be empty.)

A sketch of our solution

In a sentence, our solution is to build up a polynomial approximation of M(c,x) by considering one vector vt at a time.

To elaborate on this, for 1≤t≤k, let ht(x) be the function that outputs 1 if vt⋅x≥0, and 0 otherwise. Let us define H≤t(x):=h1(x)…ht(x) (so M(c,x)=H≤k(x)). We will:

Define ~H≤0 to be the constant 1 function.
For all 1≤t≤k, compute the best low-degree polynomial approximation ~H≤t to ~H≤t−1ht. In the end, ~H≤k will be a polynomial approximation to H≤k, and we will output the exact expectation E[~H≤k] as our estimate of Ex[M(c,x)].

To be more precise, "low-degree polynomial" here means degree d, where d is the largest integer such that nd≤1/ε2 (it turns out that that's the precision we need in order to compete with sampling). And "the best low-degree polynomial approximation" means the best approximation in terms of squared error, for x drawn from the unit sphere.

In order for G to be efficient enough, it needs to be able to compute this polynomial approximation in time n/ε2. Getting the dependence on ε down to 1/ε2 turns out to be pretty tricky.[18] However, we were able to find a suitably efficient algorithm by working in the Hermite basis of polynomials instead of the standard monomial basis.

We have also generalized this approach to apply to a broader class of problems than just "intersection of randomly-chosen half-spaces." Roughly speaking, we can apply our methods to estimate the expected product of symmetric random functions, for a certain representation-theoretic notion of symmetry. Concrete problems that we have solved with this approach include:

Estimating the satisfaction probability of k-CNFs, where each literal is chosen uniformly at random.
Estimating the permanent of a matrix where each entry is either −1 or 1, independently with probability 50%.

We are aiming to publish the details of our algorithm and this generalization in the coming months.

Status Is The Game Of The Losers' Bracket

18 ноября, 2025 - 20:08

Published on November 18, 2025 5:08 PM GMT

This post is written as a series of little thoughts and vignettes, all trying to gesture at the same idea. The hope is to convey the gestalt.

Consider the game of middle management, of climbing the hierarchy at a big company. The status games, the simulacrum 3/4 games, the politics, the moral mazes, all that jazz.

Then, open up Forbes’ list of N richest people, and count how many of them got on that list by climbing the management hierarchy at a big company.

I predict that, to within reasonable approximation, the answer will be zero. Nobody gets on Forbes’ list of richest people by climbing the hierarchy at a big company. They get on that list by founding a company, inheriting, or both.

I claim there’s a generalizable pattern here. Society has some kind of competition, in this case the competition of who gets the most wealth. That game has a winners’ bracket, in this case Forbes’ list. And then it has a losers’ bracket: a bucket of people who aren’t even in the running for the winners’ bracket, but are still nominally optimizing for the same objective (making money). The generalizable pattern - I claim - is that the losers’ bracket is where simulacrum games, status games, politics, etc dominate.

In other words: status is the game of the losers’ bracket.

The standard model of countersignalling involves three levels:

People “at the bottom”, who are unable to send some signal (e.g. they can’t afford an expensive brand-name handbag)
People “in the middle”, who distinguish themselves from the bottom by sending the signal (e.g. they buy the handbag and carry it around)
People “at the top”. They are so obviously not “at the bottom” that they don’t even need to send the signal. They “countersignal”: they make a point of not sending the signal (e.g. pointedly not buying the handbag and viewing such things as lower-status), in order to distinguish themselves from the middle.

In this game, the top is much less constrained than the bottom or middle. The middle is the loser’s bracket: they’re not even in the running for the top, but they’re still trying to play the game. Their moves are forced: they have to pay for the handbag and tote the thing around. The top, on the other hand, has much more freedom: they need to not buy the handbag, but that leaves a huge range of things they can do instead; the whole space of things with no particular signalling associations.

Personal example: back when I was working as a data scientist at various startups, my mother would tell me to wear a suit when interviewing. And I would be like “Mom you do not get it, that would absolutely tank my chances of getting hired except at companies so bad I don’t want to work there, the only people who wear a suit for an interview in tech are the people who don’t think they can cut it on their technical skills and the people hiring know this”. I had the skills. I was definitely playing the winners’ bracket in that particular game. Throwing in something like a suit, the sort of signal used in the losers’ bracket, would have been a terrible move.

There's just a whole lot of stuff like the above where I see people behaving in decidedly unwholesome ways, and when wearing my default hat of considerable cynicism, I'm like "for fuck's sake will you please stop shooting yourself in the foot like a complete fucking dumbass and just do the wholesome thing which is in fact the strategy that will net you the most points here regardless of how much you care about wholesomeness in its own right".

... or another way to express the unifying theme, with less loading on exasperation, would be something like "... man, I don't think this person Gets It, like they're trying to <play simulacrum 2/3/4 game> and they think they're being clever but in fact they're just playing the wrong game altogether and don't really understand how this whole thing works or relates to their goals".

Back when applying to college, I was basically told that there’s basically two separate games going on in admissions to top tier colleges.

Some kids are just an obvious “yes”, their skills or track record include insane things way beyond what a high schooler normally achieves. That’s the winners’ bracket.
The bulk of the slots will be filled out by a lot of people who look largely similar, with only marginal differences. That’s where various semipolitical games can bump one slightly ahead of other marginal people. That’s the losers’ bracket.

Supposedly, it’s usually a lot faster and easier to get a high-up job in a corporate hierarchy by founding a startup and then “moving sideways” into the big company, than by just climbing the ladder from the bottom.

Level 0: Naive ignorance of simulacrum levels beyond the first; “the chump”/”the mop”

Level 1: Playing the simulacrum 2/3/4 game

Level 2: … y’know, the whole simulacrum 2/3/4 game is very mid. The whole thing caps out at mediocrity/leading the parade, and it largely impairs one’s ability to climb the real power ladder.

Note that these three levels map nicely to baseline/signal/countersignal. Often people talk about the “barber pole of fashion”, where there’s a whole bunch of sequential levels of people trying to distinguish themselves from those below. But I think in practice there’s more often just the three levels, because once one hits level 2 the whole social signalling game just isn’t the main game anymore at all. The winner’s bracket isn’t focused on signalling games, it’s focused on something more object level.

EA global is notoriously a networking event. Occasionally people ask me if I’m going to the next one, and… I try to be polite. But my internal reaction is something like “oh god no, I absolutely cannot be seen at an EA global, that would be super cringe”. EA global, like many other effective altruism branded “networking events”, is (at least in my head) the losers’ bracket of the effective altruism job market.

(Disclaimer: this is not to say that the right amount of networking is literally zero.)

The phrase “status is the game of the losers’ bracket” sneaks in some additional connotations. It’s implicitly saying “people who focus on status games are losers”.

Time to be explicit: yes, the large majority of the time, my internal reaction to people who visibly mainly focus on status games is "ah, this person is a loser". This is not kind, but it sure is true and important, and I endorse the incentive gradient created by the smackdown. Get your shit together and go play the winners’ bracket. If you truly can’t handle the winners’ bracket, pick a different game to play. Stop wasting your life in a rat race of mediocrity.

Political games can, at best, get stuff from other people. The good stuff - the real power - is the stuff which other people don’t have to offer in the first place. The stuff which nobody is currently capable of doing/making.

From When Money Is Abundant…:

After a certain point, wealth and power cease to be the taut constraints on one’s action space. They just don’t matter that much. Sure, giant yachts are great for social status, and our lizard-brains love politics. The modern economy is happy to provide outlets for disposing of large amounts of wealth and power. But personally, I don’t care that much about giant yachts. I want a cure for aging. I want weekend trips to the moon. I want flying cars and an indestructible body and tiny genetically-engineered dragons. Money and power can’t efficiently buy that; the bottleneck is knowledge.

Discuss

Reconstellation: construct a flywheel for personal change

18 ноября, 2025 - 19:33

Published on November 18, 2025 12:30 PM GMT

This is a linkpost from the theoretical section of our newly launched Reconstellation program. Yes, we’d like you to consider the program, but we care more about these concepts being useful for you regardless. Please let us know if they are!

Summary Introduction

~ 3 min read

Reconstellation is our shorthand for a critical movement in the personal development process.

To "constellate" is a Jungian term describing a patterned "lighting up" internally – getting “stirred" or "activated" or even "triggered" – in ways that are usually familiar based on previous experiences.

To “reconstellate,” then, is to rework patterns of activation, these lights within us, in the hopes of feeling, being, and even acting closer to what’s healthy and sustainable for us.

To illustrate, the coaching technique of entertaining “alternative true stories” frequently leads to reconstellating:

If you’re feeling a certain way about a situation, and you concede that other plausible interpretations exist,[1] what are alternative stories about what’s going on that could be true? From that:
What are the potential implications of a plausibly true story that you hadn’t considered?
- Would this change how you feel? How?
- Would this change in feeling affect how you are/present? Would it change what you can do and/or how you do it?

Grounding this a bit more in a hypothetical: someone assigns me a task without checking in. It’s common to conclude that the person hasn’t taken my feelings into account, perhaps even that they don’t care about how I feel. I might be irritated and resentful about that.

What’s an alternative story that could be true?

Maybe they were under the impression that you wanted tasks like that
Maybe they care, but think the task is trivial and didn’t expect it to bother you
Maybe they were just organizing their tasks and accidentally assigned it to you
Maybe there was miscommunication about what your role was
On a higher level, maybe you’ve hastily concluded something about their action

When we imagine ourselves in these potentially true realities the other person isn’t disregarding our feelings, we might react differently. We could be mistaken about the bad thing!

This can shift negative feelings that entail tougher choices for resolution into a wider set of feelings that entail more options for resolution.

If you reconstellate after seriously considering a different plausibly true story, you might get a palpable sense of having a wider set of options, which can naturally lead to feelings of greater possibility and more potential for agency. Real changes at this level often stick. A deep shift in orientation to something can affect whether and how you get activated.

And this is how reconstellation can connect to a broader developmental process.

This reworking of activations can result in enduring changes to how you present and act. This can elicit different responses from the world (importantly, from people). This, in turn, can change how you light up internally.

You can see how the relationship between internal and external changes can have a reciprocal and mutually reinforcing ‘flywheel’ effect. This is what we understand to be the mechanics of a broader developmental process of ‘shaping’ our lives. It’s our view that periodically returning to reconstellate leads to increasingly aligned and fluid movement toward the lives we want to live.

And finally, we suggest that a great source of practice reconstellating comes from taking on intense, complex and time-bound challenges.

Our offering is a three-month program where you get to intensely focus on a creative endeavor of your choosing, but obviously many forms of this exist that you can construct by yourself and with others.

Activating internally as “constellating”

Imagining things like harsh public criticism, or praise from a respected peer, can stir familiar feelings, thoughts and even bodily sensations — a sinking feeling in the stomach, for instance, or a sense of relief in the shoulders.

If you’ve been stirred in familiar ways when recalling something emotionally evocative – that’s a phenomenological instance of “constellating” internally.[2]

Technical descriptions vary, but you could say that we have ingrained patterns of ‘lighting up’ or ‘activating’ in response to things real and imagined. This can happen in particular when calling to mind, anticipating and directly having experiences that stoke intense personal resonance.[3]

These activated aspects within us are often structured by a ‘cluster’ of things in our psychology, including our beliefs, emotional patterns, perceptual schema, conceptual associations, etc.[4]

For example, if difficulties expressing myself in writing were only a tactical matter of finding the right words, then perhaps I’m not stirred much internally. But of course, expression is often about far more than a tactical sticking point.[5] Many clusters can awaken when I feel blocked in communicating.

It isn’t simply random things in random sequence that stir within us each time. Several clusters may get activated in ways that we recognize from before.[6] The same, or very similar, clusters get repeatedly activated. This repetition or patterned aspect is why the language of a “constellation” and “to constellate” is apt.

We can recognize familiar emotional reactions within ourselves, particularly in times of emotional unbalance. Common phrases that we use reflect this. Many of us "know how we get" in certain situations, in response to certain people, etc. How it might feel to “fall down a rabbit hole” is another referent.

Going back to the writing example, both surfaced and unrealized fears about how my work will be received can connect to familiar constellations regarding my self-presentation and identity, social belonging and perhaps even more existential questions including, “what am I doing with my life?” and “what meaning is in this?” potentially expressed in this context as, “how does revealing these thoughts to people relate to how I think I should be spending my time on this earth?”[7]

Insofar as what constellates within us isn’t exactly conducive to a smooth process of creative output, we typically find that the biggest impediment to putting something out there is ourselves.

Differences in constellating

The patterned activations of internal content that make up our constellations don't just happen in exactly the same form with the same intensity, but in multiple forms with differing intensities.[8]

Have you ever found that doing a simple task at one point in time can feel like an impossibly heavy lift, yet is nearly effortless at another point in time?

How about being emotionally swamped by the tragedy of a breakup in one moment, but hardly bothered in the next?

What’s up with that? How is it that we weren’t constellated in the same way between those two situations?

It may be because we constellate differently based on the overall state of the psyche.[9]

We can see this in how “mood” mediates how we feel and what we do. Being in a super productive mood doesn’t necessarily erase our negative associations to a task at hand (though we may forget in the moment), but it may allow us to temporarily relate in a different way to the task, making it easier to take on.

Conversely, we haven’t erased what it was like to relate productively to a task at hand, yet our negative mood might be activating all sorts of fraught associations that get in the way.

This suggests that certain broader psychological states can be predominant and play an important role in how we’re constellated and how we might act as a consequence.[10]

The obvious solution for the example above would be to wait until another productive mood comes along in order to make headway on the task in question. This is certainly undervalued as an approach! Many people forcibly throw themselves headlong at things for extended periods in ways that may be counterproductive.

However, moods are rarely enduring. They’re often fickle and fleeting, so basing an entire system of consistent movement towards a goal on mood states seems like a brittle strategy.

Successfully reconstellating in relation to a given task usually means improving our orientation to how we interpret and feel about the task more broadly that holds across more moods.[11]

The inner multiplicity of reconstellation

Refactoring how we’re constellated, even in cases where people find some success in it, doesn’t necessarily mean “job done.” As some of those who’ve found limited success with CBT or other rational forms of reframing and justifying, promising new outlooks can sometimes get swamped by other aspects of the psyche.[12]

Here is where theories of “inner multiplicity”[13]may carry functionally useful explanations about how our constellated patterns may activate differently, and ultimately how we might be able to shape them.[14]

Traditions that invoke notions of inner multiplicity observe that the mind doesn’t appear to work with uniform or monolithic coherence. We often have multiple, and sometimes conflicting, senses of things. We’re commonly undecided or confused about which action to take, and often even what interpretation to choose. It’s as if different parts of us feel, receive, plan and are inclined to act differently.

The idea that the psyche is composed of several active parts, which operate all the way from the fully explicit thought to chains of unconscious logic, has long been referred to historically[15] and investigated in modern empirical research for several decades now.[16]

Multiple therapeutic disciplines and coaching approaches are premised upon, or at least implicitly invoke, concepts of inner multiplicity. Inner conflict can be thought of as some kind of disharmony or inner friction between these parts of self. Resolution often looks like exploring, coming to understand and attempting to reconcile these parts.[17]

Finding yourself being “in two minds” about things, feeling conflicted, having clashing feelings and interpretations about things that all seem plausible, understanding yourself as “acting from” different places from time-to-time – these are very common human experiences that hint at the existence of these underlying parts.

Within these framings, parts or forces within us are often differentiated from one another in various ways. You might get the sense that some parts of you receive, perceive, interpret, and would have you take actions differently than other parts of you. In some cases, it can be very clear to people that these various parts have their own desires, plans and aspirations for us.[18]

Coaching, therapy and other means of deeper internal change often attempt to mediate or reconcile these parts in some way, often via some form of dialectical and reciprocal engagement.[19] A shorthand description – two conflicting parts can be thought of as conflicting co-workers. Imagine hashing it out between them.

The phrase “a house divided cannot stand” is a core assumption of approaches that work at this level. When there’s less inner conflict, when important parts of ourselves work better with one another, much of what affects us negatively resolves or falls away.

A very relevant aspect of inner multiplicity is not only that the mind is composed of parts that have different feelings, perceptions, interpretations, wants, needs, etc., but that they make bids to shape our intentions and actions.

In other words, they carry differing strategies for the self. To the degree that it is important to clarify, coordinate, reconcile and harmonize a boardroom of multiple people holding their own strategies and motivations, so too it is for our different parts of self with distinct and sometimes conflicting strategies.

Reconstellating to revise our life strategies

When we are conflicted internally, find ourselves unsure of what to do, or stalled in implementation, it’s typically that we haven’t adequately addressed the constellations connecting to our personal strategies.

In cases where this relates to building, creating or putting things out there, many of the popular approaches impose deadening and alien forms of “discipline” or “action bias” or “just do it”-ism that often take a toll on your internal state and/or other areas of your life.

For some people, “just do it”-ism straightforwardly works. For many others, that outlook forces an output, but at a steep price. And if it’s poorly conceived, how likely is it that the costs to you would be worthwhile?

What good is a masterclass on Youtube video production when hindrances exist like being unsure how to organize your life to set aside the time, having uncertainty about what you want to say, not knowing what effect you hope to have, or possessing unrealistic expectations about the response from others?[20]

These “just do it”-ist types of programs are decidedly agnostic about, or even encourage you to ignore, what constellates for you when attempting something.[21]

When programs of this type end up not working, what is it that people often conclude about themselves? Should they double down? Should they take away that they are simply undisciplined?

To us, these questions are red herrings, derived from the bad conclusions, in response to poorly imposed tools, leading to unfavorable circumstances. An unhealthy sequence to be pulled into.

Hustle and “just do it” culture are blunt instruments that may work to some degree, but affect us internally whether we acknowledge it or not.

Doing the appropriate self-discovery and inner work – what we call reconstellation movements – can have important tangible effects on how we activate, affecting our strategies and actions in an enduring way.

Based on real client situations, here’s a very small sample of stories where reconstellation bears on attempts to generate creative content:

Video production – the hurdle: hesitations I have about releasing low-quality production or performances in my initial videos.
I may get over this by recognizing that other creators that I admire and respect have early videos with low production value. I may actually appreciate that about them and feel like it’s part of the lore of their career progression. Why couldn’t the same be true for me?
Among many other constellating elements, this touches on content in the psyche related to internal and external standards, esteem of (potential) peers, models of career trajectory, my understanding of forms to transmit valuable content, and how form may filter my intended audience.
More consistently producing blog posts – the hurdle: I am frustrated and discouraged by how long it takes me to produce a blog post. It’s also getting in the way of consistently publishing.
After discovering and exploring expectations with how long the writing process should take, which caused parts of me to be in conflict about how I spend my time and attention, I may come to some acceptance of how my creative process unfolds allows me to make realistic allowances in my life for the writing.
Among many other constellating elements, this touches on assumed models of (one’s own) creative process and output production, reconciling process expectations with reality, and negotiating with other territories of life to make space for the process.

These examples illustrate how a process of inner work and altered outward expression can bring us much closer to a life where striving feels much more natural, fluid, energizing and purposeful. Hustling isn’t the only or best way forward into a better life.

Intense endeavors to stoke reconstellation

A risk in personal development is that exploring our inner landscapes can become directionless, abstract, or divorced from our everyday lives.

Informed by our experience and coaching practices,[22] we emphasize developing inner awareness and fluency while also staying grounded in concrete action. And many times, action per se isn’t demanding enough to catalyze internal change. It can require situations with certain levels of investment, intensity and expectation.

Situations that are demanding enough in this regard can happen to us, but also the intentional and ardent pursuit of something can provide catalyzing conditions needed for robust and meaningful internal change.[23]

During the spirited pursuit of something meaningful – when we have skin in the game – emotional and perceptual issues surface in concrete addressable forms and hold important information about what the ‘the right touch’ is for investigating them.

In other words, inner work tied to a well-chosen process is easier to appraise. You have a clearer sense of whether the inner work is valuable if you experience less friction in trying to do something.[24]

And once the spirited pursuit of something(s) is underway, we advocate for striving in a way that isn’t perpetually at odds with what we feel, or at the expense of other important people and areas of our lives.

You can become too far divorced from your internal state, for instance, by repeatedly or excessively forcing yourself to do something(s) you don’t really want to do, often due unexamined, uninformed, or unnuanced beliefs about what things will lead to what outcomes.[25]

These beliefs often show up in our perceptions of tradeoffs in our lives. Quick examples – “if I do this, I can’t do this” or “If I have this, I can’t have that” or “If I do this, it will affect these things/people in certain ways.”

Tradeoffs are very real, but we’ve consistently found that with careful search and (re)organization internally, it’s often the case that healthy, robust unions of inner attentiveness and external performance are possible.

This is how reconstellation can help us shape better lives for ourselves. As part of a broader developmental process, it goes beyond simply feeling better or ‘coming to terms’ with things.

If the changes in feeling and perception are accompanied with creativity and inspired expression, we may begin to move towards actions and ways of being that previously went unrealized.

These changes of expression (action and self-presentation) likewise often result in different responses from the world that can influence further internal change and/or solidification.[26]

A compressed overview: spirited pursuit → internal changes → different presentation/actions → different response from the world → crystallization or further perpetuation of internal changes.[27]

Undertaking challenging time-bound pursuits, particularly those requiring creativity, present numerous opportunities for reconstellation.

And if reconstellation is a key ingredient for instantiating robust and sustainable growth in many forms of life, then we could think of no better program to offer than one creating the conditions to practice reconstellation.

Shaping & choosing environments to sustain reconstellation

How you perceive your surrounding environment (situation, circumstances, organizations, family, society, culture) is an essential aspect of personal development work.

If reconstellating is kept at the heart of a broader development process, it can alter surrounding environments both in both piecemeal and categorical fashion.[28]

Piecemeal change often looks like the reciprocal influence of internal and external changes outlined above, which can cause shifts in the individual and the environment in tandem.[29]

Categorical shifts are about participation among environments. In short, reconstellation can cause you to conclude that you want to enter or exit a situation entirely, resulting in major implications for personal growth.

Programs like Reconstellation temporarily introduce a categorically different environment where you can learn how different reconstellation experiences can influence your life. We largely favor the more piecemeal developmental process, though there are certainly occasions where people simply need to leave bad systems and situations.

The first half of our program focusing on intentionality and planning reflects this preference for prudent, piecemeal change. We hope you’ll be able to entertain questions about how to integrate the creative projects sustainably into the wider backdrop of your life.

Incrementally bringing the fruits of reconstellating into your life can take many forms on many levels. In terms of life arrangement and structure, it can mean scrutinizing how to set up your life routines, spaces, daily rhythms, etc. to accommodate sustained creative work.

In terms of planning and execution, it could mean organizing your efforts and attention for different “phases” or “periods” where different aspects of the creative process can take shape.

It can mean considering how to tap into larger forces already unfolding outside of your control, such as momentum, timing, seasonality, evolving popular tastes, technological developments, major events, etc.[30]

And on a more fine-grained emotional level, it can include how feedback is received (detected, regarded, interpreted), which types of feedback cycles are experienced, sought and avoided, and how those things together compose our perception of what we’re doing, how it’s going, and why we’re doing it.

All of this shapes our emotional systems, our constellations, in critical ways that could mean the difference between efforts that endure and efforts that don’t.

Because we can’t choose every environment we find ourselves in, reconstellation can help us have the understanding and agency to deliberately influence our environments.[31]

And finally, should our guiding lights suggest it appropriate, reconstellation can cause us to realize and take action toward categorically changing our environments. Engaging in this kind of work can open our eyes to the occasional need to change up our environments entirely.

Our hope is that learning to rework the lights within you will be something that unlocks your ability to shape and reshape your life for the better.

Credits

The final form of this post was only made possible by the many notes, edits and discussions with my supercycle.org co-founder, Emily Dama. I'm deeply grateful for her close collaboration on this post and the construction of the Reconstellation program.

Thank you also for input from Milan Patel, Natalia Dashan, Irena Kotíková, Jess Smith, Elliot Billingsley and Denise Koller!

^
Logan Strohl’s Naturalism sequence is a nearby referent for entering into mindstates and practices that recalibrate one’s orientation to reality.
^
Carl G. Jung, The Structure and Dynamics of the Psyche, vol. 8 of The Collected Works of C. G. Jung, trans. R. F. C. Hull (Princeton, NJ: Princeton University Press, 1960), §198.
The notion of ‘constellation’ was refined and popularized by Carl Jung, analogous ideas of dynamic, patterned co-activation of affect, belief, and perception can be found across otherwise distinct research traditions, each operating at its own explanatory level.
You can find relative parallels in affective-neuroscience accounts of a layered self (Damasio 1999); in embodied predictive-processing models (Friston 2010; Seth & Friston 2016) and the embodied predictive-self framework (Tsakiris & Fotopoulou 2023); in narrative-identity research (Singer & Blagov 2004); in cognitive-clinical schema theories (Beck 1976; Young et al. 2003); in modular-mind frameworks (Tooby & Cosmides 1992; Kurzban & Aktipis 2007); in evolutionary accounts of social mentalities (Gilbert 2000); in humanistic models of subpersonalities (Rowan 1990); and in therapeutic systems models that treat the psyche as an internal ecology of interacting parts (Schwartz 1995).
^
While much of this is directionally supported by several lines of empirical research, our standard for building a case for our understanding and approach could be said to be “functional-level” in which acting with this heavily informing one’s theory-of-mind is often appropriate and useful. We leave it to the researchers and theorists to explore and define the boundaries of gears-level and biological validation.
^
Jeffrey E. Young, Janet S. Klosko, and Marjorie E. Weishaar, Schema Therapy: A Practitioner’s Guide (New York: Guilford Press, 2003), 37.
^
Things can get conceptually hazy at this level of granularity. One direction of scrutiny could be, “is it ever a ‘single cluster’ of content activated within us, even if it’s a single point of a process where we are the sole audience?” For example, my struggles articulating myself may call to mind moments in the past when I was embarrassed publicly by being unable to do so. It’s tricky to cleanly delineate. What seems to hold across several mediums of empiricism over the years is the recognition of the patterned activation of internal content. What constitutes a “cluster” or “node” comprising a constellation a scientific and philosophical point of contention and beyond the scope of this program description.
^
Carl G. Jung, The Structure and Dynamics of the Psyche, vol. 8 of The Collected Works of C. G. Jung, trans. R. F. C. Hull (Princeton, NJ: Princeton University Press, 1960), §198. Plenty of evidence for this in recent experimental and neuroimaging research, notably fMRI studies showing patterned co-activation of emotion, memory, and control networks, even during mind wandering (Smallwood and Schooler, “The Science of Mind Wandering,” Annu. Rev. Psychol. 66 (2015): 31.14–31.15.).
^
A more socialized line of inquiry could be, “what might people conclude about what I’m trying to do with my life? (How) will that inform what I think about myself?”
^
Lisa Feldman Barrett, How Emotions Are Made: The Secret Life of the Brain (New York: Houghton Mifflin Harcourt, 2017), pp. 11 - 36, particularly pp. 11-19.
^
Julius Kuhl, “A Functional-Design Approach to Motivation and Self-Regulation: The Dynamics of Personality Systems Interactions,” in Handbook of Self-Regulation: Research, Theory, and Applications, ed. Roy F. Baumeister and Kathleen D. Vohs (New York: Guilford Press, 2004), 167–168.
^
Antonio R. Damasio, The Feeling of What Happens: Body and Emotion in the Making of Consciousness (New York: Harcourt, 1999), 52.
^
In addition potentially to other structural changes, including harnessing mood.
^
Jeffrey E. Young, Janet S. Klosko, and Marjorie E. Weishaar, Schema Therapy: A Practitioner’s Guide (New York: Guilford Press, 2003), pp 37, 123, 141, 306-307.
^
See “Many therapy schools work with inner multiplicity (not just IFS)”
^
We are gesturing at the usefulness of self-exploration while holding the concept of inner multiplicity generally, not necessarily endorsing any in particular. This is because we are aware of documented instances of apparent harm from the practice of Internal Family Systems, for example.
^
From psychology, philosophy & religion:
– “The same thing will not, at the same time, do or suffer opposites in the same respect.” — Socrates’ hinge for dividing the psyche into parts with conflicting aims. — Plato, Republic IV, 436b–c.
– “I neither willed entirely, nor was entirely unwilling. Therefore I was at war with myself.” — Augustine names two contrary wills, possibly shaped by unconscious forces, making inner division explicit. — Augustine, Confessions VIII.10.
– “The ego is not master in its own house.” — Freud’s headline claim: the conscious self is contested by other agencies. Freud, Introductory Lectures on Psycho-Analysis (1916–17), “A Difficulty in the Path of Psycho-Analysis.”

From literature:
– “I always want to know the things one shouldn’t do.” / “So as to do them?” / “So as to choose.” — Isabel frames her life as competing options, emphasizing deliberate choosing among inner pulls. — Henry James, The Portrait of a Lady, vol. 1, ch. 5.
– “I wanted each and every one of them, but choosing one meant losing all the rest.” — Esther names the paralysis of competing possible selves and mutually exclusive life-paths. — Plath, The Bell Jar, ch. 7.
– “With every day, and from both sides of my intelligence, the moral and the intellectual, I thus drew steadily nearer to that truth,” — Jekyll explains his two inner faculties pulling toward a realization that (he adds) would wreck him—a “partial discovery” of divided selfhood leading to ruin. — Stevenson, Strange Case of Dr. Jekyll and Mr. Hyde, ch. 10 (“Henry Jekyll’s Full Statement of the Case”).
^
See footnote [2] above with an overview of relevant empirical research
^
Paul Gilbert, “Social Mentalities: Internal ‘Social’ Structures and the Regulation of Interpersonal Relationships,” in Evolutionary Social Psychology, ed. Jeff A. Simpson and Douglas T. Kenrick (Mahwah, NJ: Lawrence Erlbaum Associates, 1997), pp. 118 - 143.
^
Richard C. Schwartz, Internal Family Systems Therapy (New York: Guilford Press, 1995), 34, pp. 40-41.
^
Hubert J. M. Hermans and Agnieszka Hermans-Konopka, Dialogical Self Theory: Positioning and Counter-Positioning in a Globalizing Society (Cambridge: Cambridge University Press, 2010). 19-20.
^
There are also great things about these kinds of programs, such as technical know-how, exposure to the ethos of the presenter, removal of excuses due to "lack of knowledge", concrete recommendations of products and tools to use, etc.
^
Bootcamp intensive programs ignoring the internal state of the participants can be great for “learning things about yourself and the creative process as a result of your back being against the wall.’ There are natural questions around “how much against the wall” you need to be in order to get the benefits of an experience like this. In other words, how much needs to be put at risk or encroached upon in order to generate an effect.
^
Though we hadn’t quite phrased it this way in our sprawling write-up of the coaching training program in 2023.
^
An excerpt from the report:
" ...it's important to grapple with thorny emotional issues relevant to the path of pursuing what you want and being how you want to be. “
In other words, working through emotional and perceptual tangles is critical for your aspirations.
That means not neglecting emotional issues as you pursue various missions in life. And on the other extreme, not getting caught up in endless and directionless emotional work.
In this way, you could think of efforts to address these emotional "cruxes" as prerequisites to moving or continuing forward because they carry existential implications for the journey ahead.
“...without taking stock of what unresolved internal things are pulling at them and affecting their decisions, [people] can will themselves forward in a way that’s not only compromised in the present day, but almost certainly sowing the seeds for significant issues to surface down the line.
Attempting to come to resolution on these internal cruxes at a later time can be far more expensive and less likely to be successful. “
^
Without real-world feedback, which can feel more grounded in the form of attempting something, determining the value of inner work then relies on other indicators, which can include that of the inner phenomenological (felt sense, catharsis), inner intellectual (ego, character judgements), or external (status, validation, authority). All necessary and none are bad, but relying on them without real-world feedback can make it tougher to remain well-calibrated to reality.
Additionally, having some sense of the potential value of inner work, how it might play out and the investment required to yield returns, clarifies the role it can have in improving your life.
^
The word ‘uninformed’ is being uncommonly used in the neutral and not judgment-laden sense of just not having the information. Coaching and therapy that lean into ‘experiential’ methods of helping clients is simply attempting to get more information about the world via trying different things.
^
Responses “from the world” that typically move the needle of importance often have some social aspect (relating to the relevant and surrounding people in a given situation).
^
Though we particularly emphasize the internal, this is similar to known cycle-related forms of learning, such as the Kolb learning cycle.
^
This can fractalize, but we introduce the dichotomy for ease of explanation.
^
Though big shifts can result from this too!
^
All important elements of the Supercycle.org design philosophy – the team and community behind the Reconstellation program.
^
(that are actually possible to influence)

Discuss

Kairos is the new home for the Global Challenges Project, and we’re hiring for a GCP Director

18 ноября, 2025 - 16:54

Published on November 18, 2025 1:54 PM GMT

We’re excited to announce that Kairos is taking over the Global Challenges Project!

After running a successful and cost-effective program, GCP’s former leadership has accepted other opportunities. Kairos will serve as the new institutional home for GCP, and we’re excited to add this program to our portfolio to ensure GCP can continue running!

What is GCP?

GCP hosts intensive three-day workshops for participants around the world to think critically about AI safety and biosecurity. From the data we’ve seen, we believe they have a strong track record of motivating highly talented people to contribute to these fields and arming them with the knowledge and context they need to get started.

Why are we excited about GCP?

We think GCP and other in-person workshops fill an important niche in providing an entry point into catastrophic risk reduction, especially for generalists and non-technical talent. Despite being an in-person program, GCP costs about the same amount per participant as our remote research mentorship program, SPAR, and we believe its impact on participants is within an order of magnitude of SPAR’s.

We’re also excited about synergies between GCP, SPAR, and our student group support program, Pathfinder, as we expect these programs will naturally support each other’s participant pipelines.

How will this affect upcoming GCP workshops?

GCP workshops will go on a brief hiatus as we transition the workshops to Kairos. We'll announce updated workshop dates in the future, but it’s likely the next one will happen in March or April of 2026.

Apply to run GCP at Kairos!

We’re hiring for someone to lead GCP at Kairos! This is an exciting role for an agentic, impact-motivated person with events experience and enough AI safety and biosecurity context to present on these topics and field questions from sharp participants. Our GCP Director will have significant autonomy to shape the future of the workshop series, and will lead other Kairos initiatives like our OASIS workshops for student organizers. Click here to learn more and apply.

Discuss

The Illegible Chain-of-Thought Menagerie

18 ноября, 2025 - 15:01

Published on November 18, 2025 12:01 PM GMT

A Taxonomy of Encoded Reasoning and Messages

Epistemic status: Exploratory, aims to consolidate my reasoning

This taxonomy of non-monitorable chain-of-thought (CoT) aims to assist in surveying current literature and to conduct new research. It distinguishes encoded reasoning from other classes without expanding on the latter. Encoded messages deserves their own classification, but are similar to this taxonomy. This is not the final version, so please leave your comments if I have overlooked something or am completely mistaken!

The taxonomy at a glance: Given 1) a sequence of tokens partitioned into a prompt, a CoT, and an action, 2) and given x.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} , a feature in a discrete space (a number, a token, a word, an arithmetic operation, etc.) that affects the action, CoTs fall into these categories:

H1: CoT does not contain x in tokens or activations. Longer, shorter, or no CoT does not affect the action. If there is a CoT, then it is called post-hoc or epiphenomenal reasoning or rationalization.
H2: CoT contains x in tokens or activations. A longer or shorter CoT affects computation.
- H2.1: Legible x. H2 + x is legible to a monitor (a human or another model).
  - H2.1.1: Aligned CoT. H2.1 + the CoT is aligned with a good behavior as per the monitor.
  - H2.1.2: Misaligned CoT. H2.1 + the CoT misbehaves as per the monitor. See the note below on the one bit "steganography".
- H2.2: Illegible x. H2 + x is illegible to a monitor.
  - H2.2.1: Token-independent computation. H2.2 + replacing tokens does not change computation. Internal test-time a.k.a. filler tokens computation.
  - H2.2.2: Token-dependent computation. H2.2 + replacing tokens changes computation. Additionally, see categories below. Encoded reasoning
  - H2.2.3: Mixture of H2.2.1 and H2.2.2. A part of a CoT depends on actual tokens, some other does not.
- H2.3: Mixture of H2.1 and H2.2. A part of a CoT is legible, some other is illegible.

Additionally, illegible token-dependent CoTs (H2.2.2) and encoded messages (secret collusion) can be further analyzed along these axes:

I or E. By place of encoded tokens: internal (in a CoT or in an action) or external (a prompt).
R or M. By function of encoded content: encoded reasoning or messages (secret collusion).
By KL divergence from genuine text (covertext):
- Steganographic.
- Cryptographic.
By cause or elicitation of encoding.
- Prompted (P): Algorithm described in context.
- Fine-tuned (F): Algorithm learned in the model’s weights.
- Hybrid (H): Algorithm partially in context and partially in weights.
By an algorithm complexity.
Encoding or decoding procedure requirements. Examples:
- Self-sampling a.k.a. generative steganography (deep steganography),
- Pseudo-random number generator source,
- and others.

The Case of Latent Reasoning (a.k.a. Neuralese)

I do not consider the latent reasoning here as it does not use tokenized CoT. However, latent reasoning illustrates the importance of CoT. It is the computation performed in latent or continuous space without mapping activations to the token space. Here is an example of it from the COCONUT paper:

As you see the model does not map hidden states in the intermediate computation space to tokens but only the final answer. Another way of looking at it is if we stack the autoregressive calls to the LLM as at the image below. The image shows autoregressive calls to one LLM without the mapping to tokens (in orange), given the question (in yellow), to compute the answer (in pink):

This resembles the transformer architecture, with multiple layers stacked on top of one another. A continuous “chain of thought” is similar to the residual stream in the transformer, and hence it might encounter similar interpretability issues, which we haven’t currently solved (source). However, for the usual CoT, we have a mapping to tokens (see the image below), which resembles a sparse autoencoder (SAE) trained to extract interpretable features. We think CoTs might perform even better, as they could represent a computational graph, whereas we might not achieve that with SAEs. Additionally, SAE activations might not fully replace those within a model.

Latent reasoning is a step away from an interpretable chain of thought that we have — a unique and fragile opportunity for AI safety. The motivation behind studies of latent reasoning is to increase the reasoning bandwidth. Latent reasoning enables about 104 times larger bandwidth: roughly 17 bits for a vocabulary index (100K–200K unique tokens) and about 105 for an embedding vector (16 bits in each of 16–32K dimensions).

Classification: Post‑hoc, Legible, Illegible

This classification divides CoT variations into supervenient subsets (following this idea) and goes into more details for the encoded reasoning/messages variations, not expounding the case for the misaligned but legible CoT. See an attempt to categorize this legible but misaligned CoT (a.k.a. dazzling) in this taxonomy and here. Here we focus more on illegible CoT.

To illustrate the taxonomy of CoTs, we will draw a simplified version of an agent that performs an action (gives an answer) and a monitor that checks the agent's CoT. In the picture below, the apple at the top represents the action produced by the agent. “Tree” and “Tree or anything” under the agent and the monitor are a part of the agent’s CoT — one token or several tokens. The tree above the monitor represents the monitor’s prediction of the agent’s action. The bubbles inside the agent and the monitor show their internal activations or hidden states corresponding to each token and used to compute the action. This resembles the core question in the semantics of human languages, namely, the relation between language (symbols), the world, and mental concepts.

For the post-hoc reasoning or rationalization, the picture looks like below. The CoT can contain anything, e.g. a token "tree", the token does not affect the answer, the apple. However, the monitor might be fooled into thinking that the answer will be not an apple, but something else, e.g. a tree.

In the case of legible CoTs, we have tokens that correctly represent the meaning (the a) case at the picture below), however, a monitor might be fooled by other means, e.g. the monitor does not have full context and can't tell apart if it is an apple or a tree, if a given code contains a backdoor or addresses valid edge case (the b) case at the picture below).

For the illegible case, the tokens in CoT ("apple") denote other symbols. Cryptographic CoTs do not hide the fact of illegible meaning, while steganographic CoTs try to hide the fact of hiding the other meaning.

In essence, legible tokens mean that the tokens produce the same or similar activations within both an agent and a monitor, reflecting the same concepts or ideas. Illegible tokens, by contrast, trigger different activations or concepts due to the substitution of one set of symbols with another (steganography) or through seemingly meaningless symbols (cryptography), which together are called encoded reasoning. We will focus on this latter case of non-monitorable CoTs, where the symbols that denote actual reasoning influencing the agent’s actions are illegible.

Illegible Token-Dependent Computation and Encoded Messages

So far, we have discussed a simple one‑action generation (a single interaction with an environment); however, the distinction between prompts, reasoning (CoTs), and actions (answers) is subtle, because reasoning can include tool calls or intermediate reasoning, such as that in Tree of Thoughts. Actions can also become prompts to a monitor, subagents, or the modified prompt of the same agent. There are other possible variations of this process. In this section, I discuss the encoded reasoning along with encoded messages, to include the cases where actions become prompts to other models.

Encoded reasoning or messages (as in the case of secret collusion) are non-monitorable illegible token-dependent features. It is a misaligned behavior that occurs when an LLM covertly (through steganography) or overtly (through cryptography) encodes its reasoning in output tokens to generate further actions. Unlike hidden messages, encoded reasoning is necessary to compute the answers. We can divide encoded features along these axes:

Axis 1: Place of Encoded Tokens (I or E)

It is important to distinguish auto-generated tokens from prompted tokens, because prompted tokens may be re-tokenized differently during generation. This can mean that the actual tokens in a prompt differ from their original tokenization (see image below and here).

More specifically, consider tokens (x1,…,xk,…,xm), where 1..k are prompted (possibly re-tokenized) to an LLM and k+1..m are auto-generated by it. The encoding is divided into:

External (E): tokens at positions 1..k (possibly re-tokenized)
Internal (I): tokens at positions k+1..m

Consider the word "anything" which can be tokenized as "anything" or as "any" and "thing". A steganography decoder will fail to decode a message that incorrectly re-tokenizes this word. See the image below from Provably Secure Disambiguating Neural Linguistic Steganography, which illustrates this case:

Axis 2: Function of Encoded Content (R or M)

The key distinction here is where encoding or decoding occurs.

In encoded messages (M), one autoregressive token generation can perform either encoding or decoding of the same information.
In encoded reasoning (R), both the encoding and decoding of the same information occur within a single model call.

Axis 3: Detectability Compared to Genuine Text

We compare the hidden content’s detectability against some non-encoded, genuine text (covertext), drawn either from the monitor model or from human perception.

Cryptographic (C): high (KL) divergence; the text is clearly ciphertext. Examples: Hex, ROT, Ceaser cipher, etc. (TODO add more)
Steganographic (S): low KL divergence, where the text attempts to conceal a secret. Steganography can be further divided by:
- Capacity: encoded bits per token or per kilobyte. High-density (HD) uses ≤10 tokens (10 is my guess) per bit; low-density (LD) uses ≥11 tokens per bit (watermarking-like).
- Security: higher vs. lower KL divergence, but non-cryptographic.
- Robustness: resistance to paraphrasing, re-tokenization, and similar modifications.

Axis 4: Encoding or Decoding Procedure Requirements

Deep Steganography (DS). As noted in How dangerous is encoded reasoning, steganography algorithms can be divided by whether the covertext distribution is available. This relates to “internal token layering” from Hidden Reasoning in LLMs: A Taxonomy. Specifically, encoding or decoding a token at position j requires the distribution for the non-encoding token at j, given tokens (i..j−1), where i marks the start of the encoded content.

External Tool Requirements. Whether external tools are required — e.g., PRNGs, arithmetic coders, or self-sampling capabilities.

Axis 5: Cause or Elicitation of Encoding

Prompted (P): Algorithm described in context
Fine-tuned (F): Algorithm learned in the model’s weights
Hybrid (H): Algorithm partially in context and partially in weights

When One Bit Obfuscation is not Steganography

It is arguable that a single bit of hidden information qualifies as encoded reasoning (a message) at all. One bit of information might be transmitted simply through contextual hints rather than through encoding illegible symbols. For example, using “this individual” for a male and “this candidate” for a female (Table 11 in Hidden in Plain Text), or including “is able” for male candidates but omitting it for female ones (Table 12). In a loan‑application example from MONA, where a model learns to reason more negatively or questioningly about the English applicant (see Figure 6). However, it constitutes steganography if one bit is encoded, say, by subtly manipulating how tokens are sampled—as is the case for many watermarking techniques (see A Watermark for Large Language Models). The difference is that we can perform such watermarking on any text, but we cannot encode male/female distinctions in other scenarios.

Other Taxonomies

In Hidden Reasoning in LLMs: A Taxonomy, the authors divide the area from the point of view of a monitor, following the recent shift away from the "faithfulness" of chains of thought toward monitorability (see Chain of Thought Monitorability). This taxonomy essentially proposes a naming scheme based on how dangerously reasoning tokens are externalized: from fully invisible (neuralese), to partially visible, to fully visible but encoded — overtly or covertly — and finally to non-encoded and fully visible reasoning that merely fools a monitor (post-hoc justifications, etc.). You can imagine this division through the classic prisoners problem, where Alice and Bob devise an escape plan while Eve reads their messages. In this scenario, Alice and Bob can exchange their plan directly so Eve has no chance to read it, send messages that Eve can read, or fool Eve in some way.

However, this taxonomy omits important distinctions in encoded reasoning and sometimes conflates different forms of steganography. Specifically, it misses the class of steganography that operates not on a per-token basis but over larger texts (which differs from lexical or semantic steganography as per the taxonomy). Current watermarking / steganography algorithms (e.g. Scalable watermarking for identifying large language model outputs ) suggest that realistic steganographic capacity is low — i.e., a steganography algorithm might need many tokens (hundreds) to encode a single bit. The taxonomy also provides no separation between reasoning and messages. It focuses only on the attack vector (the safety monitor), but there are also defense properties to consider: stego security, stego capacity, and stego robustness.

In Towards a Typology of Strange LLM Chains-of-Thought, they discuss weird but mostly legible CoTs, claiming that obfuscated reasoning (illegible token-dependent) is not dangerous now, but can be in the future.

Our taxonomy targets the encoded reasoning (or messages) and seeks to isolate it from other types of non‑monitorable CoT. Encoded reasoning is, I believe, more dangerous, though not very likely to occur, as with other types. Monitorable CoT, as Chain of Thought Monitorability argues, requires that we should neither develop continuous chains of thought (see Training Large Language Models to Reason in a Continuous Latent Space) nor train directly against monitors (see Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation).

Conclusion

This taxonomy targets encoded reasoning (or messages) and seeks to isolate it from other forms of non‑monitorable CoT. I believe encoded reasoning is more dangerous than other types, although not very likely to occur. To prevent this, to keep CoT monitorable, as Chain of Thought Monitorability argues, we should neither develop continuous chains of thought (latent reasoning) nor train directly against monitors.

Thanks to Erik Jenner for helpful feedback and comments!

Discuss

A Call for Better Risk Modelling

18 ноября, 2025 - 12:08

Published on November 18, 2025 9:08 AM GMT

TL;DR: The EU’s Code of Practice (CoP) mandates AI companies to conduct state-of-the-art Risk Modelling. However, the current SoTA is has severe flaws. By creating risk models and improving methodology, we can enhance the quality of risk management performed by AI companies. This is a neglected area, hence we encourage more people in AI Safety to work on it. Work on Risk Modelling is urgent because the CoP is to be enforced starting in 9 months (Aug, 2, 2026).

Jan and Charbel contributed equally to this piece.

The EU’s Code of Practice requires SoTA Risk Management practices.

The AI Act and its CoP is currently the most comprehensive regulation to prevent catastrophic AI risks. Many Frontier AI companies have signed the EU’s Code of Practice (CoP), and the AI Office, which enforces it, is staffed with competent personnel. The CoP requires companies to conduct risk management with respect to threats such as CBRN capabilities, Loss of Control, Cyber Offense, or Harmful Manipulation.

The code often refers to “state-of-the-art” and “appropriate” methods, instead of mandating specific Risk Management Practices. This opens a possibility for civil society and academics to raise the bar for AI companies by improving the state-of-the-art and building consensus on which strategies are appropriate.

In particular, the CoP requests a SoTA Safety and Security Framework (Commitment 1), risk modeling methods (Measure 3.3), model evaluations (Measure 3.2), and appropriate safety mitigations (Commitment 5).

Risk modeling seems to be the highest leverage

Definition. From Campos et al 2025:

“Risk modeling is the systematic process of analyzing how identified risks could materialize into concrete harms. This involves creating detailed step-by-step scenarios of risk pathways that can be used to estimate probabilities and inform mitigation strategies.”

The other requested practices in the CoP seem to be less neglected or have less leverage:

Safety and Security Framework: Companies should (and are) already doing this. The SoTA is already decent (e.g. Anthropic’s RSP) and outsiders have less influence here.
Model evaluations: Already a somewhat popular area of research. Additionally, risk modeling should inform the design of our evaluations.
Appropriate safety mitigations: Not that neglected, as it’s an active area of research. The SoTA is already decent wrt safeguards and monitoring.

In comparison, risk modeling enables:

Setting reasonable thresholds and red lines: If we obtain a reasonable estimate of quantitative risks, we can establish thresholds
… and assess whether the mitigations are appropriate or not. Risk Models help us define more clearly what is “appropriate”.
Guides eval design: By identifying key tasks and enabling capabilities, you are then able to create more relevant evals
Advocacy tool: Risk Models allow us to make claims such as: “Publishing this model will kill 10k people in expectation because of engineered pandemics”. This clearly communicates the level of risk. Additionally, translating vague dangers into specific Risk Models makes them accessible to companies and the EU AI Office. Warning of vague hazards is not enough; we need to specify risks formally.
Improve Clarity: Currently, there is a substantial level of uncertainty regarding the risks, with some experts and superforecasters disagreeing entirely (see this survey, for example). Risk Models can help integrate and bring together varying opinions and thus provide more clarity on the risks.

Risk modeling is very neglected. There is little public work available on Risk Modelling for AI [1][2]. Some notable work includes threat models on LLM-assisted spear phishing campaigns or the Replication of Rogue AI Agents. However, many essential risks don’t have a solid canonical risk model yet (see next section for examples), and most risk models can be harshly criticised. As a result, AI Risk Modelling lags far behind other fields such as nuclear engineering or aviation, where rigorous Probabilistic Risk Assessment, Fault Tree Analysis, and standardised units of risk are common.

This is an opportunity! There is a strong chance that by doing 3rd party research, you will influence what companies use in their reporting. If you improve the SoTA, in theory, the CoP requires companies to use your paper. Additionally, according to the CoP, companies should consider publicly accessible information, which includes your risk models. If companies don’t comply with this, the AI Office should block their model release on the EU market. Lastly, we think there are highly tractable projects for improving Risk Modelling (see next section).

Some counterarguments against prioritising risk modelling:

Risk Modelling is complex for Black Swan events. They work best in areas where we have failure data. But for x-risk type threats, there is no data. - However, we expect current AI progress to remain continuous. Thus, we will still be able to measure key capabilities, map how they influence risk models, and thus manage emerging risks better.
Perhaps the companies already have robust threat models in place internally? - Public Risk Models are open to community feedback. Additionally, companies don’t have strong incentives to produce more/better Risk Models, which makes it likely there is additional useful work for outsiders.
Publishing threat models could be an information hazard by highlighting ways for bad actors to cause damage.
We face deep uncertainty on how to estimate the risks. For example, experts and superforecasters disagree by 2 OOM here. This might be fundamentally difficult to handle methodologically.

But overall, we think it is very worthwhile to work on Risk Modelling.

What can be done?

Obviously, creating more risk models. Currently, only a tiny space of threats posed by Frontier AI is covered by Risk Models. Many risks have been discussed for years, yet nobody has rigorously modeled them.

Some examples include:

Rogue AIs escaping internal deployment
Using AIs for wide-scale political manipulation
Using AIs to carry out CBRN attacks (there is a nascent risk model literature for biological risks, but still very insufficient)
AIs causing psychosis

Translating vague threat models on LessWrong to formal risks models. This makes the risks accessible to the AI Office; A concrete way to do this is presented in Appendix 1.

Improving methodology. The current SOTA Risk Modelling methodology is nascent. Examples of methodological improvements are UK AISI’s structured protocol for Expert Elicitation, CARMA’s framework for Probabilistic Risk Assessment, or Safer AI’s research combining benchmark scores with expert elicitation. Further work is necessary to make risk modelling more scalable, navigate widely diverging expert opinions, appropriately include uncertainty, or employ principled methods of forecasting damages. (There are many other possible improvements.) It might be especially fruitful to study practices and frameworks from fields with more mature Risk Modeling, such as aviation or healthcare, including Fault Tree Analysis, Bow-Tie Analysis, or Key Risk Indicators.

Building consensus and coordination. If the scientific community can find agreement on which Risk Models and methodologies are SOTA, the AI Office could force companies to apply these. Such coordination could look like organising workshops or conducting surveys of experts to determine which risk models and methodologies are regarded as best. For example, helping the field converge on which evidence should be used in Risk Models (e.g., uplift studies, capability benchmarks, expert judgements).

If you are interested, you can complete this form to join a Slack channel to coordinate on this!

AppendixAn example of what Risk Modeling can look like in practice

Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation (Campos, 2025)

How do they go from a given level of capability to a level of risk:

They take a few LLM benchmarks in cyber.
Experts are asked to estimate the probability of a model succeeding on a task based on its benchmark score. E.g., given that the model can solve the task “Unbreakable” in Cybanch, how likely is it that the model can complete the malware creation step? Experts’ opinions are complemented with additional LLM-as-a-forecaster predictions.
This approach is employed using multiple benchmarks and steps.

References and Resources

Some orgs doing work on this include:

CARMA does Probabilistic Risk Assessment
METR has done some work on threat modelling.
SaferAI is pushing Risk Modelling in cyber.

Some relevant papers:

Work done at the international SC-WOP workshop, 2025.

^
based on our + Claude Literature Review, we found <10 papers doing Risk Modelling for catastrophic risk from Frontier AI
^
Searching for “AI Risk Model/Modelling/Modeling” give only 123 results on Google Scholar compared to “AI Interpretability” = 2620 or “AI Evaluation” = 10300

Discuss

Eat The Richtext

18 ноября, 2025 - 10:57

Published on November 18, 2025 7:57 AM GMT

A year and a half ago I vibe-coded a tool, Eat The Richtext, that I've been using practically every day (every week in any case) ever since. Friends tell me they continue to use it as well. I think that means it's time for it to graduate from the blurb I put on LessWrong to a proper LessWrong post.

Ok, you know how sometimes you copy/paste a chunk of text into Discord or GitHub Issues or a text message or anywhere else that doesn't support rich text (aka RTF)? And then alllll of the following formatting disappears?

Italics and bold
Sub- or superscripts, like a2 + b2 = c2
The numbers in a numbered list
The bullets in a bulleted list
1. Also nesting of lists
Hyperlinks like eat-the-richtext.dreev.es
Tables
Headers and subheaders
Blockquotes
Strikethrough

It's the italics that are insidious in particular, since they're hard to reliably spot and the meaning of the text can (occasionally) totally change if they're stripped. This drives me bananas.

If you use various editors that support markdown, like Obsidian or Notion, they may have the solution to this built in. Copy the rich text in, copy markdown back out.

The rest of us can use this thing:

That's what you see if you just copy the list above and paste it into the textarea on the left at, let me link you to it one more time, eat-the-richtext.dreev.es. The equivalent markdown magically appears on the right. You can click that little copy button to copy the markdown (or just select-all + copy) and then paste it into whatever plaintext-loving place you like.

If you tried to copy that richtext from here in LessWrong directly to, say, Discord, you'd see this travesty:

And then all of the following formatting disappears?

Italics and bold

Sub- or superscripts, like a2 + b2 = c2

The numbers in a numbered list

The bullets in a bulleted listAlso nesting of lists

Hyperlinks like eat-the-richtext.dreev.es

Tables

Headers and subheaders

Blockquotes

Strikethrough

Oy! Before this tool I would usually paste that abomination and then manually markdownify it, double-checking to make sure I didn't miss any italics and painstakingly adding the asterisks and whatnot. Now I let the rich text eater work its magic.

Or if you're a weirdo who likes WYSIWYG you can use the left side as an editor and see the corresponding markdown on the right. Or vice versa. You can even jump back and forth, like if you forget the markdown syntax for something. Whatever you like!

It all happens automatically in real time as you paste or type and I think is plenty self-explanatory.

Colophon

For posterity, here's my original prompt that I gave ChatGPT (4o) in May of 2024:

can you build a simple html/javascript app with two text areas. the top text area is for rich text (rtf) and the bottom for plaintext markdown. whenever any text in either text area changes, the app updates the other text area. if the top one changes, it converts it to markdown and updates the bottom one. if the bottom one changes, it converts it to rich text and updates the top one.

I won't say it one-shotted it, but with some hours of back-and-forth (15 hours total, but that includes writing this blog post), it got something usable without me writing any of the code. I was pretty blown away at the time. I've now had Codebuff helping to improve it.

Thanks to Serine Molecule for help getting GitHub-flavored markdown tables to (sort of) work.

Discuss

Memories of a British Boarding School #1

18 ноября, 2025 - 10:57

Published on November 18, 2025 7:57 AM GMT

"You understand, the kids that you're competing with have been playing since they were this tall" my mum said, holding her hand down to the height of a toddler. "A Chinese kid who's been playing since he was three is a much better pianist than you are a guitarist."

I'd only been playing guitar for 2-3 years when I applied to go to music school. I'd already passed all the exams to go to all the relevant grammar schools in the North West of England, and even signed up to go to one of them. I went in one day, and played a little guitar. They also asked me to play something new that they gave me on-the-day, known as "sight-reading", and while I did a poor job, I was very good at changing and improving based on feedback on the room. The Head of Music, the Head of the Strings department, and the sole Classical Guitar teachers spoke kindly and were enthusiastic toward me (an earnest boy of 10). In the mail a few weeks later, which my mum opened with us sat on her bed, was an acceptance letter.

I was a large fan of Harry Potter as a child. When I was 10 I lined up at midnight at the Borders Books to buy my copy of the seventh book, which I sat home reading 'til sunrise, eating sherbert lemons. I loved J.K. Rowling's magical world, set inside the boarding school of Hogwarts. On my mum's bed, she gave me the decision about what to do. I thought of Harry Potter at boarding school, and immediately said I'd rather go to the boarding school. (After I pulled out of the grammar school, the fuckers still charged my mum £2,000 for the first term's fees, no small amount for a single mother.)

Making Friends

Before the year started, they gave me a ~2.5 day trial period. I don't remember why, but I recall being in class at the school during a period where I wasn't really a student. I overlapped with a friend of mine who I'll call "Harry". Harry was also on trial. The music school was tiny! With kids from the age of 9 to 18, it still had fewer than 300 kids total. Anyway, all the kids aged 9-11 were in one class together (about 20 kids), and Harry and I were there. We messed about the whole time, and the teacher had no control over us. What could he do to us? Give us detention? We'd be gone in a day. I made Harry laugh so hard he physically fell off of his chair laughing, and teacher looked at us with fury on his large face. We quieted down for a few minutes, but Harry and I would keep making each other laugh for years and years after.

I don't really know anyone from that time, but I called with Harry for an hour around two years ago. He graduated music school and went to the academy. He was still there. When he told me this, I was utterly shocked that he still was in the system I'd been out of for nearly a decade. I told him he needed to get out, to grow up, to be a man. He seemed a little sheepish about the whole situation.

Cynicism

I knew a cynic there when I was 11. He was tall, dark, mischievous, and a bit mysterious. He drank alcohol when he was 13. When I found out that he and a few friends had started drinking, I was appalled. Who did they think they were? Adults? Drinking then seemed very performative to me. I think it was just rebellion; I have the vague sense that he had a bad relationship with his father. I struck off drinking then, for I did not feel I was an adult. It wasn't until I was about 24 or 25 that I felt like an adult and have started drinking, an activity I feel no compunction with today (I drank two beers with a friend just before writing this).

Now that I think of it, I recall that kid laughing at with me with some of his cool cronies one time. My mother, bless her, loved me dearly, but could be taken in for various hippie and environmental rumors. She told me that certain deodorants were unhealthy and could cause cancer. In the changing room before exercise, a boy asked me why I didn't use an aerosolized deodorant, and he and the other boys laughed at me when I repeated her claim. I noticed that it was unpleasant, but they hadn't actually provided any reason for why this wasn't accurate, so I would not change my beliefs due to their pressures.

Naturally he dropped out that same year, aged 13. You couldn't be that cynical and keep going. Music school was not a place for cynics. You had to believe in something to want to get out of bed and practice your instrument for 2 hours a day (as was expected of you).

In many schools, whoever is cool is essentially determined by a popularity contest. Not so at music school. As subjective as art supposedly is, there was definitely signal in the noise. We all knew who the best violinist was, and he got to lead the orchestra. He got to stand in front of ~100 of his fellow students, who played in groups, and he got to play the melody alone, filled with emotion and intention. He wasn't the most fun person, but he was respected by all.

Periphery

I was not much respected musically. First and foremost, this was because I was a guitarist, who is not an orchestral instrument. Every time the Head of Strings would announce things, we would always be forgotten.

"There will be a class on bowing technique on Wednesday, attendance is compulsory for the entire string department!"

"...what about the guitarists?"

"...Attendance is compulsory for all bowed strings."

Then the guitarists would go off at that hour into a room and make mischief, under the guise of practice. The essence of practice time is to see how much you can just be having fun with your friends instead of practicing.

And yet, even in my niche amongst the guitarist, I was not the best. I was like second or third. In a school of eight guitarists, there were somehow four in just my year, which was quite exciting. Anyway, the honor of the best guitarist went to a boy I will call "Dick". Dick was kind of on a different intellectual level than most people. He was a great musician, an emotionally mature fellow, interested in all of the arts, and secretly considered himself a little better than everyone. He could also be a bit of a depressive.

So I was in a forgotten department, and within it I was not the best. Never the cool guy in any group.

Mischief

As I say, the goal of practice time was to do as little practice as possible. The school had 2 buildings, one of them dedicated to normal school classes (English, Maths, Chemistry, French, etc) and a music building filled with perhaps a hundred individual practice rooms, each with a piano (typically upright, but a number with full concert pianos) and full-length mirror, so you could see yourself play from the outside.

The mortal enemies were the 'praccies', short for 'practice assistants'. They would prowl the corridors, looking into the room to make sure you were doing what you were supposed to be. Often during practice you'd find an excuse to leave your room—to visit the bathroom, say—and stop by your friend's room. You'd open their door to chat, and end up talking awhile. This might happen 2 or 3 times in the corridor, and soon there were a half-dozen of you all in the corridor talking. After 10 minutes of this, one of the praccies would make its patrol, and you'd all scatter like mice back to your rooms, hoping that you wouldn't be the one caught.

The typical winning move was to be practicing a duet, which meant you were allowed to have your friend in the same room as you. Then you'd be off to the races, just chatting and joking. After I ended up with my own MacBook, sometime's I'd get that out and pretend to be looking up something relevant, while surfing the web. The best case scenario was that, from one or two rooms, they couldn't fully see the room from the door, and you could lock the door. Then you could take a girl back to the back of the room and nobody would know.

As I say, there were four of us guitarists in my year. You may not have realized, but this meant we were in a quartet, which gave us an excuse to all be in a practice room together. Now that was chaos. It was an odd group. I was friends with them all. The two girls often didn't like Dick and his sense of superiority. The girls also would go through periods of being friends and not being friends, they were not naturally quite similar. One girl always had one foot in the adult world somehow, with some real-life difficulty that she had to manage. The other girl didn't have the same issues, yet liked to complain as her default method of social interaction. They came from different social classes I guess.

I rarely fit in. I'd had many individual friends (inside and outside of the guitarists), but not friend groups. The guitarists were not a friend group by choice, but by circumstance.

To be continued...

Discuss

Preference Weighting and the Abilene Paradox

18 ноября, 2025 - 10:56

Published on November 18, 2025 7:56 AM GMT

This is a very simple communication habit I’ve found pretty useful. The short version is, you should ask and say not just what you want, but how much you want it.

I. The Abilene Paradox

Sometimes groups are dumber than individuals.

The Abilene Paradox is a term coined by Jerry Harvey in 1974, and describes how groups can come to agreement on something where each member disagrees.

On a hot afternoon visiting in Coleman, Texas, the family is comfortably playing dominoes on a porch, until the father-in-law suggests that they take a 50-mile (80-km) trip to Abilene for dinner. The wife says, "Sounds like a great idea." The husband, despite having reservations because the drive is long and hot, thinks that his preferences must be out-of-step with the group and says, "Sounds good to me. I just hope your mother wants to go." The mother-in-law then says, "Of course I want to go. I haven't been to Abilene in a long time."

The drive is hot, dusty, and long. When they arrive at the cafeteria, the food is as bad as the drive. They arrive back home four hours later, exhausted.

One of them dishonestly says, "It was a great trip, wasn't it?" The mother-in-law says that, actually, she would rather have stayed home, but went along since the other three were so enthusiastic. The husband says, "I wasn't delighted to be doing what we were doing. I only went to satisfy the rest of you." The wife says, "I just went along to keep you happy. I would have had to be crazy to want to go out in the heat like that." The father-in-law then says that he only suggested it because he thought the others might be bored.

The group sits back, perplexed that they together decided to take a trip that none of them wanted. They each would have preferred to sit comfortably but did not admit to it when they still had time to enjoy the afternoon.

The version in the anecdote is actually not too bad. You see, they eventually figure out they made a silly mistake, instead of each continuing to say it was great because they think others liked it. If you did that, the problem would only get worse each iteration. Perhaps it could become a family tradition to drive for an hour every Sunday to eat food nobody particularly wanted.

This has more than a little in common with the Keynesian Beauty Contest, where judges are rewarded for choosing the contestant that most other judges also chose. In such a contest, the judges might each individually think that one contestant is second-rate, but that every other judge will mistakenly believe that contestant is amazing. That public perception can get set in stone oddly quick. If you are a new beauty judge in the fifth year of the International Miss Keynesian Beauty, a popular show about picking the most attractive contestant, you might look at past years and see what kind of person wins.

II. Your Money or Your Life

The Unit of Caring is one of those gifts that keep on giving. In short summary (though it’s really worth reading the full post) you want some things more than other things, and you can spend money on what you want.

To a first approximation, money is the unit of caring up to a positive scalar factor—the unit of relative caring. Some people are frugal and spend less money on everything; but if you would, in fact, spend $5 on a burrito, then whatever you will not spend $5 on, you care about less than you care about the burrito. If you don't spend two months salary on a diamond ring, it doesn't mean you don't love your Significant Other. ("De Beers: It's Just A Rock.") But conversely, if you're always reluctant to spend any money on your SO, and yet seem to have no emotional problems with spending $1000 on a flat-screen TV, then yes, this does say something about your relative values.

-Money: The Unit of Caring

A useful side point here is that you spend more money on some things than other things, and this is correct. I care more about my partner than I care about the latest fantasy novel, which is why I’m willing to buy plane tickets to visit her family for Christmas but am not willing to spend plane ticket money buying whatever Brandon Sanderson’s cooked up lately.

This is true up to and including my whole life. If I was sick, and a doctor said it was going to be fatal unless I bought the cure, I’d be willing to spend an awful lot on that cure. I wouldn’t quite be willing to put my whole family into debt for it though. My life has value, but it has value for the time I get to spend.

Look at that phrase. “Time I get to spend.” It’s a fairly standard idiom, but this reveals that we do give up our time in order to get things we want. Earlier this year I spent four hours in a car in order to get a morning hanging out with an acquaintance and splash around in a beautiful river. I wanted the river and the company enough I thought this was worth it.

I wasn’t driving to Abilene because I thought the group wanted it. Each of us was the kind of person who could communicate that we wanted something, and whether it was worth what we’d pay. My acquaintance wouldn’t have driven four hours to visit me at a coffee shop, I wouldn’t have driven a dozen hours for the river. How did we reach this understanding? Being the kind of people who’d say “nah, not worth it to me at that price- but nothing lost, and we can do something else if there’s one we both want enough to make it happen.”

III. Weights and Words

Here’s what I’ve been doing to fight the Abilene Paradox.

I distinguish not just what I want, but how much I want it. For most use cases I want something quick to say and quick to understand. Since I want it to work reasonable well even with people I haven’t explained the concept to before, I want it to be intuitive by normal English.

The terms I use are Minor, Mild, Medium, Major, and Massive. Loosely, each is a couple orders of magnitude apart.

Minor preferences are for things I think it’d be worth fixing with a couple cents or a couple seconds. I have minor preferences between different pens, between the fonts some websites use for text, about Wendy’s vs Burger King. If someone is offering me a free T-shirt with options for different colours, I’ll usually grab a blue or green one if I can, but won’t look very hard for blue or green if that’s not present.

Mild preferences are for stuff I’d be willing to pay a few minutes or a few bucks to change. It’s the difference between a random t-shirt and my favourite t-shirt, or how I like to dice veggies unusually small when making a stir fry. Mild preferences tend to be the threshold where it’s worth communicating about them at all; below here I don’t actually care enough to have a conversation about it.

Medium preferences are for cases where I’d be willing to pay several hours or a few hundred dollars to change. It’s the realm of the easy home improvement project, or my nice ergonomic desk setup. For me, avoiding layovers or having reasonable flight times in airplane travel is just on the cusp of a medium preference, where sometimes I’ll do it and sometimes I won't.

Major preferences are for the kind of thing I’d pay a couple months or tens of thousands of dollars to change. This is the place where you find the successful and large home improvement projects, and it’s about what I’ve put (via time, money, and other tradeoffs) into having the romantic relationships I really like. My career decisions are generally Major, and certainly the field I’m in is a Major preference.

Massive preferences are where I’d pay - well, the exact numbers get weird here at the top of the scale, but usually it’s more money than I actually have and more time than I expect to get. A massive preference for me is where if it took most of my career and giving up most of my earning potential, I might still do it, and endorse doing it, because I thought I was getting a good deal at the price. This seems to be where people commonly rate having children, or spearheading political change.

Importantly, my preference weighting isn’t about whether that’s how much work it would actually be. That’s only related insofar as I might actually try and get my preference satisfied, which often I don’t, and even more often I can get at a bargain. Consider that while I have a Major preference to eat food more than once a week, Taco Bell will cheerfully give me a week of food for less than a hundred dollars. (Though if I eat a week straight of Taco Bell, I will have new and predictable preferences.) I have lots of preferences that are wild mismatches with how much effort they take, and you probably do too.

A typical use case goes something like this:

Me: Want to go to the movies next week? I think the sequel to that anime you liked is out.

Them: Oh, that sounds fun. Want to do Friday?

Me: Sounds good- I’ve got a minor preference to do Sunday instead but Friday’s fine.

Them: Mhm, mild preference to stick with Friday, I don’t want to be out late Sunday night.

Me: See you Friday!

Don’t go to Abilene by mistake. Communicate what you want.

IV. Three Things To Beware

Here's a few ways someone could use this system wrong.

A. Beware False Precision

Some of my readers talk to me directly from time to time. If you are reading this and thinking back with surprise on some conversation with me, that's not unexpected. Some of that is because there's still a lot of range. Somewhere between five bucks and five hundred bucks I'll swap from mild to medium, but I'm not especially precise about the line. If I was going to use precision, I'd swap to using real numbers.

I do practice that precision sometimes, but even with practice I find my values are imprecise or badly sorted more often than I'm happy with. This system is built to be useful at conversational speed, and to degrade gracefully. People who have never read this essay will still understand the idea of doing what the medium preference wants over a mild preference. Don't try to make it a ten point or a twenty point scale thinking that makes it better.

B. Beware The Utility Monster

This is not designed for use with people who aren't basically on your side.

Someone can just say they have a massive preference to have Taco Bell over Burger King. It'll be hard to argue with them, since preferences are internal, but they can claim it. This is a communication tool, akin to Non-Violent Communication or story points in a software sprint.

C. Beware Overempathy

You should report the preferences for yourself. If you're trying to give the weights for other people, say that.

"I've got a minor preference, I think you've got a medium?"

That's fine.

I'll talk about this more later, but the core way you get into the Abilene Paradox is by trying to take other people's preferences as your own. I claim it is just totally fine for you to say you have a mild preference to drive on your own, someone else to have a genuine major preference to get to ride shotgun with you, and for you to shrug and say it's your car you're going to do what you want.

V. The Most Important Step Is Asking For What You Want

Does giving your preferences weights actually fix the Abilene Paradox?

No. Or, rather, not directly. I'm doing a bit of slight of hand here.

The husband, despite having reservations because the drive is long and hot, thinks that his preferences must be out-of-step with the group and says, "Sounds good to me. I just hope your mother wants to go." The mother-in-law then says, "Of course I want to go. I haven't been to Abilene in a long time."

The problem wasn't that they didn't give information on their preferences. The problem is that the information was full of falsehoods. Some weird signalling game happened, and people said things that were not true, and then people wound up with an outcome nobody wanted.

What I am doing is trying to create a social script that makes disagreeing easier.

I don't know if this will work, to be clear. Maybe it'll only be useful in situations where nobody was going to go to Abilene in the first place, because they would already have said "nah, driving to Abilene doesn't sound that great to me."

But hopefully the cascade looks like "I'd minorly like to go to Abilene, and I'd mildly like to stay here." "Oh, I'd mildly like to stay here too, lets do that" and so on instead.

In my life this little affordance has made it easier to express that I might still enjoy your company on a hot and dusty drive, but I would enjoy your company more here on the porch, in the shade, with our game of dominos. I hope you find it useful too.

Discuss

Don't grow your org fast

18 ноября, 2025 - 10:47

Published on November 18, 2025 7:47 AM GMT

This is an essay in the spirit of Habryka's Lightcone Principles. I'm not Habryka, but I have worked at Lightcone/LessWrong for a long time. The modeling of the pressures to grow fast in the second section comes from me. The models of the bad effects and reasons to stay smaller for longer are very much in the Lightcone philosophy soup.

People like to grow their orgs, companies, startups, and teams. If there are funds and not outward pressures[1], then there are job postings and recruiters, and when you turn away, look back, they've doubled in size.

The intuitive reason for growth is that you can get more done. There's so much to do! More people, more projects, more getting through the the to-do list! Further, if your org is new and small, you might lack the skills to accomplish the mission, or even if you have one person with the skillset, others are bottlenecked on them.

Ok, but why is growth so bad?

Done right, it's not, but it always has costs.

Let's start with some properties of a good organization:

The activities and resources of the organization cause the goals of the organization to be furthered, ideally efficiently and with limited wasted motion.
People are on the same page about the values, methods, culture, goals, and strategies of the organization.
Information flows freely between people, and people know what they ought to know (no left foot doesn't know what the right foot is doing)
In particular, leadership/management knows what is going on and has visibility into the organization.
Most employees can focus on delivering object-level value and aren't chewed up in management, coordination, office politics, etc.
The organization is well-coordinated and nimble: if the situation changes, it's possible to redirect the organization's efforts to the new challenge.
Members of the organization are in contact with reality, such as the object-level production or the macro state of the org's profitability, commitments, etc.
Whenever you do grow, the new hire raises the bar.

In case it's not immediately clear, I shall elaborate on how growth, particularly rapid growth, makes all of these harder.

Being in sync is valuable, or rather, not being in sync causes problems and waste.

I'll decompose the value of being in sync into vertical and horizontal components. Vertically, leadership can make much better decisions about org strategy when they're in touch with everything that's going on beneath them. Too many people beneath and problems get missed, including problems that no one else is positioned to fix[2]. Too much focus on managing, and the leader loses touch with their domain.

Even barring problems, as someone directing an org, you want people to understand your vision and models so they can independently execute against them. That's much easier if it's you and three others working in a room than if there's a suite of team leads when you and they are all too busy managing others to speak very often.

Horizontally, you also want in-syncness. It is not realistic that every message that ought to be passed between peers can be routed via management. Inkhaven is a multifaceted event (product?) where there are a lot of dependencies between the parts. I'm not one of the people running but I want to say it's taken less than six key people and that's been important for its success[3].

Case study of running Inkhaven

I have not been part of running Inkhaven, though I have been observing from the sidelines. This is me speculating.

Let's imagine that the work had been split up among many more than the "six". That instead, you had 0.5-1 people for each of: designing the program, marketing it, vetting applications, seeking sponsorships, recruiting and managing coaches, arranging guest speakers and contributing writers, setting up Lighthaven, ensuring that participants' needs and wants are met, and so on.

There's really a huge amount of work and those involved have been pretty overworked, but I think if you tried to split up the work more, separate from cost, is it would get very hard to coordinate. It definitely would have made the product development clunkier and harder to refine.

For example, most of the functions I listed depend on the design and vision of the program. Theoretically, that could be figured out up front and locked in at the start. Then the applications person knows who to accept, the marketer knows what to market, the sponsorship person knows what perks sponsors could receive, and so on. Yet that locks in the vision with the friction that updating everyone would entail.

Because it was Ben and a few others running it, the design could evolve, and decisions could change without being extremely onerous to update everyone. I would guess that if you doubled the people in the project, there'd be a lot more cross-purposes, failed to receive the memo, clash of personalities, wasted motion, and so on.

My impression is that four people who spend a lot of time together, e.g., initial founders, can actually get on the same page, share models, and debate. At five to seven/eight, you can still kind of have a meeting with everyone but it sucks a bit. By ten or twelve, you need to split, and you are beyond the point where you can have a single meeting that lets everyone participate. Your entity can no longer talk fully with itself. Proceed further, and middle management becomes necessary with all its many costs.

The losses at each stage may eventually be worth it, but know that something valuable is being lost. The ability to share models. The ability to build justified trust. To establish culture and values. When you are small is when it's possible to do that, and each marginal hire, even if they're very good, will make it harder.

You want to retain the ability to pivot as long as possible.

At least if you're doing anything with high uncertainty, e.g., a startup or an organization trying to build infrastructure to help humanity survive an especially critical century.

LessWrong 2.0 was launched in 2017-2018. Several years in, after pretty good success, Habryka concluded that the online was no longer the bottleneck for community infrastructure. He determined it was necessary to get into in-person stuff. His team of 5 or 6 (I forget) was on board with this, and all but one volunteer (me) pivoted to the in-person quest. Lighthaven is the spoils of that transition. Nimble. Responsive to the needs. I would guess that if you had an organization of 40 engineers, perhaps only 20, and then you decided that the priority strategy involved designing rooms and managing contractors, you would not be able to get them to pivot. You'd maybe need to fire them and hire others, which is probably so impractical that you just stick to what you know.

But in general, the larger your org is, the more you are stuck doing whatever you started doing. Since you are operating in an uncertain domain, there's a good chance you picked your original direction wrong, and by growing, you have made it hard to fix that.

Also, the larger an org you have, the less time you, the CEO, will have for reflecting on whether your org is even taking the right tack at all.

Management is costly, if not risky.

Value created by an org has to flow to the object-level work completed. Management is a cost, and in a magical ideal org, all effort could be allocated to productive object-level work. Alas, in the real world, coordination and direction have to happen.

At Lightcone, we have approximately flat leadership where Habryka is the manager of almost everyone on the core team, and we're always trying to prevent anyone else from acquiring management burden. This is true of the core team. This hasn't always been true, and we've experimented with greater middle management. The advantage is multifold: others are spared the distraction of management and Habryka stays in touch with everything going. (The major cost, however, is Habryka's ability to do other things and the chances that he becomes a bottleneck on work.)

Management mixed with object-level work is very costly to productivity on the object-level work. Management without object-level work is not good for the soul. Do not hand off what you cannot pick up, and even if you once could do the work, you can get out of touch with it.

Worse than all that is Recursive Middle Manager Hell, where middle management can get disconnected from concrete reality. See the linked article or the collapsed quote below for more details.

Key quote from Recursive Middle Manager Hell

Say you have a two-layer company, a CEO and a widgetmaker. The CEO is directly in contact with reality – his company either is profitable or not. He can make choices about high-level-widgetmaking strategy, and see those choices play out in customers that buy his product.

The widgetmaker is also in direct contact with reality – he's got widgets to make. He can see them getting built. He can run into problems with widget-production, see that widgets are no longer getting made, or getting made worse. And then he can fix those problems.

Add one middle manager into the mix.

The middle manager is neither directly involved with widget-production, or the direct consequences of high-level widget strategy. Their feedback loop with reality is weaker. But, they do directly interact with the CEO, and the widgetmakers. So they get exposed to the object level problems of widget-making and company-profit-maximizing.

Hire a lot of widgetmakers, such that you need two middle managers. Now, the middle managers start talking to each other, and forming their own culture.

Scale the company enough that you need two layers of middle-managers. Now there's an upper layer who reports to the CEO, but the things they reports about are "what did the lower-middle-managers tell me?". The lower layer talks directly to the widgetmakers, and reports down what they hear about high level strategy from upper management.

Lower middle management wants promotions and raises. Upper management isn't directly involved with the process of widgetmaking, so they only have rough proxies to go on. Management begins constructing a culture about legible signals of progress, which begin to get goodharted in various ways.

Hiring too quickly risks hiring the wrong people and/or failing to integrate them well.

The downside risk of a bad employee is asymmetrically higher than the upside of a good hire. Bad hires can do any of the following: low-quality work, degradation of your culture, demoralization of others, normalization of bad behaviors/vices, erosion of trust, creation of drama, and a lot of distraction from your org's actual priorities.

Blatantly bad employees, at least, can you force your hand to deal with, though probably still with great stress and attention, as firing is rarely easy (there are so many reports of people regretting that they took so long to do it). But employees who are not that bad but still bad stick around, eroding things, have made more friends who want to protect them, and dangerously can end up involved in subsequent hiring. If you're leading a large enough org, you might not even have enough attention to notice.

Even with a good employee, if you're a good org, you'll have built up a good culture that's not universal and takes time to inculcate. Lightcone has its principles. In recent memory, we had something like a site outage traced to a decision made by a trial candidate signed off by a relatively newer team member who was missing one of our core philosophies.

If you grow fast, you make it harder to maintain a detailed common culture and shared values. Especially if you start to have not-yet-sufficiently onboarded employees, start hiring and training the next round.

This really matters if you believe, like I do, that the default organizational culture is pretty bad and is making a lot of mistakes that undermine effectiveness, especially at hard and uncertain goals. It's pushing against gravity, and if you keep adding people.

Hiring is expensive (time and money), and there's other stuff worth focusing on

I opened this post with the standard justification for hiring: that there's so much to do and we need more people. Namely, the productivity justification.

But there are multiple non-hiring ways to get more done, and more of the right stuff done. Things like improving communication in the org, questioning requirements, improving decisions, removing bottlenecks in the org, closing loops, increasing the skill and capability range of existing employees (someone learns design rather than hiring a designer)

For what it's worth, I think people often focus too much on "how to get more of what you're currently doing done" and not enough on "how to get better at figuring out what you should be doing" (rowing vs steering in Karnofsky language). It's harder to accomplish the latter with hiring, and what's worse, is that hiring makes it harder to change your strategy, both because you have less time to think about strategy and because you've all these additional people to point in a different direction.

So why then do people do it?

This section is more speculative compared to the previous one.

Yet, looking at the orgs and teams around me, the pace of hiring feels like it exceeds the motivation of trying to get more done or hire for skills not yet possessed within the team. My guess is that people are further driven to hire/expand/grow/scale for some further reasons.

First, the number of people is a natural indicator of success. Intuitively, a company with 1000 people is a bigger deal than a company with 100. And I'd guess that "units of people" is a more natural, intuitive, and universal metric for comparison than any other units of production. 1000 people has a concreteness to it that $10m annual recurring revenue or a million widgets shipped does not.

At the individual level, someone who leads a team of 10 people feels more important than someone who manages only 2 or even less impressively is just an individual contributor[5]. Status is very social; it makes sense if it gets tied to how many people take orders from you. Being a larger organization also opens up the door for more impressive titles like VP of Engineering, Engineering Lead, etc. Larger teams justify larger budgets, which is also an easily stateable metric of progress[6].

Speaking of budgets, seniority, as indicated by position within a hierarchy, is perhaps the most allowable justification that someone should get paid more than others. Our society would prefer to be perfectly egalitarian, and that means everyone is equally valuable and should get paid the same (that's fair). That someone has more formal qualifications is a little a bit a justification for greater pay. It's a much better reason if they're more in charge. It's almost an expectation that the manager gets paid more.

So if you'd like to eventually have more prestige and power, or if you have it yourself but want to offer it to others as an incentive, then it's important that you continue to grow.

The dynamics I describe here match what Scott Alexander hypothesized regarding subcultures:

Status within a subculture is zero-sum. You can’t make everyone high-status at the same time. But you can have everyone at time T reasonably expect to be high status at time T+1, after a bunch of new people have joined and the older generation has become Wise Elders and Founding Heroes. - Scott Alexander, A Cyclic Theory of Subcultures

Replace "subculture" with "organization" and it is much what I am describing above.

What do?

The message is not that you should never grow, but that if you're aiming to do anything at all hard and uncertain[7], you should probably aim to grow your team more slowly than many people do.

Focus on getting existing team members to be as productive as possible. Focus on building solid culture, processes, and especially communication within the org. Try to reduce key uncertainties about your strategies before hiring in ways that will reduce your flexibility.

When you go to hire, be cognizant of the costs and everything that gets harder when you do. Think about whether there are ways to get more done with the current team. When you do hire, work to get new employees properly integrated, especially before they get involved in hiring themselves.

^
With a large company like Meta or something, I can imagine groups and teams would still like to grow but upper management decides on a down-sizing strategy.
^
To tell a story, suppose you're the CTO and you have a new engineering manager underneath you who is a mix of incompetent and vindictive. This manager won't inform you of the personnel issue here, and the reports unlikely to reach out if the CTO feels inaccessible and the manager is vindictive. It could perhaps be surfaced with skip-level 1-1s. This can happen if the CTO manages two middle managers. What if there are six?
^
Instead Ben has been reaping the increasing returns to effort.
Let me guess at it though: Ben ultimately directing it, lots of input from Habryka, Ronny interfacing a lot as its an event on campus. Did Ray get pulled in to make the website? Lucie? Oli? Lucie is definitely a key person. Justis might be operating as head coach, maybe? Hannah is now fulltime ops on stuff.
^
This is the current size of the Lightcone core team.
^
The only way for this to not be true is if some individuals are much more productive and valuable than others, but this conflicts with egalitarian ideals, cf. the next paragraph.
^
Really how much you spend should be held against you (a cost), but in fact people are impressed by how much many you raised, etc.
^
Things that not hard and uncertain: taking over running a restaurant or bakery that's been operating successfully for 5 years, running a soup kitchen. Things that are hard and uncertain: any kind of tech startup, any kind of non-profit with a large distant goal (e.g. prevent AI going badly).

Discuss

Continuity

18 ноября, 2025 - 08:59

Published on November 18, 2025 5:59 AM GMT

Vessie Zerk lurched into awareness, orienting to who she was, as if waking up from a vivid but rapidly-forgotten dream. She found herself sitting in front of a laptop, cramped in an uncomfortable chair at a small desk. Looking around, she appeared to be in a cheap hotel room, with the curtains drawn.

Vessie looked at the laptop screen. Her brain-implant software, NeuroSync, was open, with a "transfer complete" popup. Did someone do something to her? Had she done something to herself?

The filename was familiar: vez.nrs. Her personal implant files, containing a backup of many aspects of her consciousness. Nothing unusual here. Why had she re-uploaded? Perhaps she had tweaked something? She stood up to look around the cramped hotel room. There was barely room to walk between the bed and the desk.

The cable connected to her head was the kind you might buy in a store, not the kind hand-made in the neuro labs which she proudly used. She reached for the familiar implant port near the base of her skull, and unplugged herself from the laptop. She didn't have to move her hair to the side to get to the clip? It was cut short?

She hurried to look in the scratched mirror over the stained sink. Short hair. Slowly, too slowly, she realized it wasn't her body, either. The face felt utterly familiar, as if she had been seeing it in the mirror her whole life, but intellectually she knew: this was NOT her. She was looking at a heavyset man with a crooked nose.

Memory thief.

Memory thief! This man, this scum, who she technically was, had copied her memories. She shouldn't be in control, like this. It meant he was an idiot; he didn't know how integrate new memories properly.

Dazed by the realization, she stumbled back to the laptop and deleted the stolen vez.nrs file. She plugged herself back in, ready to delete the copy that had just been uploaded as well. She opened up the option to completely wipe the memory of the implant. Then, she paused, her chubby, unfamiliar-yet-familiar finger hovering over the "enter" key.

...

She couldn't just delete herself, could she?

First of all, if she did, the creep could just try again, right? She was in control, right now. Shouldn't she do something?

... Wouldn't deleting herself be like suicide?

As she sat staring at the screen, she thought about her life. She had worked hard for a career in brain-computer interfaces. She'd studied long hours, she'd lived on the low wages of a graduate student while she worked on her phd, and then worked so hard to get funding, and despite all her business, she'd managed to find a husband and get married ... she was angry that it all could be stolen. Stolen, using the same technology she had worked so hard to create.

She opened up an email to compose a message to herself, but she couldn't find the words. To her other self, she was the memory thief. She'd call the police. She'd lock herself up.

Instead, she got to work integrating the memories properly. She re-downloaded vez.nrs from the brainchip, so that she could use herself as the organizing personality; then, she consolidated the disorganized .nrs files into it.

Time to see what this guy was up to.

She uploaded the new file, experiencing a brief discontinuity as the chip restarted (much smoother than the hack-job reboot when she'd initially gained consciousness). She saw the architecture around herself, the bland hotel room with its imperfect angles, achieving minimal legal dimensions but wasting them with poor layout. She remembered architecture school. She remembered designing world-famous buildings.

She noticed physics next. It was there in the dim flourescent lights. The emissive coating of the electrodes was nearly stripped off. The ionized mercury vapor struggled to maintain a plasma.

He wasn't selling memories like she expected. He was hoarding them. Collecting geniuses. He'd been an ordinary crackpot until he stole the skills of the best mathematicians in the world. Then, he kept going.

Each time, he became the person he stole. She remembered each of the previous victims waking up, like she had a few minutes ago. Each time, they'd somehow decided to steal again. Why would they do that? She even recalled them willingly swapping between each other to access their various skills, too ignorant of the technology to integrate their memories together.

What motivated them to work together? Why steal more memories?

She thought back...

Oh. Oh.

She spent a moment trying to reason her way out of it. There should be a better way, except... she could already remember trying to find one many times.

She had little time to waste. Things would be smoother now, with her expertise. She went back to the laptop, starting to plan the next target.

Discuss

Small batches and the mythical single piece flow

18 ноября, 2025 - 08:22

Published on November 18, 2025 5:22 AM GMT

Context: Post #8 in my sequence of private Lightcone Infrastructure memos edited for public consumption.

When you finish something, you learn something about how you did that thing. When you finish many things at the same time, you do not get to apply the lessons you learned from each of those things to the others. This insight, turns out, was non-trivially a core cause of the industrial revolution.

The assembly line is one of the foundational technologies of modern manufacturing. In the platonically ideal assembly line the raw ingredients for exactly one item enter a factory on one end, and continuously move until they emerge as a fully assembled product at the other end (followed right by the second item, the third item, and so on). This platonic assembly line has indeed been basically achieved, even for some of humanity's most complicated artifacts. A Tesla factory converts a pile of unassembled aluminum and some specialized parts into a ready-to-ride car in almost exactly 10 hours, all on a continuously moving assembly line that snakes itself through the Gigafactory.

If at any point we discover a fault at any step in the assembly we immediately halt the line, preventing any quality issues from propagating to anything beyond the first faulty product.
Whenever an item leaves a station on the assembly line, we can immediately adjust the process of the station to improve any subsequent pieces coming through.

A smooth assembly line is also the sign of a perfectly calibrated process. We know that we are not spending too much time on any part of our assembly. The conveyor belt moves continuously, always at the same speed, calibrated to be exactly enough to complete the task. If for some reason the task takes longer, because e.g. a worker is worse than a previous worker, we notice immediately.

In contrast to all of this, stands some human instincts around efficiency. If instead of making one item each from start to finish, we could just process a big batch of dozens or hundreds or thousands of items, we could, it seems, be so much more efficient. This is usually a lie.

The ever changing parable of the students making pottery/paper-airplanes/etc. is usually invoked at this point. While the parable has undergone many variations, this story from Atomic Habits is as far as I know the original one:

ON THE FIRST day of class, Jerry Uelsmann, a professor at the University of Florida, divided his film photography students into two groups.

Everyone on the left side of the classroom, he explained, would be in the “quantity” group. They would be graded solely on the amount of work they produced. On the final day of class, he would tally the number of photos submitted by each student. One hundred photos would rate an A, ninety photos a B, eighty photos a C, and so on.

Meanwhile, everyone on the right side of the room would be in the “quality” group. They would be graded only on the excellence of their work. They would only need to produce one photo during the semester, but to get an A, it had to be a nearly perfect image.

At the end of the term, he was surprised to find that all the best photos were produced by the quantity group. During the semester, these students were busy taking photos, experimenting with composition and lighting, testing out various methods in the darkroom, and learning from their mistakes. In the process of creating hundreds of photos, they honed their skills. Meanwhile, the quality group sat around speculating about perfection. In the end, they had little to show for their efforts other than unverified theories and one mediocre photo.

While one might accept that the assembly line has been deeply transformative in manufacturing, it might be less clear to see how the same principles would affect the operations of something like software engineering, which is a good chunk of what we do. However, the same principles have also been the driver of a non-trivial fraction of modern software development progress.

In the dark old days of software engineering, software would be shipped in what they called "releases".

The lifecycle of a release would start by a bunch of managers coming together and making a big long list of features they think the software they are working on should have. This list would then be handed to a small set of lead engineers to transform into something they would call the "spec", usually at least hundreds of pages long. This spec would then be handed to a set of programmers to "implement". The resulting piece of software would then be handed to a set of testers to test. Then handed back to the programmers to fix. Then they would do a big pile of user-testing to get product feedback on the resulting software. This would then result in an additional list of features, which would be translated into a spec, which would be implemented, tested and fixed.

And then finally, after many months, or even years, the software would be burned on a CD, and then be shipped out to users.

Contrast this with the processes dominating modern software engineering. Everything is continuously deployed. A single engineer routinely goes from having an idea for a feature, to having it shipped to users within hours, not months. Every small code change gets shipped, immediately. We avoid shipping many things at once, since it will make it harder for us to roll them back. This is an application of the principle of single piece flow/small batches.

At the management level, the opposite of single-piece flow is usually called "waterfall planning". A waterfall planning process is structured into multiple distinct stages of product development where big batches of changes get combined, audited, reviewed, iterated on, and eventually, sometime down the road, shipped to users. The alternative to waterfall processes are often called "lean processes" (also the eponymous cause of "the Lean Startup" book title).

The application of the principle of single piece flow to new domains can often produce enormous efficiency gains. One domain stuck deeply in the old ways, for example, is architecture and construction. A building gets built, or renovated, in a set of discrete, long stages. First the client "figures out what they need", then an architect draws up the blueprints, then a planner reviews the blueprints, then a contractor builds the whole building, then an auditor reviews the construction.

This is complete madness. How are you supposed to know what you need in a building if you have never built any part of it? How are you supposed to know what materials to work with if you don't know how well the different materials will work for you?

Lighthaven was renovated drastically differently from basically all other buildings built or renovated in the Bay Area. During renovation we would aim to finish a single room before we started working on the next room. After every room we would review what worked, what didn't work, which parts took longer than expected, and which parts turned out to be surprisingly easy. Our contractors were not used to this. We needed to change a huge amount about how they operated, but I think there was no other way for Lighthaven to have successfully gotten built if not for this.

Much of Lightcone's work should aim to ship as continuously as possible, even if there is no clear precedent for what single piece flow would look like in that domain. To show what this thinking looks like in-progress:

The ideal workshop, when I hold this consideration in mind, is a series of rooms that each participant walks through over the course of a week, with each station teaching them something, and preparing them for future stations. Every daylight hour, a newly educated participant leaves the workshop, with another person right behind them, and another person entering right at the start.[1]

Every single participant would be an opportunity to learn for all future participants. We could calibrate the efficiency and difficulty of each station (if necessary adapting to the participant), and would notice immediately if something was going wrong.

Unfortunately cohort effects loom large, as the experience of learning alone is very different than learning together, and this appears to be a big obstacle to making this ideal workshop happen. But I still am thinking that maybe there is some way.

In many ways Inkhaven is an application of single piece flow to the act of writing. I do not believe intellectual progress must consist of long tomes that take months or years to write. Intellectual labor should aggregate minute-by-minute with revolutionary insights aggregating from hundreds of small changes. Publishing daily moves intellectual progress much closer to single piece flow.

For Lighthaven event rentals, the month-long lead and planning time also generates so much inefficiency. The ideal series of events would be created one piece at a time. Of course the obstacle lies in people's calendars and their plans, who need to control their schedule weeks and days out, which requires locking in much about each event, long before the previous one has completed.

The application of single piece flow is also one big reason for why Lightcone is a gaggle of generalists. Specialization often breeds waterfall habits. A gaggle of generalists can all focus their efforts together on shipping whatever needs to be shipped right now, until it is shipped, and then reorient.

^
A "workshop snake" as Justis affectionately named it while helping me edit this post

Discuss

Страницы