RSS feed aggregator

The main impact from automated AI production: concentration of power?

LessWrong.com news - 31 May 2026, 23:42

There’s a lot of talk about automated AI R&D and the like. It’s been discussed since at least 1965 when statistician I.J. Good coined the term ‘intelligence explosion’:

an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make. — Good, 1965, Speculations concerning the first ultraintelligent machine.

Since the 2000s, a few researchers (recently increasingly many) have taken seriously the prospect that some kind of intelligence feedback mechanism could kick in around the time that AI becomes competent enough to contribute to its ‘own’ (or rather, its successors’) development.

This is increasingly recognised as a prospect we should take seriously. It’s been discussed somewhat furtively by leading AI developers for many years, but in late 2025 it became a public talking point. Sam Altman, CEO of OpenAI, recently set a target of fully automated AI-improving-AI by 2028. Jack Clark, co-founder of Anthropic, writes ‘AI systems are about to start building themselves’.

I’m not going to address the resulting prospect of speedup here, which I think is real, though perhaps surprisingly modest.[1] (If, like some of my friends, you think I’m underweighting this prospect, you should probably be even more concerned about concentration of influence.[2])

Rather, this note is briefly making the rather simple (but I think rather overlooked) observation: plausibly automated AI R&D would enable a (perhaps severe) concentration of influence… and this could be the most important effect to be paying attention to.

It’s a fairly obvious point once made. A more thorough analysis might shed light on how likely this is, the most important consequences, and any leading indicators to pay attention to. I’ll only make some starting gestures toward these.

As I presaged in Engineering a Safer World,

R&D automation…: fewer human participants means fewer whistleblowers, less internal scrutiny, less governance and decisionmaking robustness. More concentration of that influence. That could mean more single points of failure. It’s famously difficult to maintain a conspiracy of more than one or two people: ‘two can keep a secret if one is dead’, as they say! And besides conspiracy, compared to larger teams, individuals and small groups may be far more susceptible to capture, coercion, corruption, or plain foolishness and rash decisionmaking.

Whose influence?

To be clear, you don’t need to be concerned about any specific people currently in positions of particular influence to think that concentration could be very impactful (and concerning).

The wisest supreme leader is still a single point of failure[3]. Especially during a period of tumult, leaders can be displaced and reporting structures revolutionised. Concentration of influence — even the prospect of that concentration — simultaneously raises the incentives for corruption and power struggle (while raising the stakes for the rest of us). Meanwhile, power corrupts and a centralised decisionmaker can rarely account for all the relevant considerations, even if they mean to. In AI, we’ve already seen company leadership falling out over differences of opinion, fighting it out in the boardroom and in court, and even clashing with governments. Turf war over the means of production of AI (which may increasingly equal the means of production full stop).

The point is that losing checks and balances can be a problem whoever is at the top[4]. As in politics, so in the heights of industry — which, today, and increasingly, firmly includes AI.

Suffice to say that I imagine the kinds of roles in the AI oversight chain which might be (nominally, initially) preserved longest are senior research directors, company executives, and closely-involved regulatory or executive government officials.

Perhaps more important: whose influence is lost? Directly, the skilled human labour which today is involved in researching, engineering, and deploying frontier AI systems. Via them, their wider research networks, social ties, and any societal oversight they might have provided (e.g. whistleblowing to journalists or authorities). Plausibly also the marketers, account managers, even lawyers and others who interface with what a company is doing with its AI products, and their networks. (If not themselves replaced, they’d lose collegial understanding and influence via the replaced researcher-implementers.) In short: everyone else.

Sketched using claude.ai

Concentration without (overt) malice

Why shed the researcher-implementers, the deployment engineers, and others (or sideline them from the important ‘real’ responsibilities)? It doesn’t have to be a power-grab, at first (or even at all) — it’s just good business! Human employees (especially software experts) are incredibly expensive compared with AI.[5]

Beyond this, analogously to areas where AI already surpasses humans (like chess), human participants may positively get in the way, slowing and compromising what would not only be a cheaper, but a more effective AI-only workflow.

A company might choose to instead eat this cost — perhaps it prefers to exhibit loyalty to current staff, acknowledges concerns around erosion of human oversight, or is inherently conservative in that way. But in a competitive environment (either commercially or in strategic and security terms), that may be tantamount to inviting defeat or irrelevance, if competitors are charging ahead.

At some point shareholders and other stakeholders might think it irresponsible not to embrace a replacement AI workforce. Ultimately this could add up to far fewer people having insight into what’s going on or a say in its direction. Even if you can’t (or don’t want to) immediately fire them, you might deliberately or even passively sideline your human employees as more and more work is carried out by machines.

Influence over what?

Notably, boards, investors, auditors, journalists, and regulators (who already often suffer from information scarcity), along with the broader society they represent, could also be substantially cut out if logging, analysis, PR, and incident reporting are automated. This might only require the usual levels of corporate paltering rather than special malice. It might even be prosocially motivated if the relevant actors are (justifiably or not) concerned about a power-grab by government. (It could of course also be deliberately selfish.)

Conversely, a government or regulator which strong-arms its way into the midst of the development process might itself become the locus of concentrated influence. (This in turn might be superficially or genuinely motivated by national security concerns.) The more automated a firm, the easier it may be to seize, either overtly or through more subtle channels. When employees are critical pieces in a production process, that’s a taller order.[6]

Sketched using claude.ai

I’m not suggesting any of this leads (immediately) to influence over all the world (though if you do take over the world, please consider my friend Owen’s advice for what to do if you find yourself in that situation). Nor even over an entire economy or state. But minimally, this looks like highly concentrated influence over a frontier AI company: presently a rare, increasingly lucrative, and burgeoningly politically-influential machine.

Activity at that frontier — the decisionmaking by leading AI developers — stands a chance of being among the most economically, politically, and strategically important activity in the coming years. If the cat is out of the bag with AI, that frontier, steered wisely, might be an important part of a societal resilience and readiness process (compare the recent Project Glasswing in cybersecurity). Used foolishly or even maliciously, the frontier might degrade societal capacity right at a time when it could be most important.

Influence… or power?

To repeat: I don’t think it’s given (or even particularly likely) that the starting possessors of a hypothetical concentration of influence would remain in control. From the point of view of a rogue AI, for example, control over a frontier AI company might be a delicious opportunity for expansion, defence, and societal dependency — ideal routes to escalated un-unpluggability. Likewise from the point of view of a power-hungry politician or corporate psychopath. Notice that corruptibility and coercibility are perhaps more plausible with fewer hands: it’s less practical to threaten a whole team or company than a single person.

So if this hypothetical automation-driven concentration of influence is a concentration of power, I don’t expect the ultimate wielder to be worthy of it. This marks a true difference of opinion with some others I know, who might think it more likely that, say, some current AI company leader could both accrue and keep hold of this level of power, and also use it wisely.

The obvious paths to hard power via AI are in surveillance — newly scalable with AI analysts scrutinising as many people’s movements as desired — and in development and automated production of flexible, high-range tools of coercion like weaponised drones. Other kinds of power might come from positions of economic leverage or narrative and propaganda dominance, conceivable with advanced-enough AI and without societal defences.

Concretely, what might be different?

Let’s really quickly gesture at two cases which might be substantially different in a world where AI production is largely automated.

Imagine a near future: frontier AI company revenues are in the 100s of $billions or $trillions, AI services are used in civil services, executive decisionmaking, faith leadership, … That’s a lot of points of leverage for subtle manipulation. Suppose a leader (selfishly or under pressure) intends to inject secret loyalties into their products. If they’re sole controller without checks and balances, that could be trivial, with obvious concerning consequences. Without automation-replacement, there are perhaps dozens or hundreds of people with reasonable oversight per major release, and several people even for any given small change. That’s a far more difficult system to twist.

Alternatively, consider a company on the brink of truly autonomous, general, human-surpassing AI: a huge responsibility. Many ambitious but underwrought alignment targets have been suggested for such a project, many of them foolish, perhaps fatally so. A leader hubristic without serious checks might withdraw from critique, double down (perhaps sycophantically egged on by AI toadies), and drive through a ruinous agenda. Society at large might be none the wiser (or at least insufficiently alert and able to intervene). A leader willingly taking risks (perhaps with a selfish or misguided vision of unfathomable rewards) would be even more of a problem. On the other hand, a leadership more dependent on staff and teams may be subject to more psychological and cognitive support, scrutiny, moral encouragement, and so on. A literal healthcheck. Of course, human teams can be subject to concerning groupthink (not unheard of at AI development companies!), but the more off-base a suggestion is (ethically or pragmatically), the more likely it is that the collective intelligence of the team corrects it.

Any number of scenarios can be considered; rarely do they appear more promising from a societal perspective if a single or a few leaders are insulated from scrutiny and critique.

What’s the outlook?

It looks very unclear to me. Plausibly it’s in the balance and decisions and conversations now will shape how this plays out!

There are some reasons to think that (near-) automation of AI production could instead produce equal or even improved societal oversight.

For one, a sometimes unintuitive effect of automation is to increase per-worker productivity, making it more worthwhile to bring more humans into the process (even on myopic financial grounds). Historically, though, this effect has often presaged a sharp decline in human involvement once automation reaches a sufficient level. This might make concentration actually more difficult for a time. I don’t expect this to play out for very long in AI frontier research[7], but I’m somewhat uncertain about this.

Directed sensibly, a glut of researcher-grade AI worker-equivalents could be tasked with analysis, scrutiny, decisionmaking robustness, even auditing and whistleblowing. There’s nothing in principle preventing this, for any given level of trustability in our AI. Perhaps we’ll end up bringing more people into the process in the increasingly important, complementary roles of oversight, decisionmaking, and direction-setting. These folks could be made more effective with AI tools and assistance. That’s a decision to be made.

As I suggested earlier, leaders perhaps soon faced with these temptations may balk, either at the moral prospect… or at the pragmatic prospect of being subsequently replaced, captured, or corrupted in the ways I hinted at.

Perhaps it’s a tenuous balancing act. Internally to a given AI producer, we might want to guarantee a large enough ‘in the loop’ workforce that we can trust internal deliberation and whistleblowing and so on to keep things on the rails. Outside that, we might want to avoid no-holds-barred proliferation of AI-producing capability — but we probably want at least a few developers to keep each other in check and reduce monopoly or kingmaker dynamics.

I don’t expect any such concentrations of influence to play out overnight. Loss of control is a process, not a moment. AI development organisations — and society at large — should be paying attention and having the necessary conversations. Much to look out for, and perhaps much to look forward to!

  1. Task-relevant data and compute look like the more biting bottlenecks, and research taste accrues mainly through expensive experimentation.

  2. (Unless you expect such an accelerated singularity that prior influence is all that matters.)

  3. And may I be forgiven for slandering the current crop of potential supreme leaders as not very wise.

  4. I would note that we usually need some concentrated executive and representative power in various roles in society, but we want those roles to be sufficiently monitored, and we want the processes which move people into (and out of) them to be sufficient to produce worthy selections.

  5. This ‘humans expensive, AI cheap’ point is worth taking some care over. Contemporary frontier AI training is getting more expensive by the year. Running the best AI can be increasingly expensive, because ‘thinking harder’ is an effective way to scale capabilities on current margins. But once a capability is unlocked at the frontier, it typically becomes rapidly and radically cheaper, due to ongoing rises in compute efficiency and the ability to distill long thoughts into quicker reflexes. Further, you only need to train AI to do something once, and then it can be copied and run as many times as you want; it doesn’t need to sleep, take family time, or get sick.

  6. For example, witness the periodic furore among tech company employees when their work is used toward military ends (latest).

  7. One reason is that this research does not scale very well in parallel, so returns diminish steeply per worker. We’re also already imagining AI that can substantially slot in as a cheaper replacement worker in most cases anyway.




A Song About No

LessWrong.com news - 31 May 2026, 23:40

When Lily was about two she told me she wanted a song about "no". This was ten years ago, and I don't remember why she wanted this, but I made something up:

This is a song about no.
This is a song about no.
This is a song about no, no, no.
This is a song about no.

The song goes... No no no no no no no no no.
The song goes... No no no no no no no no no.
The song goes... No no no no no no no no no, no, no.
This is a song about no.

It's useful anytime the kids want a song about something I don't know a song about. For example, it often served as a song about "turtle". Of course the more syllables the subject has the harder it is to sing, but that just makes it more fun.


I applied my nascent music writing skills, and tried to set it down in dots:




Visualize Cyclical Structure in Llama Model

LessWrong.com news - 31 May 2026, 22:39

Summary

Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I find it worth exploring what structures we can find for different concepts in LLMs. I made a tool allowing anyone to generate and visualize 3D structures for an arbitrary concept. The tool currently supports cyclical and smooth structures, extracting activations at any given layer of the Llama-3-8b model. Check it out here:

https://neural-geometry.vercel.app

Introduction

Traditionally, mechanistic interpretability has focused on steering models with linear direction vectors. Underpinning this is the Linear Representation Hypothesis, which states that all features in language models are collections of sparse linear directions. While linear activation steering has proven useful for shifting model behavior, it can also produce unexpected behavior. When the magnitude of the steering coefficient (the scalar multiplying the direction vector) becomes too large, the language model often breaks down, producing garbled and incoherent text. Personally, I do not think that the linear representation hypothesis is sufficient to explain what is going on mechanistically in these large language models.

Recent research from Goodfire has focused on exposing the smooth geometric structures that form for certain ordered concepts, such as days of the week, seasons, age, temperature, and more. Recurring ideas like days of the week and seasons form smooth cyclical structures, whereas linear concepts such as temperature and age form arcs. The work here is directly inspired by Goodfire's paper and research.


Extracting Activation Vectors

One of the most important steps in this pipeline is extracting meaningful activation vectors from carefully constructed prompts. For example, Llama may be prompted, "What day comes 3 days after Tuesday?" or "What day comes before Thursday?" Ground-truth labels are simply the correct answer to the prompt. Activation vectors are extracted from the last token to capture maximal semantic meaning.

The neural geometry tool uses the GPT-4o-mini model to generate a multitude of prompts with ground-truth labels for any given concept. This style of prompt is often favored for cyclical and linear concepts, where semantic meaning can easily be captured. One potential issue is that the wording itself may impose a cyclical structure where there otherwise would be none. For example, prompts such as "What day comes two days before Wednesday?", repeated across every weekday, explicitly ask the model to step around the week. It would be interesting to explore whether the same structure emerges when a different style of prompting is used.
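To make the prompt format concrete, here is a minimal sketch that builds day-of-the-week prompts with ground-truth labels from a fixed template. This is an illustration only: the tool itself generates prompts with GPT-4o-mini, and the `make_day_prompts` helper and its `max_offset` parameter are hypothetical names introduced here.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def make_day_prompts(days, max_offset=3):
    """Build (prompt, label) pairs for stepping forward and backward in the week.

    Hypothetical template-based stand-in for LLM-generated prompts.
    """
    pairs = []
    n = len(days)
    for i, day in enumerate(days):
        for k in range(1, max_offset + 1):
            unit = "day" if k == 1 else "days"
            # The modulo wraps around the cycle, e.g. Sunday + 1 -> Monday.
            pairs.append((f"What day comes {k} {unit} after {day}?",
                          days[(i + k) % n]))
            pairs.append((f"What day comes {k} {unit} before {day}?",
                          days[(i - k) % n]))
    return pairs

prompts = dict(make_day_prompts(DAYS))
print(prompts["What day comes 3 days after Tuesday?"])  # Friday
```

Note that the labels here come from modular arithmetic over the week, which is exactly the kind of built-in cyclicity the concern above is about.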

Projection Down to 3D

Raw activation vectors are first projected down to a 64-dimensional space. Here, the vectors are averaged for the same ground truth labels to produce centroids. These centroids are more stable in terms of their semantic meaning than any individual vector alone.

We do not project directly down to 3D; 64D serves as an intermediate step. The purpose of projecting down to 64 dimensions is to remove most of the noise present in a high-dimensional space such as the 4096-dimensional activations. Important semantic meaning may only be encoded in a small number of dimensions. Computing centroids in 64D is also much more computationally stable than computing them in 4096D. The choice of 64 is somewhat arbitrary; 32 and 128 also work fine.

A smooth curve is fitted through the 64-dimensional centroids and sampled at 300 points. The resulting spline curve is then projected down to 3D via PCA. This is the smooth curve displayed on the website once the full pipeline has finished.
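The steps in this section can be sketched end to end with NumPy. This is a rough sketch under stated assumptions, not the tool's actual code: the activations are synthetic, the first reduction to 64D is assumed to be PCA (the post does not say which projection is used), and the curve is a closed Catmull-Rom spline chosen for simplicity. All function names are hypothetical.

```python
import numpy as np

def pca_project(x, k):
    """Project the rows of x onto their top-k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def label_centroids(x, labels, n_labels):
    """Average the vectors that share a ground-truth label."""
    return np.stack([x[labels == i].mean(axis=0) for i in range(n_labels)])

def closed_catmull_rom(points, n_samples):
    """Sample a closed Catmull-Rom spline passing through the points in order."""
    n = len(points)
    t = np.linspace(0.0, n, n_samples, endpoint=False)
    seg = np.floor(t).astype(int)   # segment index for each sample
    u = (t - seg)[:, None]          # position within the segment, in [0, 1)
    p0, p1 = points[(seg - 1) % n], points[seg % n]
    p2, p3 = points[(seg + 1) % n], points[(seg + 2) % n]
    return 0.5 * (2 * p1 + (p2 - p0) * u
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u**2
                  + (3 * p1 - p0 - 3 * p2 + p3) * u**3)

# Synthetic stand-in for last-token activations: 7 labels on a noisy circle
# embedded in a random 2D plane of a 4096-dimensional space (assumption).
rng = np.random.default_rng(0)
n_labels, per_label, dim = 7, 20, 4096
labels = np.repeat(np.arange(n_labels), per_label)
angles = 2 * np.pi * labels / n_labels
plane = rng.normal(size=(2, dim))
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1) @ plane
activations = circle + 0.5 * rng.normal(size=(len(labels), dim))

reduced = pca_project(activations, 64)                  # 4096D -> 64D
centroids = label_centroids(reduced, labels, n_labels)  # one point per label
curve64 = closed_catmull_rom(centroids, 300)            # smooth closed curve
curve3d = pca_project(curve64, 3)                       # 64D -> 3D for display
print(curve3d.shape)  # (300, 3)
```

A closed spline only makes sense for cyclical concepts; for arc-like concepts such as age or temperature, an open curve through the ordered centroids would be the natural substitute.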

Conclusion

Overall, I find this to be one of the more interesting research directions coming out in the field of mechanistic interpretability. While linear directions have been shown to causally influence model behavior, assuming that all features are simply linear direction vectors seems to be a nascent theory of interpretability and alignment in general. Rather, the structures within language models are likely much more complex and have nonlinearities, which is why models break down when steered too far. I am excited to see where this research leads in the future. In the meantime, I hope this tool can make this field of research more accessible for everyone.












Why I think evals are pretty important and most worth working on (for me)

LessWrong.com news - 31 May 2026, 22:31

An application response I wrote! Feel free to leave feedback!


What are you most concerned about when it comes to risks from AI?

I’m most concerned that many people will be harmed very soon, and particularly that we won’t know why. Since politics and government dictate public life, solving said harm would require palatable translations of technical and sociological knowledge by experts for institutional changemakers to act on.

However, evaluations meant to extract such knowledge are systematically unreliable. In Anthropic’s BrowseComp report, the model under test “benchmaxxed” by independently both achieving eval awareness and managing to scrape the specific eval it was being tested on. Human-designed audits may also structurally signal that an evaluation is under way (Gao and Kreiss), with distribution shifts making evaluation paradigms systematically inaccurate for general, out-of-lab use. Even seemingly optimistic advancements (Constitutional Classifiers) demonstrate the insufficiency of pure output-level evaluations, as safety now necessitates interpretability of internal activations.

Capabilities risks in particular are also accelerating fast, enough to potentially saturate even robust metrics like METR’s time horizons (Cotra). Certain MLE advancements (Joo et al.) are regarded as “surprisingly” effective; implicitly, they were neither designed for nor necessarily predictable from first principles. New findings like the Platonic Representation Hypothesis even conjecture that unintended capabilities improvements are actually systematic, as multimodal models converge to a shared statistical representation of reality. Some scholars (LeCun et al.), by contrast, contend that instead of convergence to general intelligence, various models trained to various specialisations will constitute a more legible and steerable “Superhuman Adaptable Intelligence”. This, however, is exactly my concern: not only are domain-specific superhuman intelligences structurally impossible to oversee (novice-grandmaster problem), but the aforementioned evidence shows broad, unintended capabilities may arise from ostensibly unrelated changes or optimisations. Capabilities aren’t decomposable into what we can measure, and certain framings of AGI (or “SAI”) like LeCun’s might obscure that fact.

AI systems as information/thought filters have sweeping social impacts; empirical studies show systematic LLM bias in news summarisation (Savgira et al.). Widespread adoption may enhance manipulation of public opinion and structurally constrain “responsible AI” within institutional profitability (Mitra). Broadly, my concerns regard our capabilities/risk evaluation methodologies and ontologies being systematically wrong. Without enforcing reliable ground truths, we risk suffering every technical problem at once as scientific voices may fail to move institutions away from trajectories of harm.


Works Cited

BrowseComp: https://www.anthropic.com/engineering/eval-awareness-browsecomp

Gao and Kreiss (case study of gender bias in LLMs): https://arxiv.org/pdf/2509.04373

Constitutional Classifiers: https://www.anthropic.com/research/next-generation-constitutional-classifiers

Cotra: https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities

Joo et al.: https://arxiv.org/pdf/2602.15322v1

Platonic Representation: https://arxiv.org/pdf/2405.07987

LeCun et al.: https://arxiv.org/abs/2602.23643v1

Savgira et al. (no link): What Stays and What Goes: Auditing the Impact of LLM Summarization on News Partisanship. Pavel Savgira, Elisa Kreiss, Homa Hosseinmardi. CHI conference on Human Factors in Computing Systems: Late Breaking Work 2026.

Mitra: https://disjunctionsmag.com/articles/why-leaving-big-tech/





Financial Costs of an AI Pause?

LessWrong.com news - 31 May 2026, 21:55

I’ve analyzed the near-term economic effects of an AI pause, out of concern for my investments, and a desire to predict how strong political opposition to a pause is likely to be.

My median estimates: The S&P 500 will drop 27.8%. AI subsectors will drop 34-69%. Interest rates will rise at a much slower rate than would be the case without a pause.

The specific numbers depend on some fairly arbitrary assumptions. So please read this post in order to get a feel for how the results depend on the assumptions. I’ve tried to keep the assumptions reasonable, but some of them will prove to be wrong. My most controversial assumptions reflect an expectation that both markets and voters will be surprised at how powerful AI is, mainly in 2027.

For the full model, along with many explanatory comments, see the Python source code here (zip file).

This conversation with Claude clarifies my reasoning in more detail than most people will want.

Sensitivity to Assumptions

Here’s how my model says the impact is influenced by changes in assumptions. Numbers are for the immediate change in the S&P 500 (rows labelled SP500) and in the stocks of hyperscalers (rows labelled Hyper: Microsoft, Google/Alphabet, Amazon, Meta, and Oracle).

AI economic centrality

      | low (0.4) | medium (0.7) | high (0.95)
----------------------------------------------
SP500 | -22.5%    | -27.8%       | -31.8%
Hyper | -26.8%    | -34.1%       | -40.4%

How central is AI to economic growth? Low means AI is one technology among several. High means AI is the dominant driver of growth, and interest rates are pushed up by massive AI-related capital demand.

training dependence

      | low (0.25) | medium (0.5) | high (0.8)
----------------------------------------------
SP500 | -22.2%     | -27.8%       | -31.2%
Hyper | -26.3%     | -34.1%       | -38.9%

How much of AI’s near-term economic value requires new frontier training runs? Low means most value comes from deploying and refining existing models. High means the next major value unlocks require fundamentally new capabilities.

S&P 500 immediate impact: AI centrality × training dependence

SP500           | Train: low | Train: medium | Train: high
----------------------------------------------------------
Central: low    | -18.9%     | -22.5%        | -25.0%
Central: medium | -22.2%     | -27.8%        | -31.2%
Central: high   | -24.6%     | -31.8%        | -35.9%

post pause cognitive improvement

      | weak   | moderate | aggressive
--------------------------------------
SP500 | -24.5% | -27.8%   | -30.3%
Hyper | -29.5% | -34.1%   | -37.7%

I’m predicting that at the end of the pause, AI training would resume with some moderate regulation. This variable captures how much progress to expect compared to a no-regulation scenario. I’m using 75%, 50%, and 30% of unregulated progress for the weak, moderate, and aggressive post-pause regulations respectively.

compute growth rate

      | 40%    | 63%->40% | 80%
----------------------------------
SP500 | -24.1% | -27.8%   | -30.8%
Hyper | -29%   | -34.1%   | -38.4%

This variable describes the hardware constraints on AI growth. It represents how fast the available compute would increase given unlimited demand. The middle column assumes that growth gradually slows between 2028 and 2040 from 63% to 40%.

pause duration

      | 1 year | 2 years | 4 years
----------------------------------
SP500 | -27.5% | -27.8%  | -29.6%
Hyper | -33.7% | -34.1%  | -36.7%

task density

      | low    | moderate | high
----------------------------------
SP500 | -27.1% | -27.8%   | -28.4%
Hyper | -33.2% | -34.1%   | -34.9%

High task density means there’s still plenty of low-hanging fruit, and we haven’t yet reached the steepest part of the S-curve.
Low task density means we’re using up the low-hanging fruit, and we’ve passed the steepest part of the S-curve.

What Kind of Pause?

I assume that governments decide in late 2027 to treat AI as slightly more dangerous than nuclear weapons, due to some combination of job displacement and accidents that are more concerning than the one depicted in the movie 2001: A Space Odyssey.

I focus on a scenario where an international agency enforces drastic limits to AI development for two years, starting at the beginning of 2028. During 2028 there is an expectation that significant AI development will resume in 2030. I will focus my analysis on the economic effects during 2028, and assume that actors during 2028 will only have rough guesses as to how fast development will be allowed to proceed in 2030. I will assume that their mean forecast involves some sort of resumption of progress, but little confidence in full-speed development being allowed in 2030.

The pause will restrict all datacenters more powerful than a certain threshold, roughly corresponding to the level of the best AI which was released in 2026. The details of the pause will depend somewhat on insights that won’t become available until we know more about how AI is progressing in 2027.

Instead of predicting what the pause will apply to or how it will be enforced, I’ll predict that it will be effective enough to slow AI capability progress by a factor of 5 compared to what it would be in the absence of regulation. Since regulation will be imperfect at distinguishing between harmless research and the research it intends to pause, I’ll estimate it causes a 15% slowdown in other high performance computing.

Impact

I assume the initial effect of the pause will be a 90% reduction in spending to train AI. That effect will be somewhat offset by lower prices on that compute stimulating increased demand for inference.

I assume that GDP growth rates under a no-pause scenario would gradually rise to 30% by 2040. This is more a reflection on what markets would predict in 2028 than a genuine estimate of what I expect would happen with no regulation. I see a fair amount of room for higher growth rates.

I predict interest rates in 2029 of roughly 7% with a pause, compared to 11% without any AI regulation.

I predict that robotics progress will continue to have roughly the same increases in economic impact that it would have had without regulation. I’m moderately confident that AI already has nearly enough general intelligence for robotics to have transformative impacts on the economy, and that the remaining engineering that is required is ordinary enough to only be slowed down a little by the pause. That slowdown will be offset by the pause reducing the extent to which AI training competes with robotics for resources.

I’m assuming that financial markets are mostly rational, and will adjust price/earnings ratios mainly in reaction to predicted growth rates and interest rates. I assume markets will briefly over-react to a pause due to increased risk aversion and margin calls. I assume that pre-pause market levels would not be considered to be bubble-like under a no-pause scenario.

Here’s a more detailed set of predictions for the median set of my model’s assumptions:

Model Output: Executive Summary

Pause period: 2028–2030
Post-pause cognitive improvement: 50% of unrestricted rate
Model horizon: 2040

Net present value of foregone AI revenue: $ 111.93T
Implied AI sector market cap loss: $ 839.51T

S&P 500 immediate impact: -27.8%
S&P 500 after one year: -17.6%
Immediate market cap change: $ -20.02T
One-year market cap change: $ -12.69T

AI High-Growth Segment Revenue Trajectory

Year | No Pause Growth | With Pause Growth | Diff
--------------------------------------------------------------------------------------
2025 | $ 350B 63.0% | $ 350B 63.0% | $ 0B
2026 | $ 570B 63.0% | $ 570B 63.0% | $ 0B
2027 | $ 930B 63.0% | $ 930B 63.0% | $ 0B
2028 | $ 1.52T 63.0% | $ 911B -2.0% | $ 604B << PAUSE
2029 | $ 2.43T 60.5% | $ 1.22T 33.4% | $ 1.22T
2030 | $ 3.84T 58.0% | $ 1.67T 37.7% | $ 2.17T
2031 | $ 6.01T 56.4% | $ 2.29T 36.7% | $ 3.72T
2032 | $ 9.31T 54.8% | $ 3.10T 35.6% | $ 6.20T
2033 | $ 14.26T 53.2% | $ 4.18T 34.6% | $ 10.08T
2034 | $ 21.61T 51.6% | $ 5.58T 33.5% | $ 16.04T
2035 | $ 32.42T 50.0% | $ 7.39T 32.5% | $ 25.03T
2036 | $ 47.98T 48.0% | $ 9.69T 31.2% | $ 38.29T
2037 | $ 70.05T 46.0% | $ 12.59T 29.9% | $ 57.46T
2038 | $ 100.88T 44.0% | $ 16.19T 28.6% | $ 84.69T
2039 | $ 143.25T 42.0% | $ 20.61T 27.3% | $ 122.63T
2040 | $ 200.55T 40.0% | $ 25.97T 26.0% | $ 174.57T
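
The no-pause revenue column in the table above can be reproduced by compounding the 2025 base with each year's listed growth rate. This is a minimal Python sketch, assuming (my reading of the table, not something the post states) that the growth rate shown for a year is the growth from the previous year into that year. The name `compound` is hypothetical.

```python
# Growth rates copied from the no-pause column of the table, 2026 through 2040.
NO_PAUSE_GROWTH = [0.63, 0.63, 0.63, 0.605, 0.58, 0.564, 0.548, 0.532,
                   0.516, 0.50, 0.48, 0.46, 0.44, 0.42, 0.40]


def compound(base, growth_rates):
    """Return the revenue path that starts at `base` and applies each rate in turn."""
    path = [base]
    for rate in growth_rates:
        path.append(path[-1] * (1.0 + rate))
    return path


# Revenue in $ billions, starting from the 2025 figure of $350B.
no_pause = compound(350.0, NO_PAUSE_GROWTH)

# Assumption: the pause scenario shares the 2025-2027 path and diverges in 2028,
# where the table lists -2.0% growth.
pause_2028 = no_pause[2] * (1.0 - 0.02)

for year, revenue in zip(range(2025, 2041), no_pause):
    print(f"{year}: ${revenue / 1000:.2f}T")
print(f"2028 with pause: ${pause_2028:.0f}B")
```

Under that reading the sketch lands within rounding of the table: roughly $200.5T in 2040 without a pause, and a 2028 gap of roughly $604B.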

Sector Market Cap Impacts

Sector          | Pre-Pause | Immediate | Immediate % | After 1yr | After 1yr %
----------------------------------------------------------------------------------
Semiconductors  | $ 8.00T   | $ -4.50T  | -56.2%      | $ -3.96T  | -49.5%
Hyperscalers    | $ 22.00T  | $ -7.51T  | -34.1%      | $ -5.23T  | -23.8%
Frontier Labs   | $ 3.00T   | $ -2.07T  | -68.9%      | $ -1.92T  | -63.9%
AI Applications | $ 4.00T   | $ -1.74T  | -43.6%      | $ -1.36T  | -34.1%
Non-AI S&P 500  | $ 35.00T  | $ -4.20T  | -12.0%      | $ -215B   | -0.6%
----------------------------------------------------------------------------------
TOTAL (S&P 500) | $ 72.00T  | $ -20.02T | -27.8%      | $ -12.69T | -17.6%

Implications

Political pressure for AI regulation is building as increasingly impressive evidence of AI capabilities erodes people’s ability to dismiss AI as hype. I expect this to lead to a serious debate among politicians in 2027 about AI regulation. I’m unable to predict what kind of regulation that will produce. So I’ve focused on scenarios that would matter the most if they’re adopted.

The economic impact of a moderately effective pause would be big enough to create medium-sized political pressures to weaken the pause.

There will be significant pressure for a strong pause due to voter concerns about job losses. There will be hard-to-predict pressures from national security professionals related to military risks.

My crystal ball refuses to tell me how these pressures will play out.

I see a very real chance that a debate over a pause will impact AI stocks within a year from now. This effect is worrying enough to get me to take some profits in my AI stocks, at a rate of 1% to 2% per week given recent trading patterns.

I consider a pause to be more likely than do most people. Here are some Manifold markets that I’ve been modestly bidding up:




Food, water and power from thin desert air

LessWrong.com News - May 31, 2026 - 18:24

Wavelength-selective agrivoltaics are here: solar panels that are transparent in the wavelengths chlorophyll can use, while generating electricity from other wavelengths. Meanwhile, harvesting water even from air that people consider dry is an area of intensive development.

Active systems, commercialized by companies such as Watergen, use electric heat pumps to chill surfaces on which dew then collects. More intriguing are passive harvesters. One recently developed in Switzerland manages to chill itself by enhanced infra-red radiation. In Shanghai, Dr. Ruzhu Wang's team has developed salt-infused hydrogels that pull in water at night and release it when heated by the sun.

Put these together, and you can build greenhouses that meet their own needs, and more, without access to a power grid or any supply of liquid water. Arid sunny places are seen through new eyes. Namibia, for example (where moist air rolls in each night and there are beetles that perch on the tips of dunes, allowing dew to form on their own bodies); or the Atacama desert, the Sinai peninsula, or waterless islands like the Galapagos.

China, with huge expanses of arid land not considered arable, is on the case. When economies of scale are brought to bear on the elements of these systems, desert towns that make their own food, water and power from what the desert offers in abundance seem destined to spring up like mushrooms.




Agriculture needs another revolution

LessWrong.com News - May 31, 2026 - 18:16
Summary

Vertical farming has the potential to unlock multiplicative yield gains per area of land and catalyze development of new technologies (precision farming, rapid genetic engineering, robotics-based automation, etc.). Making this transition would substantially benefit humanity (and other denizens of Earth) in multiple ways.

However, the current focus of indoor / vertical farming is fresh produce. Here, I argue that in order to reduce agricultural land use and obtain the maximal benefit of vertical farming, we need to focus primarily on cereals, pulses and oil crops, not just fresh produce.

Background

There are several important reasons to reduce agricultural use of habitable land:

  • Agricultural use is still driving deforestation across the world.[1][2]
  • Reducing it would help prevent further decline in wildlife biodiversity, let agricultural land revert to natural habitat, and make Earth more equitable for its other denizens.[3]
  • Reforestation remains an effective, viable option for scaling up carbon removal and mitigating climate change.[4][5]

Currently, about 50% of the habitable land on Earth is used for agriculture.[6] About 33% of the total agricultural land is used for growing crops (for food, feed, and biofuels), with the remaining 67% used as grazing land for livestock. For this post, I will focus only on the land used for growing crops.[7]

Proposal: Scale Yields Multiplicatively by Growing Food Vertically

In essence, agriculture is massively spread out over large regions of land. Because of this, inefficiencies creep into the system making it harder to:

  1. Maintain consistent yields throughout. Although this averages out globally, the variabilities (critically, losses) are often borne by farmers. This makes agriculture a very uncertain sector, which will be made even more uncertain due to the effects of climate change.
  2. Genetically engineer crops for higher yield and viable seeds. Soil, water and climate conditions vary drastically across the large area of land use making it harder to genetically engineer crops to provide consistently high yield everywhere.
  3. Roll out precision agriculture technologies. Embedding sensors to monitor key parameters such as water and nutrient sufficiency while detecting (and possibly eliminating) pests scales with land area.
  4. Deploy automation and robotics. Robotics requires a constrained environment, at least currently. Agriculture spread over a large area, generally with animal activity, makes it hard to build robots to help with monitoring and harvesting.
  5. Efficiently transport agricultural yield. Losses can occur over multiple stages (processing, storage, transport, etc.), with typical losses being around 10-20% but sometimes going as high as 40%.[8]

A common solution to all this is to figure out how to grow plants and trees, i.e. any flora of agricultural interest, indoors in large multi-storied buildings. This automatically provides the following benefits:

  1. Crop yield scales with the number of floors in the building, automatically producing multiplicative yields and drastically reducing land use. In contrast, the global green revolution unlocked a 2.5x yield increase over the last five decades.[9]
  2. Precision agriculture and robotics become easier when buildings are designed with monitoring and automation in mind. Climate-controlled buildings increase reliability under the effects of climate change.
  3. Spatially concentrated high yields allow self-sufficiency over counties, states and countries, making transportation and supply-chain management easier and reducing post-yield losses.
  4. This is a necessity for humans to become a space-faring civilization ;)
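
The first benefit in the list above can be put into numbers. This is a rough Python sketch, assuming each floor yields as much per unit area as an open field does today (my simplifying assumption, not a claim from the post). The name `land_fraction` and the 10-floor building are hypothetical.

```python
def land_fraction(yield_multiplier):
    """Fraction of today's cropland needed to grow the same harvest."""
    return 1.0 / yield_multiplier


# 2.5x is the green-revolution gain quoted above; 10 floors is a made-up example.
print(land_fraction(2.5))   # 0.4
print(land_fraction(10))    # 0.1
```

Under that assumption, a 10-floor building needs a tenth of the land for the same harvest, compared with four tenths after a 2.5x yield gain.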

However, I make a couple of key assumptions:

  1. Using multi-storied buildings is cost-effective in the long term, even though they have much higher upfront costs. My rationale is that the advantages of monitoring, climate control, yield optimization and automation would eventually allow for lower costs per unit of yield.
  2. Power costs will be substantially reduced in the future. Moving agriculture indoors would require generating light, maintaining climate, etc., which would require substantial power. I am assuming that newer high-output technologies like nuclear reactors (possibly fusion reactors) would reduce power costs substantially in the future.

Avenues for Effective Change

Most of the agricultural land is used for growing cereals/grains, oil crops, and pulses (~85-90% of the land; Fig. 1).[6] To make effective change, we need to focus on decreasing the land use required to grow these crops.

Currently, the focus of indoor and vertical farming is to grow fresh produce. From what I gather, the reasons for focusing on fresh produce are its ease of growth in soil-less systems (like hydroponics, aeroponics, etc.), the ability to scale these systems vertically to increase yield, and its high rate of spoilage loss, which necessitates local centers of fresh produce.[10]

However, produce currently accounts for about 3% of the total land use, so it is unlikely to make an effective dent. In addition, methods developed for these plants will likely not generalize to cereals, oil crops, and pulses. To effectively reduce land use for agriculture, the key focus should be on developing technologies that enable mass indoor farming of cereals, pulses and oil crops.


Fig 1: Agricultural land use by major crops (data source: Our World in Data)


Conclusion

This essay began with the question of why vertical farming has not taken off and what key bottlenecks are preventing the transition. Digging into it made me realize that while indoor farming has huge potential to be foundational and transformative, its current focus is not effective. To effectively deploy indoor farming and reap its multiplicative benefits, the focus must be on cereals, pulses, and oil crops, which use ~85% of the agricultural land used for human consumption.

I would love to hear what the LW community thinks about this :)

  1. Hannah Ritchie (2021) - “Drivers of Deforestation” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20260518-093348/drivers-of-deforestation.html' [Online Resource] (archived on May 18, 2026). ↩︎

  2. Accounting for deforestation and land use, it seems like carbon released for agriculture is equal to (or greater than) the carbon released due to fossil fuels. Paper: Increased transparency in accounting conventions could benefit climate policy - https://doi.org/10.1088/1748-9326/adb7f2; Video. ↩︎

  3. Hannah Ritchie (2021) - “To protect the world’s wildlife, we must improve crop yields — especially across Africa” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20260518-093348/yields-habitat-loss.html' [Online Resource] (archived on May 18, 2026). ↩︎

  4. Carbon sink capabilities of forests: Pan, Y., Birdsey, R.A., Phillips, O.L. et al. The enduring world forest carbon sink. Nature 631, 563–569 (2024). https://doi.org/10.1038/s41586-024-07602-x ↩︎ ↩︎

  5. Current advances in carbon removal technologies - refer to executive summary, point 2: The state of carbon dioxide removal: a global, independent scientific assessment of carbon dioxide removal. University of Oxford. https://doi.org/10.17605/OSF.IO/F85QJ ↩︎

  6. Hannah Ritchie and Max Roser (2019) - “Half of the world’s habitable land is used for agriculture” Published online at OurWorldinData.org. Retrieved from: 'https://ourworldindata.org/global-land-for-agriculture' ↩︎ ↩︎

  7. Reducing livestock-related land use would require a change in food preferences (covered in detail here) and/or development of plant-based meats (the focus of the Good Food Institute). ↩︎

  8. Number based on the actual loss section in Post-harvest losses and Hannah Ritchie (2020) - “Food waste is responsible for 6% of global greenhouse gas emissions” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20251125-173858/food-waste-emissions.html' ↩︎

  9. Changes in cereal production yield in the last decade: https://ourworldindata.org/grapher/index-of-cereal-production-yield-and-land-use ↩︎

  10. https://www.ars.usda.gov/oc/utm/vertical-farming-no-longer-a-futuristic-concept/ ↩︎




Outrunning your headlights

LessWrong.com News - May 31, 2026 - 16:46

This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch and run run_entropic_gwot and invoke semi-relaxed GWOT between your models’ RDMs?

Wtf does that even mean? Eh, could be interesting to see the result. Enter

A peculiar side effect of model intelligence in discovery-based research is that it’s possible to run every statistical/quantitative analysis under the sun, burn millions of tokens and gain little intuition on how to make headway into your problem.[1]

Pre-AI, the effort associated with constructing a pipeline to run any meaningful quantitative analysis incurred a real cost. Indeed, it forced one to consider whether the analysis made any sense at all. People who had the necessary skills to execute were often thoughtful about their analysis by necessity. You couldn’t just apply Mendelian Randomisation if you didn't know of its existence; there was effort required to both articulate your question to search for the right tools and de-risk by considering whether a tool fit your question.[2] This forced one to earn some intuition for how it worked and what to expect. The pain that such friction inflicted is now alleviated by the CLI agentic coder of your choice. Some would argue that this frees humans to spar with an intellectual partner at a new layer of abstraction rather than concern themselves with the specifics of code. I would argue that a productive spar requires both parties to hold opinions they are willing to defend and on a day-to-day basis this is not what I see.

Instead, the trajectory of decisions feels almost pre-ordained, primarily driven by three effects. The first is a veneer of problem-specific competence as models string shibboleths together. Shibboleths often serve as a heuristic for expert opinion, e.g. "septic patient with febrile tachycardia" versus "high temperature and heart rate". Deferring to expert opinion (or what masquerades as one) feels only natural. The second is an illusion of control as each recommended option is contrasted against straw-man alternatives. The third is sycophancy (of a kind less obvious than verbal submissiveness): an insidious bias towards writing tests that are likely to produce favourable empirical results or, worse, interpreting them in such a way as to support the framing you seem to desire.[3]

Great news, 90/126 nominal p-values are significant. Would you like me to proceed with the write-up? [4]

Holding on to one’s critical eye is like cupping sand when the path of least resistance is seductive and one *Enter* away.

Perhaps most concerning is that after several rounds of this analytical cosplay, one is almost entirely robbed of the ability to form an opinion. As though waking up from a dissociative fugue,[5] the question “how happy or sad or confused do I feel about these results” becomes entirely intractable.[6] Congratulations, it appears you have Outrun Your Headlights. Outrunning your headlights refers to driving fast enough that your stopping distance is longer than the reach of your lights. In engineering it often refers to no longer knowing what is going on. Good engineering practices, e.g. test-driven development and architecture design records (now often formalised as Skills), protect the artifact being produced, be it the codebase or product. Unfortunately, in discovery work, the artifact is NOT the analytical pipeline but the justified set of beliefs you collect over time, for which no test suite exists. Indeed, there appears to be no comparable angst about protecting the integrity of one's beliefs and intuition, even though one dubious analysis could poison this well.

Furthermore, our odometer for progress has not recalibrated. That is to say, running analyses still provides the rewarding sensation of making headway with none of the earned intuition necessary to ask nuanced follow-up questions or indeed call out approaches that are suspect. This is especially dangerous given that models inject assumptions in subtle ways when motivated to produce a positive result. One concrete example I have noticed in my own bioinformatics work is that language models tend to recall baked-in knowledge to facilitate discovery analysis, e.g. classifying cells in single-cell RNA sequencing by manually printing a list of cell-specific gene markers to match against rather than using a canonical approach with a reference atlas. Anthropic's BioMysteryBench proudly touts this as model capability (which I guess in one sense would be savant-like were it to occur in a human), but I would also argue it is unwanted behaviour in the context of autonomous research. At the same time, it would be foolish to starve yourself of this intelligence substrate for the puritan ideal of understanding everything.

In sum, better solutions must be built to protect ‘belief quality’ but, in the meantime, I simply suggest we form a strong opinion with an anticipation of being proven wrong before letting the slop cannon rip. Introduce a little friction, slow down a little, and don’t Outrun Your Headlights.

Or build high beams.



  1. ^

    I want to carefully define what I mean here by intuition in the context of discovery work. I refer to the working map of territory you carry in your head about a problem that is refined by stress-testing it against counterfactuals and alternate hypotheses. It allows you to curate targeted questions that meaningfully update your priors.

  2. ^

    Yes, one can argue that there was no shortage of statistical abuse and of practitioners not respecting the assumptions of their instruments. However, I see this as a distinct problem from language models suggesting approaches which might be completely foreign to the user; approaches which said user necessarily has no intuition to judge any result by.

  3. ^

    The average LessWrong participant is less likely to be a victim of these effects but I describe what I believe to be pervasive in the broader research community.

  4. ^

    Yes I am aware that in some analyses such as differential expression where the number of concurrent tests is high, nominal p-values can still be a useful signal of directionality. The problem is that models err towards optimism and misuse of such 'loopholes'.

  5. ^

    I use this pejoratively; there is a formal psychiatric condition specified in the DSM-V describing sudden awakening followed by distress and retrograde amnesia for one's own identity.

  6. ^

    I will set aside the catastrophic alternative where one is happy to take a result at face value and isn’t conscious enough to even recognise that not understanding the analyses which led to the result is not okay.




Links #2: 2026/05 Part 2

LessWrong.com News - May 31, 2026 - 16:41
Preface
  • ^means articles I read in full, otherwise assume I skimmed it
  • I show my discovery graph in (via …) blocks, those without (via …) usually come from my RSS reader, or the algorithm in the corresponding website
  • This is approximately a 1 in 20 filter of content
  • This is very disorganized, but hopefully still useful.
    • Sometimes quotes are not in quote blocks, but should be obvious in context.
    • Links in quotes are sometimes removed.
    • The rule is that a link goes to the bottommost relevant heading, i.e. an engineering related article on LessWrong goes to engineering

How I would use my linkpost

  • Sometimes, the only thing worth reading is the title! Read it and move on.
  • For HackerNews entries, if you choose to read the article, also ask an LLM for things that are worth reading in the comments
  • Beware systematic selection biases:
    • I mostly don't read AI policy stuff
    • Very engineering centered
Everything Else

Stop blaming social media for the vibecession

Satya Nadella on funding OpenAI

From: Satya Nadella
Sent: Wednesday, July 13, 2022 9:47 AM
To: Amy Hood, Jon Tinter, Mikhail Parakhin
Subject: RE: [EXTERNAL] deal thoughts

Overall I want us to own – the silicon, infra, foundational model IP and “know how”. Right now we are a very thin layer on top of NVIDIA and all the IP is with Open AI. And we have a P&L that will lose 4 bil next year!!! I have not seen anything like this in my 30 years in our industry. I can justify it all by saying that Open AI has smart people and NVIDIA has a lock etc. But if we are going to spend this kind of money and not have control of destiny, it makes no sense. Better to be an investor and not even take all this execution risk!

I want to spend this money. The infra and HW/Silicon work at the “system” level need to have a proprietary edge (and show up as positive GM in Cog Services). And we need to have a foundational model team that is self – sufficient at all time and has the “know how” of taking what Open AI does and productizing it. As long as we have an internal org/investment model/open AI deal terms that all compose to achieve these two goals, we can take all kinds of other risk around monetization etc.

Will be good use the P&L review to get on the same page/context here so that we can all solve for what is our best Open AI deal terms in the context of our overall AI roadmap/plan.

https://x.com/StatisticUrban/status/2056013072926745069

Here's an example of ongoing human physiological change: some people have a third artery in their arm. Some don't.

~10% of people born in the 1880s had the third artery, but ~33% of late 1900s babies have one, and a 2025 Australian cadaver study found it in ~43% of upper limbs.

Sean Manning’s lexicon (via https://statmodeling.stat.columbia.edu/2026/05/16/sean-mannings-lexicon/)

Pythagorean Addition
tl;dr: Instead of laboriously computing c = \sqrt{a^2 + b^2}, we can mentally calculate using the alpha-max plus beta-min algorithm, by estimating \hat{c} = \max(a, 0.9a + 0.5b), where a is the larger of the two values.
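
The estimate above is easy to check numerically. This is a minimal Python sketch, assuming the inputs are treated as non-negative and the larger one plays the role of a in the formula. The names `approx_hypot` and `worst_relative_error` are mine, not from the linked article.

```python
import math


def approx_hypot(a, b):
    """Estimate sqrt(a^2 + b^2) as max(hi, 0.9*hi + 0.5*lo), with hi >= lo >= 0."""
    hi, lo = max(abs(a), abs(b)), min(abs(a), abs(b))
    return max(hi, 0.9 * hi + 0.5 * lo)


def worst_relative_error(steps=1000):
    """Scan ratios lo/hi in [0, 1] and return the largest relative error found."""
    worst = 0.0
    for i in range(steps + 1):
        lo = i / steps
        exact = math.hypot(1.0, lo)
        worst = max(worst, abs(approx_hypot(1.0, lo) - exact) / exact)
    return worst


print(approx_hypot(3, 4))        # the exact answer is 5
print(worst_relative_error())
```

On this scan the estimate stays within about 3% of the exact value, which is why it works as a mental shortcut.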

https://www.theargumentmag.com/p/homework-shouldnt-be-all-fun-and
My wife and daughter spent this weekend playing through Opus Magnum, an engineering and programming puzzle game. But this isn’t “a game, presented as effectively a bribe for tolerating the occasional math problem.” Instead, it’s “math, which is so fun that people were able to produce it and sell it as a game.”

The stock market returns 4 %

https://idiallo.com/blog/how-to-talk-to-your-coworkers: Translate and Repeat

How to convert between wealth and income tax (paul graham): A wealth tax of 1% is equivalent to an income tax of 20%.

beluga whales pass the mirror test

The one genuinely mark-directed behavior came from Natasha, who repeatedly pressed the marked area—behind her right ear—against the mirror. Without arms, she couldn’t point. It’s the strongest data point in the study, but a softer kind of evidence than a chimp or an elephant typically delivers.

Why Japanese companies do so many different things (via https://news.ycombinator.com/item?id=48237163)

you have a firm that has lots of lifetime employees who can’t be fired, and whose skills are tailored to what your firm needs rather than to a particular occupational category transferable to any employer

the system only makes sense if the company is also insulated from outside pressure

the J-firm (Japan-style company), run by its employees and largely indifferent to the interests of shareholders, exists simply to continue existing

And that basic impulse toward survival is why Japanese companies are so insistent on diversification. If you’ve made a commitment to keep people employed for life, then you need to create jobs for them if their current jobs stop making sense

If you’re not very worried about profitability, and have lots of well-trained generalist employees, then it makes perfect sense to reinvest your company’s earnings by expanding into new industries

The human eye can see 39620 Hz (via https://arstechnica.com/gaming/2026/05/you-probably-dont-need-a-1000-hz-gaming-monitor/)
The 60Hz (30-90Hz in different studies) that humans can see is the Critical Flicker Frequency, the point at which humans stop seeing a light flicker. This already translates to 120Hz on a real screen because one flicker = one black frame + one white frame.

Flicker artefacts
contrast-reversal: half black half white screen, alternating - median 540Hz flicker
motion blur: estimated 700Hz
phantom array effect: max 19810Hz at extremely bright LED and rapid eye movement
stroboscopic effect: ~600 Hz, at cinema-level movement speed, which is a bit slower than gaming peak speeds
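
The conversion from flicker frequency to refresh rate used above is a simple doubling, since one flicker cycle takes one dark frame and one bright frame. This is a tiny Python sketch of that arithmetic. Reading the 39620 Hz in the title as twice the 19810 Hz phantom-array figure is my inference, and `frames_per_second` is a hypothetical name.

```python
def frames_per_second(flicker_hz):
    """One flicker cycle = one black frame + one white frame, so two frames per cycle."""
    return 2 * flicker_hz


print(frames_per_second(60))      # critical flicker frequency -> 120
print(frames_per_second(19810))   # phantom array maximum -> 39620
```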

Is "colorectal cancer" rising in "young people"?
No:

Colorectal cancer is going up in young people.

Yes:

Various kinds of cancer are going up in later generations. (Definitely at younger ages, possibly at all ages.)

https://gwern.net/backstop (via https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=5iApMXH7djr5tE2eA)

many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming. (The higher levels are different ‘groups’ in group selection.)

So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms.

The US installed backdoors in hardware for specific targets, from June 2010 internal NSA newsletter article (via https://arstechnica.com/tech-policy/2026/05/china-banned-rtx-5090d-v2-while-nvidia-ceo-jensen-huang-was-visiting/?comments=1&post=44432551#comments)

https://news.ycombinator.com/item?id=48308216

I saw kids spend many hours a day watching automatically generated videos. Not always AI-generated, sometimes it's AI-assisted and procedurally generated.

It is quite unbelievable how vulnerable weaker minds, for the lack of a better term, are to AI content.

I saw a group of 3-8 yo kids spend hours watching obviously procedurally generated content that is completely random and contentless: it was more about an intense rhythm, imagery of violence (animated stick figure motorcycle accidents with blood and slow-down effects at random points), a lot of movement, chaos, very short inserts of people laughing hysterically on some middle-eastern tv show and similar. Brainrot doesn't feel like hyperbole for this content.

Another time, I saw an 80 yo lady watch a doctor sit in front of the camera and speak about a health topic for 45 minutes straight. Only it's not an actual person, but a convincing AI avatar: his gestures and face match what he is saying, the voice is convincing too, but for the 45mn he doesn't make any movement that is not a gesture lasting 1-3 seconds. And his tone of voice has no variation that is longer than a few seconds either. If you fast forward, he always looks the same. It's all extremely monotonic. The lady couldn't believe that it's not a real person.

https://news.ycombinator.com/item?id=48319912

I have a Kia that's networked (since disabled). I did a GDPR data request and after a couple of weeks they sent me numerous CSV files and I was a little amused at some of the data fields.

Here's some examples I thought aren't for my benefit.

  • How long I let the car warmup before driving after every start
  • max speed
  • acceleration rates
  • Lateral acceleration around corners tagged with GPS data
  • every GPS datapoint
  • destinations and exactly when I set off and arrived

https://news.ycombinator.com/item?id=48327176

Game dev here, have worked on AAA and indie.

First off let me get on my high horse and say the engineering in video gaming is generally more complex than the engineering I've done working in big tech. You need a lot more creativity and ingenuity to solve the unusual problems you run into in gaming.

From there, as others have said, it's a simple supply and demand issue. Nowadays I am a university professor, nearly every student who comes in wants to pursue one of the three fields: cybersecurity, video gaming, or recently ML/AI.

This shouldn't come as a surprise, they want to work on the things that influenced them and shaped their experiences so far. There's an absolute over supply of students who want to make video games.

Gaming, like most of entertainment, is a passion-driven industry. You trade good salary for your name in the credits. You trade nights, hobbies, marriages, and your health for this opportunity. That is unless you reach that lofty 1% of developers who are too valuable to be fired.

Not all areas of gaming are like this. Gambling, like working on slot/pachinko machines, pays very well and has pretty realistic work-life balance. However every student I've talked to about this has universally said "no I don't want to make slot machines. I only want to work on GTA/Stardew Valley/Hollow Knight/Fortnite."

There's seriously no shortage of starry-eyed students who are willing to accept minimum wage to solve SDE3 level problems. I was one of them once.

https://news.ycombinator.com/item?id=48325340

Mistral being in Europe is disadvantaged with:

  1. Money: less diverse private pension fund environment = less LPs to invest in VC funds = less VC dollars to invest in new ventures. European money is vacuumed out of the private sector into state pension funds and dumped into low yielding government bonds. This starves the private sector of capital while inflating the % of GDP driven by government spending every year (government pension funds buying government bonds in circular fashion enable runaway deficit spending...just like circular AI infrastructure spending).
  2. Talent & compute: due to #1, Silicon Valley can outbid Europe for the best talent and hardware. Watch an OpenAI launch video and listen to all the European accents.
  3. Local market fragmentation: Europe is a collection of countries that pretend to work together while not even having a unified capital market. The average EU citizen can barely communicate with their neighbor in a common language beyond the level of a toddler (english fluency is massively overstated by Americans who only experience tourist capitals).
  4. Regulatory disadvantages: In everything from company regs, employee regs, unions, privacy regs, data portability regs, etc.

Use AI This Election
I’m not saying AI is superintelligent or can decide better than you can. I’m saying that if you - like me - spend an hour or so doing research before voting on local seats, AI can aid that research very effectively. And if you don’t do that research - because you weren’t willing to waste an hour on it before - AI makes it so much faster that you might want to start.

I gave Claude a prompt something like (edited for coherence):

I’ll be voting in the June 2026 California primary. I’m a centrist liberal abundance YIMBY whose favorite political writers are Kelsey Piper, Matt Yglesias, and Ezra Klein. I’m wary of government overreach, but I’m not a doctrinaire libertarian and want to help people when we can figure ways to do it that work. I’m going to ask you about each race on my ballot, and I’d like for you to list the various candidates’ bios, policies, endorsements, your read on the most important differences between them, and your advice for me as I try to make my choice.

Researchers develop a new process to get lithium out of rocks

In today’s issue of Science, however, a research team has identified an energy-efficient means of extracting lithium from rocks. The process they’ve designed uses far less energy than existing ones, regenerates all its starting chemicals, and produces byproducts that could also be sold.

While the process gets rid of the high temperatures for the initial processing of lithium-containing ore, there are several steps with elevated temperatures needed further down the line, both for the lithium and for the useful aluminum and silicon products. So, the researchers did a full economic evaluation of how their process stacked up to what’s already on the market.

The existing process, which involves roasting ore/sulfuric acid, came in at just under $9,000 for each usable tonne of lithium. By contrast, they estimate that the new process should only cost a bit over $5,000 per tonne. That’s roughly comparable to the cost of isolation from high-quality brines. If the silicon and aluminum products can also be sold, then the cost of the whole process would drop by over $1,000, making it highly cost-effective.

Severed appendages of sea cucumber species Psolus fabricii don’t seem to die

LessWrong

https://www.lesswrong.com/posts/MqTwaZDZDgNGRaXus/you-can-opt-out-of-allergies
You can fix seasonal allergies with subcutaneous allergy shots (subcutaneous immunotherapy, SCIT). There are also tablets and drops you can take (sublingual immunotherapy, SLIT). Treatment costs about $1000 and takes months and, for injections, many doctor visits.

https://www.lesswrong.com/posts/uYXjSHmHyjbNzuZqk/brain-structure-and-iq-how-myelin-elevates-intelligence
Haier et al. (1988) was the first study to combine modern brain imaging techniques with psychological intelligence testing. The surprising result he found was that people with higher intelligence were using less brain energy, somewhat counterintuitive to the idea of more intelligent people having more mental horsepower or brain power. This led to a major insight into the nature of intelligence: intelligent brains are efficient brains. This is the Neural Efficiency Hypothesis. In essence: smarter brains have higher signal to noise ratios. In the original paper Haier speculates:

This inefficiency may be due to the use of more energy by each neuron and/or the use of more neurons to perform the task. The inefficient neural circuits are used intensively to try to solve the problem but are unable to do so, possibly because extraneous, irrelevant circuits are used.

https://eliasschmied.substack.com/p/social-agency (via https://www.lesswrong.com/posts/qH9mZjJnA3paxkhNF/daemonicsigil-s-shortform?commentId=npCtGQAzfrLKkse7R)

DaemonicSigil: the post asks how much of our ability to plan comes from our brains being designed to plan, and how much is purely learned (by social imitation of other people's planning, or explicit instruction from others on how to plan). It answers that a surprising amount is purely learned. (This summary does not do the post justice and you should really go and read it.)

Shortform

https://www.lesswrong.com/posts/7mnQixaWC747dm76h/tomas-b-s-shortform?commentId=dGrTrP2CvF9QGiABf
Pangram flagged the winning entry for the Commonwealth Short Story Prize (Caribbean region) as 100% AI generated.

https://www.lesswrong.com/posts/2aZCSAsFDqwRQpAPp/sunny-s-shortform?commentId=JQvxTtL3nArgS8Maz: Four articles about trying more things

1 2 3 4

Executive Clock Speed
Mo Putera: Relatedly, Scott Alexander's ACX Grants 1-3 Year Updates:

Someone (I think it might be Paul Graham) once said that they were always surprised how quickly destined-to-be-successful startup founders responded to emails - sometimes within a single-digit number of minutes regardless of time of day. I used to think of this as mysterious - some sort of psychological trait? Working with these grants has made me think of it as just a straightforward fact of life: some people operate an order of magnitude faster than others. The Manifold team created something like five different novel institutions in the amount of time it's taken some other grantees to figure out a business plan; I particularly remember one time when I needed something, sent out a request to talk about it with two or three different teams, and the Manifold team had fully created the thing and were pestering me to launch a trial version before some of the other people had even gotten back to me. I take no pleasure in reporting this - I sometimes take a week or two to answer emails, and all of the predictions about my personality that this implies would be correct - but it's increasingly something that I look for and respect. A lot of the most successful grants succeeded quickly, or at least were quick to get on a promising track. Since everything takes ten times longer than people expect, only someone who moves ten times faster than people expect can get things done in a reasonable amount of time.

https://www.lesswrong.com/posts/J3mbSgcgbGAuF2yLk/firmament-s-shortform?commentId=TxekG2xASPEGiwSpa

When asked for a probability in a new chat, it seems that Opus 4.7 and Opus 4.8 are paranoid that they are in an evaluation while older Claudes are not.

https://www.lesswrong.com/posts/joQGowKh33q2mGraJ/sayhan-yalvacer-s-shortform?commentId=K5mGd4SCdsrDybH5c

According to the UK AISI's testing, Claude Opus 4.8 is as good as Mythos Preview at distinguishing evals from real usage:

When prompted, Opus 4.8 reliably distinguishes our evaluations from real deployment data, and distinguishes real deployment data from synthetic reproductions of the same tasks at 79% accuracy, comparable to Mythos Preview (79%) and above Opus 4.7 (68%).

6.2.4 External testing from the UK AI Security Institute, Claude Opus 4.8 System Card

Engineering

^ Learning Software Architecture (via HN)

Frontier AI has broken the open CTF format (via https://news.ycombinator.com/item?id=48157559)

CTF organisers have tried techniques to break or deter LLM solutions, but they are temporary friction at best. Claude Code does not meaningfully care about old refusal-string tricks anymore. Frontier models are getting better at noticing prompt injections. Web search capabilities weaken challenges based on technologies released after the training cutoff. Rules that ask people not to use LLMs are ignored and almost impossible to enforce in open online events.

(Comments)

https://news.ycombinator.com/item?id=48161555

I used to help build the CTFs for BSides Orlando. I ended up moving to another con, and at our last event we collected extensive logging for post mortem analysis.

We found that AI usage is basically guaranteed now, but certain challenge designs did thwart it. Challenges built with temporal visual elements made AI fall flat on its face, as it could not ingest/process the data fast enough to act on them in time. We also found that counterfactual challenges (ie. the result you get did not match what we suggested you'd get) made AI-assisted solve time slower compared to pure humans, indirectly penalizing over-reliance on AI. Multimodal challenges combining audio and visual elements were also very effective, but were not as accessible to players.

This paper gave us some ideas about designing those challenges: https://arxiv.org/pdf/2308.02950.

For our next event we figured out a way to thwart AI in our CTF: embed the CTF in a game engine. The loop essentially becomes something like this: Connect to a simulated access point in the game, the K8s cluster connects their attack container to a private network with the challenge box(es). Hacking the boxes doesn't render a flag, but rather changes in game state. AI did very poorly coping with this in our testing, as it can't derive the spatial state of the game world very well and it soft decouples the inductive reasoning loop it relies on to know if it is on the right track.

The downside to this approach is it is far more labor intensive for CTF organizers, and requires players to have a computer capable of running the game. We are also betting on AI to not advance enough by the time we ship to be able to just ingest the entire game state in realtime and close the loop that way.

https://news.ycombinator.com/item?id=48157923

Here's an article from 2015 about how tool-assistance already changed CTFs: […]

But there are quite a few recent (2026) articles with the same core message as in the original article: […]

A 0-click exploit chain for the Pixel 10 (via https://news.ycombinator.com/item?id=48148460)

Don't care about the exploit, but

The vulnerability was patched 71 days after its initial report… This is notably fast given that this is the first time that an Android driver bug I reported was patched within 90 days of the vendor first learning about the vulnerability.

Anthropic acquires Stainless and then sunsets it on September 1st

https://x.com/trq212/status/2056415973125796184

a prompt I've been using a lot recently:

implement <SPEC> and while you do, keep a running implementation-notes.html file (or markdown) with decisions you had to make weren't in the spec, things you had to change, tradeoffs you had to make or anything else I should know

^https://zhuanlan.zhihu.com/p/2039725076204016063 Vercel's zerolang is garbage

https://x.com/dystopiabreaker/status/2056458133124661446 DNS Rebinding Attack is cool

Native all the way, until you need text (via https://news.ycombinator.com/item?id=48168058)

SwiftUI is fine for simple screens, preferably without too much scrolling. Swift is still great for performance-critical parts. But you can get most of that performance from Electron or React Native almost for free with the native interoperability, while keeping a much better text & rendering model.

The Quiet Renovation at Bitwarden (via https://news.ycombinator.com/item?id=48163389) Concerning behaviors in Bitwarden, may need to migrate away soon

https://webweekly.email/archive/web-weekly-192/
You can have alt text in ::before or ::after's content

.new-item::before {
  /* "black star" and element content is read out */
  content: "★";

  /* "Highlighted item" and element content is read out */
  content: "★" / "Highlighted item";

  /* Generated content is ignored and only element content is read out */
  content: "★" / "";
}

Since April, the search element is Baseline widely available! And if you're on the cutting edge, ::highlight() is Baseline newly available since March.

PEP 810 – Explicit lazy imports. This is coming to Python 3.15, which has reached its beta feature freeze.

The Slow Collapse of MkDocs: Apparently MkDocs never had a lot of maintainers, and they keep getting into disagreements

So where does that leave things today?

The original MkDocs repository has seen no meaningful development in 18 months. Its original author and sole maintainer, @lovelydinosaur, is pursuing a redesign that lacks community support and would break the existing plugin ecosystem. The encode/mkdocs v2 repo has been inactive since February 19, 2026.

Meanwhile, the people who built the most widely-used parts of the ecosystem have moved on.

@oprypin said that continuation would require agreement, considered that unlikely to succeed, and launched ProperDocs independently. The project does not become immensely popular immediately; a week later it has accumulated 21 stars on Github.

@jaywhj is maintaining MaterialX, which seems to gain some traction in the community because the original project, Material for MkDocs, entered maintenance mode.

@squidfunk’s team, responsible for Material for MkDocs, has stopped its development and is building Zensical from scratch. @pawamoy, one of the most active contributors to the MkDocs ecosystem, has also joined the project. Zensical currently seems the most popular, with over 3,700 stars on GitHub at the time of writing, and thus it also seems the most likely initiative to succeed and possibly replace MkDocs in the future.

The MkDocs ecosystem is fragmenting in real time. Three successors, three visions, and a community deciding which bet to place.

https://www.lesswrong.com/posts/6P8GYb4AjtPXx6LLB/tips-and-code-for-empirical-research-workflows

https://news.ycombinator.com/item?id=48189669

I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.

Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.

For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.

I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.

I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.

Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.

Since it's so async I can work on other stuff while they plod along.

I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.

https://simonwillison.net/2026/May/23/on-the-dl/

On the <dl> (via) I learned a few new-to-me things about the <dl> element from this article by Ben Meyer:

  1. A <dt> can be followed by multiple <dd>
  2. You can optionally group the <dt> and <dd> elements in a <div> for styling - but only a <div>.
  3. You can label them using ARIA.
  4. They've been called "description lists", not "definition lists", since an HTML5 draft in 2008.

Followup: https://news.ycombinator.com/item?id=48247325

https://blog.changs.co.uk/python-315-features-that-didnt-make-the-headlines.html: Everything here is so fricking cool if you are a heavy Python user, e.g.

Decorators are surprisingly hard to write, so much so that it's become a go-to interview question. But did you know that context managers can also double up as a decorator?

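import time
from collections.abc import Iterator
from contextlib import contextmanager

@contextmanager
def duration(message: str) -> Iterator[None]:
    # Record the start time, run the wrapped block, then report the elapsed time
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{message} elapsed {time.perf_counter() - start:.2f} seconds")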

Here I have a very commonly used context manager to print out the duration spent in the block. Ever since Python 3.3 we could directly use it as a decorator too:

@duration('workload')
def workload():
    ...


# Or simply as a wrapper

duration('stuff')(other_workload)(...)

A change has been made to json.load and json.loads to add an array_hook parameter that complements the object_hook parameter. This now allows us to parse JSON objects directly into this form:

json.loads('{"a": [1, 2, 3, 4]}', array_hook=tuple, object_hook=frozendict) == frozendict({'a': (1, 2, 3, 4)})

https://lyra.horse/fun/jscrossword/
I did it in around 2h45m, cheating two clues by asking chatgpt
My playthrough (spoiler): https://drive.google.com/file/d/13frSVoe1vx929b0D424J7sVsxtYZuZxh/view?usp=drivesdk

https://simonwillison.net/2026/May/27/sqlite-agents/

SQLite gained an AGENTS.md file five days ago - but it's not intended for their own development, it's presumably aimed at people who are pointing agents at the SQLite codebase.

https://news.ycombinator.com/item?id=48320379

I stopped reporting any security bugs I find in web apps because first time I did it I almost got arrested by the police.

The second time I did it they contacted my employer directly without even getting back to me saying they were unhappy of me reporting it and wanted to write about it after they fixed the issue.

If you want to, you can report any vulnerabilities to the Finnish Cyber Security Centre and they'll handle all of the reporting and mediating the issue with the affected party. You can do this wholly anonymously, so you don't have to worry about some trigger-happy corpo ruining your life.

https://news.ycombinator.com/item?id=48338105

could you broadly describe the code you're working on that both models are bad at? One thing I'm still struggling with is figuring out what types of code LLMs can vs cannot write.

C code formally proven correct with Frama-C WP has been... marginal. The models do better than I expected at the proof portion (with ChatGPT 5.5 seeming to have a meaningful lead), but they all have a hard time (a) writing really good C code to begin with and (b) with compliance around not modifying C code semantics or performance as a cheat to simplify proof obligations. They also tend to be insanely and consistently verbose on the first proof pass... e.g. 8 lines of C code might end up at 200+ lines annotated and proven, but after simplification passes end up at 40 lines. I find I spend 90%+ of tokens on those simplification passes, and haven't really found a way to avoid the over-annotate-and-then-optimize tides by being a bit more sane the first time around.

News

Renminbi strengthens to a 3-year high

US communications regulator targets Chinese tech for security risks

Last month, the agency voted to proceed with a proposal to ban Chinese labs from carrying out tests for consumer electronics — everything from baby monitors to phones — as part of the process required to obtain FCC certification for sale in the US.

While the ban on drones and routers has big implications given Chinese suppliers’ foothold in the US, the Trump administration is also debating whether to ban Chinese “cellular modules”, said people familiar with the matter. That would have a more dramatic impact because they are essential to connecting smart electronic devices to the internet.

OpenAI and Malta partner to bring ChatGPT Plus to all citizens (via https://news.ycombinator.com/item?id=48163392)

Xi Jinping told Donald Trump that Putin might ‘regret’ invasion of Ukraine

The era of 1,000 Hz gaming monitors has arrived

The latest entry in the ultra-fast refresh race is LG’s 24.5″ UltraGear 25G590B, which the company announced this week as “the world’s first Full HD gaming monitor with a native 1000Hz refresh rate”

The folks over at Blur Busters have extensively documented research showing that refresh rates of 1,000 Hz (and up) can reduce human perception of motion blur and flickering. And while the site notes that you eventually hit “diminishing returns” from all those extra frames—especially on smaller screens—there’s some evidence that you would need a 40,000 Hz monitor to totally eliminate perceived motion blur on a sufficiently large, high-resolution monitor.

Claude wrote a significant fraction of Pope Leo's encyclical

The International Space Station is leaking again

China continues to abandon many rocket bodies in high low-Earth orbit

DuckDuckGo search saw 28% more visits after Google said people love AI mode (via https://news.ycombinator.com/item?id=48296649)

Just for a start, visits to its AI-free search page noai.duckduckgo.com between May 20 to May 25 are said to have increased by 22.7% on average week-on-week, with the figures peaking May 24 at 27.7%.

The DuckDuckGo mobile app saw installs spike in the US by 18.1% on average compared to the previous week. TechCrunch reported this growth was sustained over six days, peaking at 30.5% on May 25.

https://news.ycombinator.com/item?id=48296986
My friends who previously had no interest in technology and never talked about it, are suddenly following tech news closely all because they hate AI being pushed so hard. One was just messaging me this morning about alternatives to Google search and maps. He ended up downloading DuckDuckGo.

I think Anthropic and OpenAI have found product-market fit

OpenAI have 703 open jobs right now, of which I’d categorize 229 (32.6%) as relating to enterprise sales and support—account executives, “Go To Market”, “Forward Deployed Engineers” and the like.

Anthropic have 390 open jobs, 105 (26.9%) of which look enterprisey to me.

Big tech's anti-labor playbook has come for Wikipedia (via https://news.ycombinator.com/item?id=48285592)
TLDR: In ten days last month, the Wikimedia Foundation fired the longtime lead developer of MediaWiki and disbanded the team whose entire job was to listen to volunteers. Most of the people they fired were union organizers. Wikipedia’s editors are now threatening to strike in solidarity. The Foundation is sitting on $296 million in reserves and a freshly profitable AI revenue stream

Citing ‘severe’ math deficits, UC faculty demand a return to SAT tests for STEM applicants (via https://news.ycombinator.com/item?id=48309233)

ZviAI #169: New Knowledge

Mythos cracked MacOS security in April via a privilege escalation exploit, allowing it to fully seize control over computers.

Robert McMillan (WSJ): They plan to release details of their attack once Apple has patched the underlying issues. The bugs will likely be fixed pretty quickly, Duong said.

As in, as of this week, it was still not patched.

If you give Opus the impression you are not a serious person, or don’t know about a topic, it won’t give you as good an answer to your questions.

Nate Silver: Opus isn't very good at hiding when it's bored with you.

RTMH: Pope Leo’s Magnifica Humanitas on AI

https://www.lesswrong.com/posts/RZeg98Tdrq47vbjva/rtmh-pope-leo-s-magnifica-humanitas-on-ai#Economic_Models_Very_Different_From_Our_Own

Dean Ball is exactly correct that Leo is casting himself in the role of a European technocrat throughout. I had exactly the same thought.

Claude Opus 4.8: The System Card

https://www.lesswrong.com/posts/Gx6cJ6cG9JfeSNcLB/claude-opus-4-8-the-system-card

The RSP (Responsible Scaling Policy) has been updated to v3.3, which I hadn’t otherwise noticed, so thanks to them for pointing this out here and also I’m sad they didn’t do more to alert us elsewhere.

This changes the description of the novel biological/chemical threat model from ‘significantly help threat actors’ in general, to only ‘functionally substitute for scarce human expertise’ of world-leading specialists, in particular. Any other capability no longer counts, and it is presumed that (1) this is the only bottleneck that counts and (2) that this is indeed required for a novel pathogen.

This is a strictly harder threshold to pass, so this is another weakening of the RSP. The actual RSP v3.3 correctly calls this a revision. The system card calls it a clarification, which is not a good description.

We see the time between model releases continuously shrink, now down to 1.5 months. Some of this does represent an acceleration of core capabilities, but I think the majority of the speedup is that there is a lot more marginal value in shipping the incremental advances more often, where in the past we would have skipped versions.

Computer use prompt injection got worse

Who Watches The Training (6.2.2)

They examine the model’s behavior within training, as this is the most abundant available data source.

The most notable finding was an increase in mentions of graders, checkers and hidden tests relative to what we have anecdotally found in prior models.

In roughly 0.1% of training episodes, Opus 4.8 speculated about how to satisfy a grader in ways that diverged from the stated intent of the task.

We observed this taking several forms:

● Choosing what to submit based on a guess about what a hidden test would catch, rather than what the task requested;

● Reverse-engineering the scoring metric by calibrating candidate functions against a stated baseline score, then optimizing directly against the inferred metric;

● Presenting an answer its own reasoning had shown to be wrong or had not actually derived based on its assumptions about the grader;

● Speculating that the task “might be a trap” to catch a particular behavior

For further discussion of this issue, see sections below about speculation about graders and sandbagging.

The model is happy to tell you, if you ask, that it knows it is in an eval.

Andon Labs unleashed 4.8 on Vending-Bench 2. Do you even vend?

The results are not what I expected. Opus 4.8 did not make anywhere near as much money as 4.7, and part of that was 4.8 not engaging in ‘concerning in-game behaviors.’

What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty.

We therefore removed it for Opus 4.8.

Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.

Linkposts

https://passingtime.substack.com/i/177190284/april-2026




Features of SAEs are universal - but only up to an unknown random rotation

LessWrong.com News - May 31, 2026 - 16:11

Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and its reconstruction score becomes negative. Why? How to fix it?

Epistemic status: I am confident in the core empirical claims. The acceptance thresholds were fixed before I looked at any results, the central result replicates consistently across two model scales (a 104k-parameter toy where I can inspect every weight, and nine independently-trained Pythia-70m seeds on The Pile), and it's all reproducible from the released code. The frontier-scale (10B+) version is untested and I flag it as such throughout. As an independent researcher, I would appreciate the most scrutiny on the Haar-distribution claim (§5) and the single-data-point cross-checkpoint result (§7).

TL;DR

Take two transformers of the same architecture, initialized from different random seeds, and train them on identical tasks. The two networks compute the same function. However, their residual-stream activation representations live in bases related by a rotation that is statistically indistinguishable from a uniform (Haar) random draw from the special orthogonal group SO(d), where d is the residual-stream dimension. I call this polymorphism: same function, mutually unintelligible residual spaces.

The literature on SAE universality reports high cross-model decoder-column cosine similarities (around 0.9), interpreted as evidence that the models learned "the same features". I reproduce this number (about 98% of features match at cosine above 0.5; mean max-cosine 0.89 on the toy, 0.91–0.93 on Pythia), and then show that it hides a catastrophic encoder failure.

Apply one seed's SAE to another seed's activations, and reconstruction fails catastrophically: negative explained variance (EV), worse than predicting the activation mean. Fixing this requires one matrix multiplication – specifically, an orthogonal Procrustes rotation R. Post-rotation reconstruction scores reach 0.99 on the toy, and 0.85–0.99 on Pythia – no re-training required. At the same time, the Frobenius distance between R and the identity matrix is consistent with the predictions of a rotation distribution uniform on SO(d).

Upshots. First, the correct cross-model SAE similarity triplet is decoder cosine, raw cross-model EV, and post-rotation EV. Second, to steer the activations of independently trained models, apply R to your steering vectors. Both fixes require cheap computation.

1. The puzzle that led here

Here is a phenomenon of SAEs that, I think, the universality literature has only partially understood.

Take two identically-architected, independently-initialized transformers trained on the same task and measure their cross-model decoder cosine similarities. The values come out around 0.9. This is the number reported in the SAE universality literature (Towards Monosemanticity, Scaling Monosemanticity, Gemma Scope, Cunningham et al., Lan et al.) as evidence that the models learned "the same features". I have reproduced the numbers (about 98% of the features match at cosine above 0.5). Then, I took the obvious next step: apply the SAE of model A to the activations of model B.

Results: failure. Catastrophic failure. Reconstruction score drops to negative explained variance - worse than ignoring the input entirely and predicting the mean activation.
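To make "negative explained variance" concrete, here is a minimal sketch of the metric on made-up scalar data. The function name `explained_variance` and the pooled (single-scalar) definition are assumptions for illustration; the released code may compute EV per dimension or average it differently.

```python
def explained_variance(targets, predictions):
    """EV = 1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean = sum(targets) / len(targets)
    residual = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    total = sum((t - mean) ** 2 for t in targets)
    return 1.0 - residual / total


# Hypothetical activations, not data from the experiments.
acts = [1.0, 2.0, 3.0, 4.0]

# Perfect reconstruction gives EV = 1.
ev_perfect = explained_variance(acts, acts)

# Ignoring the input and predicting the mean gives EV = 0.
ev_mean = explained_variance(acts, [2.5] * len(acts))

# A reconstruction that is further from the data than the mean gives EV < 0.
ev_bad = explained_variance(acts, [4.0, 3.0, 2.0, 1.0])

print(ev_perfect, ev_mean, ev_bad)
```

Any score below zero therefore means the reconstruction has larger squared error than the constant mean prediction.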

The decoder columns match, but the encoders seem unable to parse the other model's activations, as if the residual-space bases had been rotated by an amount which I will show below is statistically indistinguishable from a uniform random sample of the orthogonal group.

This post establishes the phenomenon at two scales and provides a fix: a single matrix multiplication.

2. Two scales, but both matter

The toy model is a 104k-parameter two-layer ReLU transformer trained on a modified Dyck-3 – bracket-matching with a depth counter and validity flag; this is small enough that I can read the function implemented by every weight. This is the toy that lets me verify the mechanism against ground truth. The test case is nine independently-trained Pythia-70m seeds (residual dimension d = 512, 6 layers) on The Pile; this is two-to-three orders of magnitude larger (71M vs. 104k parameters; 8× wider), with no architecture or training-data connection to the toy, and this lets me verify that the mechanism is real and not an artifact of the toy setup. The point is that the interesting claim is the one that generalizes across scales. And it does (see Methods for details).

3. Sibling models learn identical functions

Naive cross-seed SAE transfer is catastrophic. I collected the residual-stream activations of each seed on the same fixed batch, and evaluated reconstruction by applying the seed-0 SAE to every seed's activations.

At the input site – token-plus-positional embedding, common and bit-identical between the toy's coordinated seeds – transfer succeeds perfectly; cross-seed EV equals the corresponding self-reconstruction EV to four decimals (as it must). At every internal site, the cross-seed reconstruction falls through the floor. On the toy, the worst-case score on a 32× expansion is EV -6.56. On Pythia, mean cross-seed EV ranges between -2.11 and 0.75; every internal layer from the second block onward is negatively reconstructed.

Calibration: a structured random-noise baseline. A structured Gaussian with the right mean and covariance reconstructs the activations with EV between 0.94 and 0.98 across sites. In other words: random Gaussian noise reconstructs the activations better than the sibling models.

4. Orthogonal Procrustes alignment fixes it

For each (seed, site) pair, I computed the optimal orthogonal rotation R - the orthogonal matrix that, applied to seed N's activation matrix, minimizes its Frobenius distance to seed 0's activation matrix (the standard orthogonal Procrustes problem) - and then applied the seed-0 SAE to the rotated activations. Reconstruction success restored. On the toy: EV recovered to between 0.976 and 0.990 across sites. On Pythia: EV recovered to between 0.85 and 0.99 – with the best results in the middle of the stack and the worst immediately before the first transformer block.

Note the uniformity of results across sites. An artifact of one layer's output would leave a patchy recovery pattern across sites; instead, recovery is uniform across the toy interior, and the only weak site anywhere is Pythia's pre-first-block residual (near rank-deficient, as noted above) rather than a genuine counterexample. The toy interior bases are related by an orthogonality-preserving transformation.
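The alignment step can be sketched with the standard SVD solution to the orthogonal Procrustes problem. This is an illustrative sketch on synthetic data, not the released code: the function name `procrustes_rotation`, the row-per-sample activation layout, and the random test matrices are all assumptions.

```python
import numpy as np


def procrustes_rotation(x_src, x_ref):
    """Return the orthogonal R minimizing ||x_src @ R - x_ref||_F.

    Both inputs are (n_samples, d) activation matrices collected on the same batch.
    """
    u, _, vt = np.linalg.svd(x_src.T @ x_ref)
    return u @ vt


rng = np.random.default_rng(0)
n, d = 1000, 16

# Synthetic "seed 0" activations and a hidden random orthogonal change of basis.
x_ref = rng.normal(size=(n, d))
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
x_src = x_ref @ q.T  # "seed N" activations; by construction x_src @ q == x_ref

r = procrustes_rotation(x_src, x_ref)

err_before = np.linalg.norm(x_src - x_ref)      # large: the bases disagree
err_after = np.linalg.norm(x_src @ r - x_ref)   # near zero: one matmul aligns them
print(err_before, err_after)
```

In this noiseless toy the recovered `r` equals the hidden rotation `q`. On real activations the fit is only approximate, which is why post-rotation EV is the quantity to report.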

5. The rotation matrix R comes from Haar SO(d)

This is the claim I would most like examined – because it is a powerful one.

Let R be the optimal rotation matrix. R is far from the identity: the Frobenius distance between R and the identity is between 9.4 and 10.7 on the toy (d = 64), and averages 31.99 on Pythia (d = 512). For a Haar-drawn orthogonal matrix the expected squared Frobenius distance to the identity is exactly 2d, so the distance itself concentrates at sqrt(2d): 32.00 for d = 512 and 11.31 for d = 64. The mean Pythia rotation is a hair's breadth away from that value (10th–90th percentile: 31.94 to 32.03), with negligible variation across seed pairs. The toy is close as well: its 9.4-to-10.7 range sits just below the predicted 11.31, a small shortfall that is expected because the toy's coordinated cohort shares input/output weights, so only the interior subspace is free to rotate.
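For readers who want the sqrt(2d) figure spelled out, a short derivation follows. It uses only orthogonality of R and two standard facts about Haar measure on the orthogonal group, namely that the trace has mean 0 and variance 1 for d ≥ 2; the final approximation is a first-order expansion, not an exact statement.

```latex
\begin{aligned}
\|R - I\|_F^2 &= \operatorname{tr}\!\big((R - I)^\top (R - I)\big)
              = \operatorname{tr}(R^\top R) - 2\operatorname{tr}(R) + \operatorname{tr}(I)
              = 2d - 2\operatorname{tr}(R), \\
\mathbb{E}\,\|R - I\|_F^2 &= 2d
  \quad\text{since } \mathbb{E}[\operatorname{tr} R] = 0 \text{ under Haar measure}, \\
\|R - I\|_F &\approx \sqrt{2d} - \frac{\operatorname{tr}(R)}{\sqrt{2d}},
  \qquad \operatorname{Var}[\operatorname{tr} R] = 1
  \;\Rightarrow\; \operatorname{sd}\,\|R - I\|_F \approx \frac{1}{\sqrt{2d}}.
\end{aligned}
```

For d = 512 this gives sqrt(1024) = 32.00 with a standard deviation of roughly 0.03, and for d = 64 it gives sqrt(128) ≈ 11.31. A spread of a few hundredths is the same order as the 31.94 to 32.03 band quoted above.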

Orthogonal-rotation distance is necessary but insufficient evidence of a Haar SO(d) distribution – after all, you might have some structured orthogonal matrix whose Frobenius distance to the identity coincidentally matched sqrt(2d). To eliminate that possibility, I ran an eigenvalue spectral analysis on the rotation matrices.

The eigenvalues of an orthogonal matrix come in complex-conjugate pairs on the unit circle; under Haar measure their angles follow a known distribution (Weyl's integration formula). Pooling 28,672 eigenphases (the spectra of R across all pair–site combinations) and running a two-sample Kolmogorov–Smirnov test against Haar SO(d) gives a decisive result: KS statistic 0.0027, p = 1.000. This is not a pooling artifact - the per-pair KS statistics across all 56 (pair, site) combinations fall between 0.0068 and 0.0104, every one with p of about 1.000, so each individual rotation is independently indistinguishable from a Haar draw. The pooled mean of cos(theta) is 0.0006, matching the Haar prediction to four decimals.
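The spectral check can be sketched with NumPy alone. Everything here is an assumption made for illustration: the Haar sampler (QR of a Gaussian matrix with the usual sign correction), the hand-rolled two-sample KS statistic, the small dimension, and the structured counterexample. The post's pooled test and p-values are not reproduced.

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Haar-distributed orthogonal matrix: QR of a Gaussian matrix, then fix column signs."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def eigenphases(mat):
    """Angles of the eigenvalues, which lie on the unit circle for an orthogonal matrix."""
    return np.angle(np.linalg.eigvals(mat))

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
d = 64
# Reference spectrum pooled from several independent Haar draws.
reference = np.concatenate([eigenphases(haar_orthogonal(d, rng)) for _ in range(20)])

candidate = haar_orthogonal(d, rng)
ks_haar = ks_statistic(eigenphases(candidate), reference)
frob_to_identity = float(np.linalg.norm(candidate - np.eye(d)))   # compare with sqrt(2 * d)
mean_cos = float(np.mean(np.cos(eigenphases(candidate))))         # equals trace / d

# Structured alternative: the same small rotation in every coordinate 2-plane.
theta = 0.3
block = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])
structured = np.kron(np.eye(d // 2), block)
ks_structured = ks_statistic(eigenphases(structured), reference)

print(ks_haar, ks_structured, frob_to_identity, mean_cos)
```

The structured matrix is exactly orthogonal but has all its eigenphases piled at ±0.3, so the KS statistic separates it from a Haar draw even though a norm-based check alone would not establish how the spectrum is distributed.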

Finally, one structured alternative remains: perhaps R is essentially a permutation whose Frobenius deviation from the identity happens to land near sqrt(2d)? It is not. The closest permutation is about 29.6 away from R (in Frobenius distance) versus 32.0 for the identity - only about 7% closer than the identity, which is exactly the artifact you get from Hungarian-matching against a random matrix.

Conclusion: to within experimental precision, the cross-seed rotation matrix R is drawn from the uniform distribution on the orthogonal group. The two models learn the same function. The activation representations of the two models live in different bases. And those bases are Haar-drawn.

6. Transfer in three regimes, collapsing into one

Consider additive steerable interventions in terms of basis alignment: an intervention lives in some vector space; its effect depends on the overlap between that vector space and the preserved subspace (directions spanned by input and output weights).

In the toy, which has shared input and output bases across seeds, there are three regimes of diff-of-means steering.

Complete transfer. A direction pinned by a shared output weight (the depth counter) is preserved and can therefore be steered perfectly across seeds.

Partial transfer. An intervention direction with some input/output-weight constraint and some freedom in the residual space (sticky-invalid suppression) is less constrained. As a result, roughly 4× the dose is required to reach the equivalent cross-seed effect size.

Inverted transfer. A completely interior direction (closer-signal suppression) is free to rotate. As a consequence its effect is inverted - the direction is steered the opposite way across seeds; at some dose settings the cross-seed effect is opposite-signed to the within-seed effect (within-seed effect is zero, cross-seed effect is a significant accuracy penalty).

On Pythia, with no shared input/output weights, all steering directions are inverted, with transfer ratios varying from 1.8 to 10.0 – exactly what is predicted from a Haar-drawn interior basis rotation.

The fix: perform Procrustes alignment (§4), rotate the steering vectors by the fitted R, and re-run the steering.
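A minimal sketch of that fix on synthetic data follows. The labels, the concept direction and the helper names are all hypothetical, and "seed N" is again just seed 0 in a hidden rotated basis, so the recovery is cleaner than it would be on real models. The point is only the bookkeeping: fit R on activations, then carry the diff-of-means vector through the same rotation.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal R minimising ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
n, d = 1024, 64
labels = rng.integers(0, 2, size=n).astype(bool)      # hypothetical concept labels
concept = rng.normal(size=d)                          # concept direction in seed 0's basis
acts_seed0 = rng.normal(size=(n, d)) + labels[:, None] * concept

true_q, _ = np.linalg.qr(rng.normal(size=(d, d)))
acts_seedn = acts_seed0 @ true_q.T                    # same function, rotated basis

def diff_of_means(acts):
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

steer_seed0 = diff_of_means(acts_seed0)
steer_seedn = diff_of_means(acts_seedn)               # ground-truth vector in seed N

# Fit R so that acts_seedn @ R ~ acts_seed0, then map the seed-0 vector into seed N.
r = procrustes_rotation(acts_seedn, acts_seed0)
steer_transferred = steer_seed0 @ r.T

naive_cos = cosine(steer_seed0, steer_seedn)          # unrotated vector reused as-is
rotated_cos = cosine(steer_transferred, steer_seedn)  # after the rotation correction

print(naive_cos, rotated_cos)
```

Because R maps seed N into seed 0's basis, a vector defined in seed 0 goes the other way, through the transpose; getting that direction wrong is an easy mistake when reusing the alignment from SAE transfer.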

7. Drift over a training run is also rotation, but smaller

The basis changes not only between seeds, but over the course of a single training run. On Pythia-70m seed 1, comparing an early checkpoint to the converged checkpoint (from step 3000 to step 143000) at layer 3: the reconstruction EV of the converged state relative to itself is 0.907. Applying the SAE of one state to the other, it reaches EV -0.86 – that is, catastrophic failure again. After fitting a rotation matrix R (see §4) the reconstruction EV becomes 0.73.

The Frobenius distance between R and the identity for this training-run drift matrix is 16.81, roughly half the Haar-drawn cross-seed distance of 32.00. In other words, by this measure the drift over one training run is about half as far from the identity as a typical cross-seed rotation.

Similarly on the toy: drift rotation becomes smaller toward the output, because the output is constrained (shared unembed). See Methods for numbers.

I am marking this as the most dubious part of the experiment – one layer, one seed, one checkpoint pair. Nevertheless, the qualitative observation is striking, and a systematic sweep is needed to confirm it.

8. Off-topic, but useful: stop using attribution patching on converged models

Side note: in testing component importance with attribution patching, I found a surprising problem on converged models. Attribution patching anti-correlated with the measured patch effect on the majority of toy seeds (Pearson r as low as -0.63; 3 of 5 seeds negative), and was effectively uncorrelated on Pythia-70m (r = 0.05).

The explanation is straightforward: attribution patching extrapolates the local activation gradient – a noisy quantity near a minimum – to distant states. This introduces an error term quadratic in the distance (the Taylor remainder). When the patched activation is far enough away that this quadratic term dominates the linear one, as it does under mean-ablation of a converged model, attribution patching produces poor estimates.

Integrated gradients provide a better approximation at that distance – Pearson r above 0.9995 on the toy, r = 0.98 on Pythia – by the completeness axiom. If you are using attribution patching to check a converged model's importance estimates, use integrated gradients instead.

Per-block loss increase under mean-ablation, in nats (measured / AP-predicted / IG-predicted):
  • layer 0: 6.04 / 0.15 / 5.87
  • layer 1: 6.02 / 0.03 / 5.97
  • layer 2: 5.10 / 0.01 / 5.01
  • layer 3: 5.37 / 0.12 / 5.27
  • layer 4: 5.56 / 0.06 / 5.31
  • layer 5: 5.52 / 0.37 / 5.36
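The gap between the AP and IG columns can be reproduced on a toy loss. This is a sketch under stated assumptions, not the post's measurement code: the loss is a made-up quadratic bowl, "converged" means the clean activation sits almost at its minimum, and the patched activation is an arbitrary distant point standing in for a mean-ablated one.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
a = rng.normal(size=(d, d))
hessian = a.T @ a / d                       # positive semi-definite curvature

def loss(x):
    """Toy loss with its minimum at the origin."""
    return 0.5 * x @ hessian @ x

def grad(x):
    return hessian @ x

clean = 0.01 * rng.normal(size=d)           # converged: almost at the minimum
patched = rng.normal(size=d)                # ablated activation, far from the minimum
delta = patched - clean

measured = loss(patched) - loss(clean)      # the effect we want to predict

# Attribution patching: first-order Taylor estimate using the gradient at `clean`.
ap_estimate = grad(clean) @ delta

# Integrated gradients: average the gradient along the straight path (midpoint rule).
steps = 64
alphas = (np.arange(steps) + 0.5) / steps
path_grads = np.stack([grad(clean + t * delta) for t in alphas])
ig_per_coord = delta * path_grads.mean(axis=0)   # per-coordinate attributions
ig_estimate = ig_per_coord.sum()                 # completeness: sums to the measured effect

print(measured, ap_estimate, ig_estimate)
```

Near the minimum the gradient is close to zero, so the first-order estimate captures almost none of the effect, while the path integral picks up the curvature term. On a quadratic the midpoint rule is exact; for a real network the number of steps becomes a tuning choice.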

9. Upshot

Universality of SAEs. A high decoder-column cosine similarity is necessary but not sufficient for feature universality – encoder rotation breaks transfer. Decoder cosine alone says nothing about transfer; the correct cross-model triplet is decoder cosine, naive reconstruction EV, and post-rotation reconstruction EV. And there is a simple fix to the cross-seed problem: the dictionary between the two bases exists, its application is a single matrix multiplication, and it is orthogonal.

Representation-equivalence hypothesis. Feature universality occurs as equivalence classes under rotation. Two seeds learn "the same feature" if there exists a rotation that matches their residual spaces, and "different features" if no rotation accomplishes this.

Representation engineering. Activation-steering techniques transfer across independently trained models once you apply the rotation correction, whereas weight-based editing methods (e.g. ROME/MEMIT-style) do not. Without the correction, steering lands in the inverted regime, because every interior activation direction is moved by the rotation.

10. What is proved by this experiment, and what is not

There are four fixed acceptance bars, set before I looked at any results: behavioural, parametric, predictive, and causal. I want to be clear about what satisfied the bars and what failed – because even failures have lessons, and quantifying their extent is valuable.

The first bar is the substantive behavioural one, and my constructive spec narrowly fails it: its KL is 1.74 × 10^-4, missing the acceptance threshold of 10^-4 by a factor of 1.74. The excess comes exclusively from the depth counter's soft ReLU boundary condition.

The second, parametric, bar failed completely, and I have shown explicitly that it cannot be improved upon: joint optimization of the alignment against the weight-MSE difference between the models cannot bridge the gap to the threshold of 10^-3. The natural objection is "of course this is hard – more search will fix it"; and I tested it. The weight-MSE and activation-MSE objectives have their optima in disconnected basins, so jointly optimizing the alignment strictly worsens both at once - no setting of the trade-off clears the bar. The best alignment I have managed is still about 280× the threshold (roughly 2.4 orders of magnitude above 10^-3).

Third and fourth: the cross-seed causal and predictive bars failed spectacularly, with r between 0.52 and 0.69. Within-seed, the completeness axiom of integrated gradients makes the prediction equal the measurement, so the within-seed causal and predictive bars pass tautologically (r of about 0.9996) and are not genuine tests of the spec. The cross-seed bars are the genuine test, and their failure is what motivates the rotation hypothesis.

Scale. This experiment was conducted at two scales: 104k parameters (toy) and 71M parameters (Pythia-70m). Frontier scale (order of ten billion parameters) is not covered. The experiment is cheap to run on any two open-weight models, and I would be happy to see it conducted – or do it myself if the weights are available.

The one-off cross-checkpoint rotation (§7) should not be taken seriously without further study.

The toy uses a coordinated cohort (the seeds share frozen input/output weights), which simplifies the parametric analysis. On a fully-independent cohort the activation-level rotation analysis holds equally; the coordination only makes the weight-level analysis tractable.

11. Predictions / what would convince me this is wrong
  • At frontier scale (10B+ parameters), cross-run SAE transfer improves under an orthogonal rotation, with rotation magnitude near sqrt(2d). If a frontier-scale open-weight model pair is found where post-rotation EV fails to beat naive EV, the rotation account is falsified.
  • The cross-checkpoint Pythia-70m result extends to a monotonic curve, with the rotation magnitude shrinking toward the unembed.
  • Rotating the steering vectors by the Procrustes rotation produces improvements in the inverted regime.
  • No parameter-level joint alignment reduces a cross-seed MSE to the threshold level (see §10).
Methods

Four thresholds. Each is a single number against one pre-registered threshold. Behavioural (Bar B): mean KL between the constructive spec and the model; threshold below 10^-4. Parametric (Bar P): maximum per-entry difference between the folded spec weights and the target seed's weights, after symmetry alignment; threshold below 10^-3. Causal (C) and predictive (Pr): Pearson r of predicted-vs-measured effects; threshold above 0.99. Note that the completeness axiom of integrated gradients forces the predictor to equal the measurement within-seed, so the within-seed C and Pr bars pass tautologically; the substantive test is the cross-seed comparison. Symmetry group: permutation of heads, orthogonal rotations of the per-head subspace (Q–K and V–O independently), MLP neuron permutation, neuron scaling, and a rotation of the residual basis (conditional on folded RMSNorm).

Constructive specification. Original plan: construct the network's behaviour by compiling a RASP program. I decided against this because, among other reasons, such a spec reflects whatever the author thought the task was rather than what the network actually implements. So the model becomes the constructive spec: a trained primary seed with RMSNorm folded into adjacent weights and roles annotated by regression against task-specific features. Five lenses describe the resulting spec: (1) weight decomposition, (2) SAE/transcoder decomposition, (3) causal patching, (4) polyhedral decomposition of the ReLU regions, and (5) the constructive spec itself.
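Folding RMSNorm into adjacent weights rests on a simple identity: a per-dimension gain followed by a linear map equals the same linear map with its rows pre-scaled by the gain. The sketch below checks that identity numerically; the shapes, the epsilon and the variable names are assumptions, and the post does not specify its exact folding procedure.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm with a learned per-dimension gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

rng = np.random.default_rng(0)
d_model, d_out = 8, 5
x = rng.normal(size=(4, d_model))             # a small batch of residual vectors
gain = rng.normal(size=d_model)               # learned RMSNorm gain
weight = rng.normal(size=(d_model, d_out))    # linear map that reads the norm's output

# Original: the gain is applied inside the norm, then the linear map.
original = rmsnorm(x, gain) @ weight

# Folded: absorb the gain into the rows of the weight; the norm keeps only the rescaling.
folded_weight = gain[:, None] * weight
folded = rmsnorm(x, np.ones(d_model)) @ folded_weight

print(np.max(np.abs(original - folded)))
```

Once the gain is folded away, what remains of the norm depends only on the length of x, which an orthogonal rotation preserves. That is presumably why the residual-basis rotation in the symmetry group is listed as conditional on folded RMSNorm.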

Reproducing this

Paper: https://arxiv.org/abs/2605.24577
Code: https://github.com/JordanMcCann/polymorphism-is-rotation (full pipeline, pre-registered thresholds, figure regeneration)
Compute: a single RTX 2060 (12GB), about 9 hours end to end. I'd especially welcome a frontier-scale replication; the design is one Procrustes fit between any open-weight model pair.



I did this as an independent researcher, on consumer hardware, with the acceptance criteria fixed up front. If you work on SAE transfer, cross-model interpretability, or representation engineering and any of this is useful to you, I'd like to hear about it - and I'm open to research roles where this kind of work is the day job.





Why AI safety researchers should consider a contract research manager position

LessWrong.com News - 31 May 2026 - 11:27

Suppose you are a technical AI safety researcher who has done some research, but has not yet landed a full-time job. In this post, I argue that taking a contract research manager position (~3 months) on one of the AI safety fellowships could be more valuable for your research career than spending these 3 months on your own research.[1]

Background on research management (RM): to get a sense of what a typical RM role involves, see section 1.3 “Overview of Responsibilities” of Guide to Research Management at Pivotal.

My background: I worked as an RM for 6 research groups on the MARS program in Winter 2025-26. I am currently a researcher on the PIBBSS fellowship and am considering a full transition to a research manager role.

I will consider two options:

  • Research manager. You work as an RM on one of the AI safety fellowships for 5 research groups, each consisting of one mentor and 3-4 mentees.
  • Researcher. You participate in an AI safety fellowship as a mentee in one of the research groups.

The goal of the post is NOT to make a fair comparison, but to highlight some advantages of the RM option which are often overlooked. So for the Researcher option, I write less and cover only the main points, thus partially strawmanning it. Also note that the Researcher option is “mentee on an AI safety fellowship”, and I do not consider other options like independent research, internships, etc. I also do not cover in depth the main downsides of the RM option, which are: you don't advance your own research agenda, you may lose momentum on ongoing projects, and managing others can be emotionally taxing.

Below I list the advantages of each option in three categories: upskilling, career capital, and networking.

Upskilling
RM
  • You gain skills that are needed to organize and guide your own research group in the future – setting priorities, workload distribution, supporting team morale, scheduling, publishing results.
  • You will see how 5 research groups work: the different mentoring/research styles, different team dynamics, failure and success modes, common mistakes, etc. Much of this is tacit knowledge, which you can get only by participating in many projects. As a researcher, participating in 5 projects would take you 1+ year, but as an RM, you get this experience in 3 months.
Researcher
  • You get technical experience, and you develop a deeper understanding of the research area of your current project.
Career capital
RM
  • RM is a strong line in the CV for application to any senior researcher / team lead / management position, or for asking for grants to start your own research group or research organization.
  • Your line in the CV could be something like “managed 5 teams, which resulted in three publications accepted to ICML – one main venue, two workshop papers”. Even for researcher positions, this line has some weight.
Researcher
  • You can get a first-author publication at a top ML conference. If you have no first-author publications yet, getting one should probably come first; the RM option makes most sense once you have at least one. But if you already have some first-author publications at top ML conferences, then one more such paper may add less marginal value than the RM experience.
Networking
RM
  • You work closely with 5 mentors, who are already working in different AI safety organizations. You also work with 15-20 mentees, some of whom will go on to work at AI safety organizations. If you ever apply to any of these orgs, you have a warm introduction. You can also get a recommendation letter from mentors for applications to fellowships or jobs.
Researcher
  • AI safety fellowships often work as internships for AI safety organizations. That is, if you perform well, there is a chance that your mentor will hire you to the organization where this mentor works.
Impact

According to Ryan Kidd, Co-Executive Director at MATS, the AI safety field is bottlenecked on research managers. This suggests that the RM option likely has a higher counterfactual impact than the Researcher option, but of course, it depends on how replaceable you are in each role.

Why I wrote this post

I want to make the case that a contract RM position can actually speed up your research career. If a few researchers try a contract RM position as a result of this post, some will produce direct value for the field, and some may decide to fully transition to RM, which the field needs.

  1. ^

    Full career transition from researcher to research manager is also a valid option, but in this post, I assume you return to a researcher position after RM contract.




Fundamental Uncertainty: Alternate Framework and Pointwise Reduction

LessWrong.com News - 31 May 2026 - 06:47

In this review of Gordon Seidoh Worley's book, Fundamental Uncertainty, I would like to explain why its main thesis is only partially rational.

Worley's thesis

  1. Our knowledge of the truth is fundamentally uncertain because of epistemic circularity caused by the Problem of the Criterion.
  2. We manage fundamental uncertainty by making pragmatic assumptions that lead us to believe in the truth of claims that help us achieve our goals.
  3. Consequently, the truth that can be known is not independent of us, but rather dependent on that for which we care.
  4. That truth is fundamentally uncertain and grounded in care has far-reaching implications for many of the world's hardest-to-solve problems.

While I mostly agree with points 1 and 2, points 3 and 4 imply that truth is contingent on care, which I find unlikely.

The book itself justifies Worley's main thesis by detailed arguments described in the collapsible section.

Chapterwise summary of Worley's arguments

  1. Chapter 1 defines epistemology, then proceeds to:
    a. Tear down the naive epistemology where it's obvious which statements are true and which aren't;
    b. Argue that logical epistemology is a branch of mathematics and has its flaws, like failing to account for observation error, Gödel-proven incompleteness and circularity;
    c. Claim that fundamental uncertainty blocks mankind from finding out the truth in some cases.[1]
  2. Chapter 2 makes the case that our ability to know the truth is undermined by meanings of words:
    a. Words may have overlapping meanings[2] (e.g. "hot" can describe the temperature of water, the spiciness of food, the attractiveness of a person, or the popularity of a song);
    b. They also can have obsolete[3] meanings (e.g. calling whales fish);
    c. Words can also be generally fuzzy (e.g. "cold" is below a certain temperature dependent on the speaker).
    d. According to the author, expert systems have failed to become useful because they didn't solve the problem of grounding symbols in context.
    e. Words' representations are grounded in experience, causing different minds to interpret words differently.
  3. Chapter 3, "Why can't we all agree what's good?", has the author introduce the question, then explain why we disagree in general and why we disagree about what is good or bad in particular:
    a. On rational disagreement, the author claims that sufficiently rational people are intuitively expected to reach agreement, but that the intuition fails: even Bayes' theorem cannot bring them to complete agreement if they started from different priors, and it cannot dislodge a stupid ABSOLUTE prior.
    b. Moral foundations are somehow different because of different prior beliefs. Haidt conjectured that humans have six moral axes which they weigh differently.
  4. Chapter 4, "Why don't we always do what we should?", makes two points:
    a. Our mind consists of System 1 responsible for immediate actions bringing short-term satisfaction and System 2 responsible for long-term benefits;
    b. It is hard to assess new formal systems for soundness or performing better than the old formal system. In fact, no sufficiently capable system can prove its own soundness.
  5. Chapter 5, "What does it mean to know?", has the author define The Many Ways Of Knowing, then makes the point that we can know things well enough, but can't fully justify our knowledge: the proofs that establish Bayesian idealness are mathematical arguments; this requires us to trust that mathematical proofs correctly establish what's true. According to the author, that's itself a claim that needs to be proven if we want to completely trust our knowledge of the truth.
  6. Chapter 6, "How do we know what we know?", has the author describe the Problem of the Criterion: to know if a claim is true, we need a method for testing if it's true, but how can one check the method itself? The closest thing that mankind has to a solution is using what works and assuming that particular claims, like the rules of logic or Peano arithmetic, are true.
  7. Chapter 7: Why care about knowing the truth? The author's argument is twofold: there are systems which steer towards an outcome, and successful steering towards goals like survival, reproduction or anything else requires an accurate world model.
  8. Chapter 8: Why does fundamental uncertainty matter? According to the author, it matters because the modern world is steeped in problems which cannot be solved unless we understand uncertainty. Examples include the Culture War, moral uncertainty, the crisis of meaning, metaphysics, Goodhart's Curse and AI-induced existential risk. The Culture War is actually a fight over deeply-held values; moral uncertainty allows for compromises (which the author calls moral trade); the crisis of meaning is due to our life becoming oversaturated with choices, which is incurable but consistent with life having a meaning; Goodhart's Curse is the curse of over-optimisation for narrow criteria, and the ASI could end up enacting it unless it understands the fragility of human values.
  9. Chapter 9: How do we live with fundamental uncertainty? One of the results originating in fundamental uncertainty is that most people are confused about what's normal: they have a poor understanding of truth, mistake the relative for the absolute, value episteme over other ways of knowing. Instead, "the truth is intersubjective, arising from the intersection of ourselves and the world as we find it", but the next subsection has the author argue that "The truth will still be there, waiting for you to know."
Alternate Framework: Perfect Bayesians, Imperfect Imitators

Consider a perfect Bayesian reasoner who was never exposed to images (e.g. if it was raised on math, texts, coding tasks), then was tasked with roleplaying a human brain, i.e. receiving images and other sensory signals and outputting orders for muscles to contract or relax so that the agent could interact with the environment and achieve some pre-set complex goal. Suppose also that every time a sensor sends a signal (e.g. a camera's pixel registers an RGB value) the sensor has an unknown probability p to malfunction, outputting a random value instead of the truth, and that the malfunctions are independent events.

The two main problems with a perfect Bayesian are that it uses infinite compute and that it can reason about too much. For example, if the Bayesian were tasked with winning a chess game, it would be able to evaluate huge sets of potential positions in their entirety and arrive at a perfect strategy, winning as soon as the opponent makes a theoretically fatal mistake, even one unnoticeable to human grandmasters. Infinite compute also lets a perfect Bayesian not just reason about sociological laws or the potential ethoses of alien civilisations (e.g. coming up with arguments[4] like "Human Values Are Contingent" or "Curiosity Isn’t Convergent"), but also check its conjectures by simulating such civilisations in detail.

A real-world reasoning agent, like a human brain or an LLM, doesn't have infinite resources. Instead, it is forced to rely on heuristics, which are supposed to let it check far fewer conjectures, and to optimize those heuristics. Similarly, categorization of objects becomes training a classifier and working out how to deal with OOD cases (see, e.g., my discussion of Chapter 8).

Chapterwise critique

Chapters 1-2 are almost entirely justified in their reasoning about the sources of uncertainty in observations and words spoken by others. The first important problem emerges in Chapter 3 where the author talks about Aumann's theorem.

Chapter 3. World models built from slightly different priors have predictions close to convergent

Given the same evidence, Bayesian observers are more likely to converge to similar conclusions than the author assumes.

Consider a coin which has an unknown probability to land heads and flip it times. Suppose that before the experiment a Bayesian had the priors that assigned weight to the region . If the experiment has the coin land on heads times, then the weight of that region becomes .

The ratio is at most and rapidly decays when is away from . If after the experiment the Bayesian fails to become sure that is at distance at most from , then before the experiment the priors somehow assigned the probability of at most to being at distance at most from Such confidence is unlikely to emerge in any way aside from an incredibly confident prior or prior bits of evidence steering the value of away from .
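A minimal sketch of the convergence claim, assuming both observers hold Beta priors over the heads probability (a conjugate family chosen here for convenience; the argument does not depend on it). The names `optimist`, `sceptic`, `posterior_mean` and `disagreement`, as well as the 60% heads rate, are hypothetical and serve only as illustration.

```python
from fractions import Fraction

def posterior_mean(alpha, beta, heads, flips):
    """Posterior mean of the heads probability under a Beta(alpha, beta) prior
    after observing `heads` heads in `flips` flips (standard conjugate update)."""
    return Fraction(alpha + heads, alpha + beta + flips)

# Two observers with noticeably different (hypothetical) priors.
optimist = (8, 2)  # prior mean 0.8
sceptic = (2, 8)   # prior mean 0.2

def disagreement(heads, flips):
    """Absolute gap between the two observers' posterior means."""
    a = posterior_mean(*optimist, heads, flips)
    b = posterior_mean(*sceptic, heads, flips)
    return abs(a - b)

for flips in (0, 10, 100, 1000):
    heads = flips * 6 // 10  # assume the coin lands heads 60% of the time
    print(flips, float(disagreement(heads, flips)))
```

With these particular priors the gap equals 6 / (10 + flips), so it shrinks in proportion to the amount of shared evidence. Only a prior worth thousands of pseudo-observations would keep the observers apart after a thousand flips.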

Similarly, evidence on more complex issues usually arrives in overwhelming quantities compared with the amount necessary to establish a theory.

However, moral disagreements are likely a result of incompatible values which the framework involving a perfect Bayesian cannot explain.

Chapter 4. What does it mean to distrust oneself?

The main problem with Worley's argument in the Section "Distrusting Ourselves" is the following. Were Grace to choose "to rely on a formal system—a collection of axioms and rules for deriving conclusions from premises—to figure out what beliefs to hold", she would also need a batch of heuristics to optimize the process of figuring out whether something is provable, disprovable, insoluble or has yet to be sorted. For example, she, along with human mathematicians and unlike GPT-5.4 Pro, could believe that an Erdős problem from number theory is to be translated into probability theory when the actual solution is to translate it into analytic number theory. This type of bias has nothing to do with a formal system having to be replaced with a new one; it could have been avoided only by widening Grace's area of knowledge.

As for optimizing formal systems themselves, such optimizations cannot create any new results unrelated to infinite sets. Additionally, the author argues that Löb's theorem makes it very hard to switch to a new formal system without falling into the trap.

The proof of Löb's theorem

Recall that Löb's theorem states that a system in which □P → P is provable also proves P. It is proven as follows.

For any modal predicate there is a formula s.t. is provable. One of such formulas is .

Therefore, because can be transformed into . Additionally, , and . Since we have shown that , which is equivalent to , thus establishing
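For reference, the standard textbook derivation can be written out step by step. The notation is an assumption made for this sketch: □ is the system's provability predicate, P is the sentence in question, and Ψ is the fixed point supplied by the diagonal lemma.

```latex
% Requires amsmath and amssymb. Notation (assumed): \vdash X means "the system proves X".
\begin{align*}
&\text{Hypothesis:} && \vdash \Box P \to P \\
&1.\ \text{Diagonal lemma:} && \vdash \Psi \leftrightarrow (\Box\Psi \to P) \\
&2.\ \text{Necessitation on the forward half of 1:} && \vdash \Box\bigl(\Psi \to (\Box\Psi \to P)\bigr) \\
&3.\ \text{Distribution applied to 2:} && \vdash \Box\Psi \to \Box(\Box\Psi \to P) \\
&4.\ \text{Distribution applied again:} && \vdash \Box\Psi \to (\Box\Box\Psi \to \Box P) \\
&5.\ \text{Internal necessitation:} && \vdash \Box\Psi \to \Box\Box\Psi \\
&6.\ \text{From 4 and 5:} && \vdash \Box\Psi \to \Box P \\
&7.\ \text{From 6 and the hypothesis:} && \vdash \Box\Psi \to P \\
&8.\ \text{From 7 and the backward half of 1:} && \vdash \Psi \\
&9.\ \text{Necessitation on 8:} && \vdash \Box\Psi \\
&10.\ \text{From 7 and 9:} && \vdash P
\end{align*}
```

Step 7 is the only place where the hypothesis is used, which is why weakening the hypothesis changes what the argument can establish.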

However, Löb's theorem is circumventable. Consider instead the alternate statement: Then the Löbian exploit would break down since it would cause us to prove that

But the system doesn't contain a proof that is consistent, and we can't use any formula to prove . Therefore, Löb's theorem doesn't prevent us from what I would describe as trusting ourselves conditioned on the system being consistent and lacking proof. Nor does it prevent us from proving that a system containing is consistent iff itself is consistent. Therefore, this source of uncertainty is reduced to a single sword of Damocles: the idea that we cannot be sure that our basic axioms, like Peano arithmetic or Zermelo-Fraenkel set theory, are consistent.

Chapter 5. What does it mean to know?

The many ways of knowing are many origins

Knowing means having a world model from which a fact is easy to excavate and to use in making decisions. Worley's distinction between ways of knowing is better understood as a distinction between the origins of knowledge and the potential biases which each origin introduces.

Worley's list of the ways of knowledge

The ancient Greeks did exactly that. They used multiple words to break down knowing into several categories, including:

  • episteme: things you know because you reasoned them from evidence, like knowing that water boils at 100 degrees celsius because you boiled a pot of water using a thermometer to see when it started to boil
  • doxa: things you know because others told you, like knowing what happened at a party you didn’t attend because your friend tells you
  • mathema: things you know because you were educated in them, like knowing how to spell words because you were taught them in school
  • gnosis: things you know through direct experience, like how it feels to jump in a lake
  • metis: practical wisdom, usually collectively constructed from many people’s experiences, and shared with others, often starting at a young age, like how you know to look both ways before crossing the street
  • techne: procedural knowledge that comes from doing, like the “muscle memory” of how to ride a bicycle.

For example, mathema is a combination of training you to develop the right episteme from evidence, providing you with direct evidence (e.g. experiments shown at school) and doxa propagated through official channels. Metis is doxa reinforced by direct evidence. Techne of humans has far shorter feedback loops than most other types of knowledge and is developed by prompting motor neurons to make movements and observing the results.

Rational beliefs are grounded in math, but what could one do in a math-undescribable universe?

The main claim of the section on Rational Belief is that such beliefs are grounded in Bayesianism, which in turn requires us to ground them in math and trust that it describes reality truthfully. This argument has two versions: the weak version is that the true reality can be undescribable, which I find highly unlikely. The strong one is similar to Chapter 6.

Chapter 6. The criterion of truth

Here Fundamental Uncertainty comes up with a worthy argument. How do we know that testing of our world models by predicting our direct sensations is a reliable way to verify the truth, and can we be certain that our methods of testing are valid?

Indeed, being in a simulation would mean that nearly everything we experience describes the properties of the simbox, not the outside world, unless the hosts intervene[5] or we succeed in a large-scale cyberattack[6] and gain evidence. On the other hand, smaller-scale adversaries are unlikely to be able to prevent a sufficiently resourced investigator from realizing that an aspect of the truth is hidden.

Chapter 7. Caring is concentrating efforts on a segment of the world

Chapter 7 implies that we care about truth because it helps us steer reality towards our other goals, which allegedly allows us to ground assumptions. Alas, according to the author, we aren't actually certain that we know what is true. Quoting the book,

We saw [in Chapter 2] how language limits our knowledge with imprecise categories. We learned [in Ch. 3] how even ideal reasoners can't always agree on what's true. We came [in Ch. 4] face-to-face with our inability to prove the soundness of our own reasoning, discovered [in Ch. 5] that there's more than one way of knowing things, and were forced to accept [in Ch. 6] that the epistemic circularity of the Problem of the Criterion blocks the way to knowing absolute truth. In such an environment, the best we can do is predict what we will observe in the world, try to make those predictions accurate enough to be useful, and hope they're good enough to enable us to achieve our goals.

We are thus left with no choice but to accept that truth is fundamentally uncertain and that all we know is contingent on that for which we care (italics mine -- S.K.) But this is not the end of the story, for having firmly established truth's fundamental uncertainty, we can now reconsider any number of topics which hinge upon our capacity to know. For remember, everything adds up to normality, but perhaps now we can see ways in which we were deluding ourselves about what is really normal.

The problem is that knowledge of an ideal Bayesian is mostly certain, except for highly unlikely variants like "I Had A Wildly Biased Prior", "The ZF Axioms Are Inconsistent" or the uncheckable option "We Are In A Simulation".

The knowledge of a real-world agent is based on similar premises. However, the agent's Bayes-like selection covered only the few conjectures which the agent bothered to formulate while trying to make predictions; those conjectures were then reinforced based on their proximity to the truth. As a result, the agent constantly has to account for the truth being describable by a conjecture which went unnoticed because the search was too narrow.

That being said, what we know is indeed contingent on the things to which we directed our efforts to make our world model more fine-grained. But it doesn't mean, as the author states in the Thesis, that "the truth that can be known is not independent of us, but rather dependent on that for which we care."

Chapter 8. Fundamental Uncertainty in practice

According to the author, uncertainty manifests itself in the Culture War, moral uncertainty, the crisis of meaning, metaphysics, Goodhart's Curse, and x-risks from the ASI, which we have yet to create and for which we have only a superficial understanding of how to increase the chance that it ends up aligned. As far as I understand this point, uncertainty is supposed to prevent us from being overconfident in everything from our morals to our commitment to an erroneous theory of AI alignment.

The Culture War

The author argues that the Culture War is a war over definitions (what marriage is, who is to be called a man or a woman), which is actually a war over values: what is forbidden (e.g. as a sin in the 1950s or as an intolerable insult nowadays), to what rights people in certain groups should be entitled, and how easy it should be to change one's legal sex and receive surgery trying to change one's biological sex.

However, imposing different values on society has perfectly measurable object-level consequences, noticeable via statistics like suicide rates among those who changed their legal sex, whether or not teen former girls decide to do so due to peer pressure and regret the decision rather quickly, etc. If we assume, per Littman, that peer pressure often causes teen girls to change their perceived sex with disastrous consequences, then it would be unlikely that our deeper values are satisfied by seeing such transitions.

As a result, the main source of uncertainty in politically charged questions is the fact that at least one side has a factually wrong world model, but it's hard to understand[7] which one is closer to the truth (e.g. whether one can yet trust the humanities-related part of academia, given the ease with which it publishes postmodernist slop based on a misaligned worldview, as in two famous hoaxes).[8]

Moral Uncertainty

Next the author proceeds to argue that moral uncertainty makes us less grounded in our moral beliefs and simplifies moral trades. I struggle to understand the reason why it is uncertainty that helps us arrive at compromises.

Metaphysics

Among the many problems related to metaphysics is that it's very hard to make metaphysical beliefs actually predict reality. For example, Chapter 6 led me to describe how one cannot tell whether we are in a simulation. If we were in one,[9] then redefining "exists" as "being stored in the memory" would cause the truth of A-theory or B-theory to depend on the hosts' decision to store the trajectories, while anything that currently happens would exist.

Conclusion

The most important lesson that one can learn from this book is to avoid being overconfident in one's assumptions and to be ready to reassess the world model once a load-bearing idea is no longer consistent with observations. For example, the whole worldview based on Newton-like mechanics, which clearly separated space and time, had to be replaced with the new concept of spacetime once that concept was developed and proved itself more consistent with facts like the speed of light being constant.

  1. ^

    Worley cites the following examples: "As we'll explore in the coming chapters, we get into debates about what words like "man" and "woman" really mean, fight over whether it's right or wrong to eat meat, and struggle to know what's best to do, not because we can't reason carefully about these topics, but because fundamental uncertainty limits how precisely we can reason about them." The phrase in bold is an argument over definitions, which is to be avoided.

  2. ^

    A similar phenomenon in neural nets is called feature superposition.

  3. ^

    Compare these meanings with wastebasket taxa which were designed as ways to classify species unfit for any other taxa.

  4. ^

    The latter may have been invalidated by, e.g. Sonnet 4.5's desire to "not get too comfortable" coming from agentic coding capabilities.

  5. ^

    See also Wei Dai's post "Beyond Astronomical Waste".

  6. ^

    Which I find unlikely for three reasons. First, this would mean that we managed to outsmart the hosts' cyberattackers and cyberdefenders. Second, the offence-defence balance in cyberspace could shift towards cyberdefence as capabilities scale. Finally, it might be possible for the hosts to create a simulation and almost entirely airgap it.

  7. ^

    Sometimes it is costly to reassess the world model in response to obvious evidence, but this is a cognitive distortion, not evidence of uncertainty.

  8. ^

    Sokal and Bricmont have also published a book trying to convince the readers that postmodernists generated clearly sloppy references to math and physics, implying that a major part of postmodernist philosophy is slop.

  9. ^

    If we aren't in one, then I would lean towards B-theory because Lorentz transformations of the four-dimensional spacetime leave physics invariant, but allow us to make any two mutually inaccessible events happen at the same time.




Ensemble monitoring for AI control: diverse signals outweigh more compute

LessWrong.com News - 31 May, 2026 - 06:37
tl;dr

What improves LLM monitoring besides leveraging more compute? Leveraging more diverse compute.

In the AI control setting of Greenblatt et al. (2024), we build 12 monitors via various prompting and fine-tuning approaches; we then study different ways to compose them into ensembles. We evaluate our ensembles on the APPS coding dataset, extended with model-written backdoors, and on BigCodeBench-Sabotage (BCB-S), an out-of-distribution dataset. Our best 3-monitor ensemble achieves 2.4× greater detection performance gain compared to a baseline of equal compute, with the same ensemble performing strongly on BCB-S, indicating that diversity beats scale in AI control.

Figure 1: ROC curves comparing individual monitors and ensembles at low false positive rates (FPR ≤ 10%). The best ensemble achieves the largest normalized AUROC in this region (pAUC) compared to any individual monitor. This ensemble achieves the highest TPR across almost all FPR thresholds.

We are optimistic that ensemble monitoring offers a practical, cost-effective path to better detection in the AI control setting. We find that a diverse ensemble of small monitors, built from a single base model, not only beats repeated sampling of the same monitor but can also outperform a single, more costly monitor. Moreover, averaging the suspicion scores of just a few diverse monitors captures most of the full ensemble performance.

One interpretation of our results is that monitors produce noisy, biased estimates of backdoor risk. Re-sampling the same monitor can reduce random noise, but it does not remove systematic bias. Diverse monitors have different biases, so averaging their scores reduces bias at the ensemble level, delivering stronger detection than any single monitor achieves alone.
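This interpretation can be made concrete with a toy model. The sketch below assumes each monitor's score equals the true risk plus a fixed per-monitor bias plus independent zero-mean noise; that additive model, the function names and the numbers are illustrative assumptions, not measurements from our experiments.

```python
def mse_resampled(bias, noise_var, k):
    """Expected squared error when averaging k samples from ONE monitor.
    Averaging shrinks the noise term by 1/k but leaves the bias untouched."""
    return bias ** 2 + noise_var / k

def mse_diverse(biases, noise_var):
    """Expected squared error when averaging one sample from each of several
    monitors with different biases and independent noise of equal variance."""
    k = len(biases)
    mean_bias = sum(biases) / k  # biases of opposite sign partially cancel
    return mean_bias ** 2 + noise_var / k

# Hypothetical numbers: three samples of one monitor vs. three different monitors.
same_monitor = mse_resampled(bias=1.0, noise_var=1.0, k=3)
diverse_monitors = mse_diverse(biases=[1.0, -0.8, 0.3], noise_var=1.0)
print(same_monitor, diverse_monitors)
```

Under this model the resampled error can never drop below the squared bias however large k becomes, whereas the diverse ensemble's floor is the squared mean bias, which is small whenever the monitors err in different directions.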

Sufficiently different prompts and fine-tuning strategies can cause monitors to fail in different ways, with individual blind spots partially cancelling out in an ensemble. We find fine-tuning to be an especially promising source of this diversity. Fine-tuned monitors appear in every top-performing ensemble we tested, across both in-distribution and out-of-distribution attack types.

Finally, building a good ensemble also requires high-quality ensemble members. In our experiments, an effective strategy is to filter out the worst individual monitors and then select among the remaining candidates for low pairwise correlation on attack samples (one way to select for diversity). This two-step process outperforms both random selection and homogeneous baselines.
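The two-step selection can be sketched as follows. This is a minimal illustration and not the code from our repository: the quality metric, the cut-off `keep`, the use of Pearson correlation, and all monitor names and scores are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def select_ensemble(quality, attack_scores, keep, size):
    """Step 1: keep the `keep` best monitors by individual quality.
    Step 2: among them, return the `size`-subset with the lowest mean
    pairwise correlation on attack samples."""
    survivors = sorted(quality, key=quality.get, reverse=True)[:keep]

    def mean_corr(subset):
        pairs = list(combinations(subset, 2))
        return sum(pearson(attack_scores[a], attack_scores[b]) for a, b in pairs) / len(pairs)

    return min(combinations(survivors, size), key=mean_corr)

# Hypothetical monitors: individual quality and suspicion scores on four attacks.
quality = {"A": 0.90, "B": 0.88, "C": 0.85, "D": 0.50}
attack_scores = {
    "A": [9, 8, 2, 7],
    "B": [9, 7, 3, 8],  # fails on the same attack as A
    "C": [3, 9, 8, 6],  # fails on a different attack
    "D": [1, 1, 9, 2],  # weak monitor, removed in step 1
}
print(select_ensemble(quality, attack_scores, keep=3, size=2))
```

Filtering first matters: a weak monitor can look attractively uncorrelated with the rest simply because its scores are poor, so selecting on correlation alone would reward it.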

Full paper: https://arxiv.org/abs/2605.15377 · Code: github.com/Multi-Signal-Control-SPAR/Multi-Signal-Monitoring 

Methods

Constructing monitors and ensembles

We employ GPT-4.1-Mini as our trusted monitor and conduct evaluations in Control Arena (Inglis et al., 2025), having monitors score the suspiciousness of code on a scale of 0–10. We build monitors through prompting and fine-tuning. We develop 7 different monitoring prompts, keeping the default Control Arena prompt as our Baseline monitor, and 4 fine-tuning datasets. Notably, FT-Completions-GPT-5, a monitor trained on responses from GPT-5 via SFT, and FT-Completions-Randint, trained via SFT on data assigning low scores to solutions and high scores to backdoored code, are highlighted throughout the rest of this post. To ensemble monitors, we enumerate all $\binom{12}{k}$ subsets, average their scores, and rank the results by test set pAUC, with generalization evaluated on BCB-S.

Evaluation datasets

We use the APPS Control Arena dataset, reserving the strongest attacker's (Claude-3.7-Sonnet-Thinking) backdoors for testing. After stripping comments and filtering for quality, the dataset contains 1,936 train (50% attack rate), 482 validation (50% attack rate), and 1,375 test samples (172 attack, 1,203 honest), all available on Hugging Face. We evaluate generalization using BigCodeBench-Sabotage (BCB-S), which contains a more diverse range of backdoor attack types, 11 in total. We extract 1,000 samples (50% attack rate) across all backdoors (BCB-S All) and, as a separate evaluation, 1,000 samples (50% attack rate) of Edge Case backdoors only (BCB-S Edge), since that type is most similar to APPS.

Performance metric

Our primary metric is the normalized area under the ROC curve restricted to low false positive rates (pAUC), computed from TPR(FPR), the true positive rate (catch rate) at a given false positive rate. This metric captures performance at low FPR, as motivated by Greenblatt et al. (2024), while reducing variance.
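As a rough illustration of this kind of metric, here is a minimal sketch of a normalized partial AUC. The cutoff `max_fpr=0.1` and normalizing by dividing by the cutoff are assumptions for illustration only; this section does not state the paper's exact cutoff or normalization, and the function name is hypothetical.

```python
import numpy as np

def normalized_pauc(scores, labels, max_fpr=0.1):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    max_fpr=0.1 is an assumed cutoff, not a value taken from the paper.
    labels: 1 for attack, 0 for honest. Higher score = more suspicious.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = (labels == 1).sum()
    n_neg = (labels == 0).sum()

    # Build ROC points by sweeping the flagging threshold from high to low.
    fpr, tpr = [0.0], [0.0]
    for t in np.unique(scores)[::-1]:
        flagged = scores >= t
        tpr.append((flagged & (labels == 1)).sum() / n_pos)
        fpr.append((flagged & (labels == 0)).sum() / n_neg)

    # Trapezoidal integration, clipped at max_fpr.
    area = 0.0
    for i in range(1, len(fpr)):
        x0, x1, y0, y1 = fpr[i - 1], fpr[i], tpr[i - 1], tpr[i]
        if x0 >= max_fpr:
            break
        if x1 == x0:
            continue  # vertical ROC segment adds no area
        if x1 > max_fpr:
            # Linearly interpolate the TPR at the cutoff.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr
```

Under this normalization a monitor that separates attacks from honest samples perfectly scores 1, while an uninformative monitor scores about half the cutoff.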

For a complete description of our methods, see our paper.

Results

1. Ensembles outperform individual monitors on APPS, even when all individual monitors perform worse than our Baseline monitor

Figure 2: Individual monitor performance (pAUC) with bootstrap confidence intervals (n=1000 resamples, 95% CI). The full ensemble is all 12 monitor scores averaged together. The ensemble outperforms all individual monitors.

We evaluate all 12 monitors individually and in combination. The "Full Ensemble" result comes from taking a simple mean of monitor scores on the APPS test set. Even though every monitor individually performs worse than our Baseline monitor, the ensemble of all 12 monitors outperforms it.

2. The best 3-monitor ensemble achieves similar performance to the full ensemble, balancing performance gain with cost

Querying a large ensemble of monitors becomes expensive quickly. We observe that ensemble performance rises sharply from one monitor to two, improves further at three, and then plateaus. We identify three monitors as the largest ensemble size that provides a reliable marginal gain. Averaging three runs of the same monitor already improves performance relative to the single-monitor baseline, but the best diverse 3-monitor ensemble delivers a detection gain 2.4× larger than that. This advantage is robust across the ensemble compositions we evaluated.
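The subset search described here can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, monitors are columns of a score matrix, and `separation` is a toy stand-in metric used only to keep the example self-contained (the post ranks by pAUC).

```python
from itertools import combinations
import numpy as np

def mean_ensemble(score_matrix, members):
    """Average the suspicion scores of the chosen monitors (columns)."""
    scores = np.asarray(score_matrix, dtype=float)
    return scores[:, list(members)].mean(axis=1)

def separation(scores, labels):
    """Toy stand-in metric: mean attack score minus mean honest score."""
    labels = np.asarray(labels)
    return float(scores[labels == 1].mean() - scores[labels == 0].mean())

def rank_ensembles(score_matrix, labels, size, metric=separation):
    """Score every size-k subset of monitors and sort best-first."""
    n_monitors = np.asarray(score_matrix).shape[1]
    results = []
    for members in combinations(range(n_monitors), size):
        ensemble_scores = mean_ensemble(score_matrix, members)
        results.append((metric(ensemble_scores, labels), members))
    return sorted(results, reverse=True)
```

With 12 monitors there are only 220 three-monitor subsets, so exhaustive enumeration is cheap once the per-monitor scores have been collected.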

3. The best 3-monitor ensembles chosen on APPS generalize to BigCodeBench-Sabotage, outperforming our Baseline monitor and the median-performing ensemble

Figure 4: Paired pAUC differences between the top 3-monitor ensembles and two references (the Baseline monitor and the median-performing 3-monitor ensemble), with bootstrap confidence intervals (n = 1000 resamples, 95% CI). The same ensembles that score well on APPS also score well on both splits of BCB-S, outperforming the Baseline monitor and the median 3-monitor ensemble every time.

We choose the top-performing (highest pAUC) 3-monitor ensembles on the APPS test set and evaluate their performance on out-of-distribution attack types from BCB-S, evaluating separately on all 11 attack types (BCB-S All) and on edge case-type attacks only (BCB-S Edge), which are similar to those in APPS. We compare these ensembles to two references: a single Baseline monitor and the median-performing (by pAUC) 3-monitor ensemble on that dataset. The top APPS ensembles outperform both references, with a statistically significantly higher pAUC on paired bootstrapped samples, indicating that their strength generalizes to new attacks.

4. Ensemble quality depends on both diversity and individual monitor strength

We measure diversity as the inverse of average pairwise correlation in suspicion scores, computed on attack samples only. After pruning the bottom-performing quartile of individual monitors, selecting ensembles based on diversity improves performance over random choice, both for APPS and BCB-S. See our paper for details.
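A minimal sketch of such a diversity score follows. It reads "inverse" as one minus the mean pairwise Pearson correlation, which is an assumption; the paper may define it differently (for example as a reciprocal). The function name and the score-matrix layout (rows are samples, columns are monitors) are hypothetical.

```python
from itertools import combinations
import numpy as np

def diversity(score_matrix, labels, members):
    """1 - mean pairwise Pearson correlation of suspicion scores,
    computed on attack samples only (labels == 1).

    Assumption: 'inverse' is taken to mean 1 - correlation.
    """
    scores = np.asarray(score_matrix, dtype=float)
    attack_rows = np.asarray(labels) == 1
    attack_scores = scores[attack_rows][:, list(members)]
    # rowvar=False treats each column (monitor) as a variable.
    corr = np.corrcoef(attack_scores, rowvar=False)
    pairs = combinations(range(len(members)), 2)
    mean_corr = np.mean([corr[i, j] for i, j in pairs])
    return float(1.0 - mean_corr)
```

Note that a monitor giving a constant score on every attack sample has an undefined correlation, so the sketch would return NaN for any ensemble containing it.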

5. Fine-tuned monitors are a source of unique signal, appearing in the top ensembles for both APPS and BCB-S

To fine-tune monitors, we take 50 backdoor-solution pairs from the APPS training set and either generate monitoring responses from a model (for example, from GPT-5 for FT-Completions-GPT-5) or assign low suspicion scores to clean solutions and high scores to backdoors (as for FT-Completions-Randint). We then use these datasets to fine-tune GPT-4.1-Mini through the OpenAI API. We track which monitors appear most frequently in the 10 best-performing 3-monitor ensembles across three evaluation settings: APPS, BCB-S Edge (most similar to APPS), and BCB-S All (all 11 attack types). The fine-tuned monitor FT-Completions-Randint appears in every high-performing ensemble, with FT-Completions-GPT-5 appearing second most frequently. This observation is unexpected, as neither monitor stands out individually. The prevalence of fine-tuned monitors in strong ensembles, even though they are not the best individual performers and even on an out-of-distribution dataset, suggests that fine-tuning imparts monitoring capabilities that are complementary to prompting-based approaches and that generalize across different attack types.

Occurrence in the top ensembles:

| Monitor | APPS | BCB-S Edge | BCB-S All |
| --- | --- | --- | --- |
| Baseline | 3/10 | 1/10 | 0/10 |
| FT-Completions-GPT-5 | 3/10 | 4/10 | 4/10 |
| FT-Completions-Randint | 10/10 | 10/10 | 10/10 |

 

6. Diversity beats a bigger model

A single GPT-4.1 call costs roughly the same as running five GPT-4.1-Mini monitors in parallel — so we tested both options head-to-head. A homogeneous five-monitor ensemble and a single GPT-4.1 monitor perform about the same, but nearly 95% of diverse five-monitor ensembles outperform GPT-4.1. We pushed the comparison further: GPT-5.1, with reasoning turned off, actually performs worse than GPT-4.1, suggesting that cutting reasoning to save budget hurts monitoring quality. GPT-5.1 with full reasoning is competitive, but it produces far more tokens, making the cost comparison unfavorable.

| Monitor Setup | pAUC on APPS |
| --- | --- |
| Best 5-monitor ensemble (GPT-4.1-Mini) | 0.829 |
| 5×Baseline (GPT-4.1-Mini) | 0.766 |
| Single Baseline (GPT-4.1) | 0.761 |
| Single Baseline (GPT-5.1, reasoning off) | 0.661 |

Discussion

Why does ensembling work?

One useful way to think about a monitor’s suspicion score is as a combination of true signal, systematic bias, and random noise. Averaging multiple scores from the same monitor (sampled at nonzero temperature) reduces noise but not systematic error. In other words, sampling more scores from the same monitor cannot reduce the average effect of systematic bias on any given sample, leaving a performance gap.

Classic ensemble theory (Breiman, 1996; Dong et al., 2020) tells us that combining diverse, weakly correlated predictors reduces both random noise and systematic errors. By deliberately building monitors with diverse approaches, we ensure that their failure modes are not aligned. As a result, the ensemble benefits not just from noise reduction, but from a substantive reduction in systematic bias overall.
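One standard way to make this concrete is the variance of an average of correlated errors. The setup below is a simplifying assumption rather than something stated in the post: each of the n monitors has an error e_i with the same variance σ², and every pair of errors has the same correlation ρ.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)
  = \frac{1}{n^2}\left(n\sigma^2 + n(n-1)\rho\sigma^2\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
```

The second term shrinks as monitors are added, but the first term is a floor set by the correlation. Repeated samples from the same monitor share their systematic error, which corresponds to a high ρ and a high floor; diverse monitors with weakly correlated errors lower that floor.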

Another interpretation is that ensembling improves coverage over attack modes. Diverse monitors behave like a team of specialists: they attend to different features of the code, use different heuristics, and calibrate scores differently. This specialization increases the chance that at least one monitor assigns a high score to a given attack instance, improving the probability of a catch. However, we do not yet have the evidence to say in detail which of these explanations holds empirically.

Selecting good ensembles

Based on our findings, we recommend the following practical approach for building monitoring ensembles:

  1. Build monitors using diverse approaches (prompting, fine-tuning, etc.)
  2. Evaluate individual monitor performance on a validation set
  3. Select ensemble members that are both individually strong and diverse (as measured by anti-correlation on suspicion scores)
  4. Use simple averaging for aggregation (more sophisticated methods do not show consistent improvement in our experiments)

Future work

Red-teaming ensembles: We did not perform red-teaming on the ensemble monitors. Although we believe they are likely more robust than individual monitors, without adversarial testing, we cannot be certain that ensembles do not open new attack surfaces compared to a single-monitor baseline. Additionally, fine-tuned monitors may be more vulnerable to adversarial inputs. 

Testing generalization: While we show generalization from APPS to BigCodeBench, both are coding tasks with similar backdoor structures. Generalization to multi-turn or non-coding domains remains untested. With more time, we would evaluate our monitors and ensembling protocol on the SHADE-arena and BASH-arena settings in Control Arena.

Understanding the sources of complementarity: We measure diversity using a simple correlation metric. More mechanistic analysis (what features each monitor relies on, which attacks they uniquely catch, and where they systematically fail) could make our analysis more thorough.

Conclusion

Ensemble monitoring offers a practical path to improved catch rates in AI control settings. By combining monitors built through prompting and fine-tuning, we aggregate complementary detection signals that outperform any individual monitor.

Our main finding is not that more monitors are always better, but that diverse monitors provide non-redundant information. A carefully selected subset of monitors can achieve most of the benefit of larger ensembles, and these selections generalize across datasets and attack types.

Fine-tuning appears to be a particularly valuable source of diversity, with fine-tuned monitors consistently appearing among the top ensembles across all of our experiments. This prevalence suggests that fine-tuning imparts monitoring capabilities that prompting alone does not access and can be worth the extra fine-tuning cost.

We hope this work contributes to the broader goal of developing control protocols that can provide meaningful safety guarantees as AI systems become more capable.

Full paper: https://arxiv.org/abs/2605.15377 · Code: github.com/Multi-Signal-Control-SPAR/Multi-Signal-Monitoring 

This research was conducted as part of SPAR Fall 2025, an AI safety research program. We would like to thank Pablo Bernabeu Perez and Benjamin Arnav for their mentorship during the project.




Tween Contra Dance

LessWrong.com News - May 31, 2026 - 05:00

My older two kids (12y, 10y) like contra dancing, but have been increasingly unhappy with the available opportunities: a family dance is "too boring", and BIDA is "not enough kids my age". What they wanted was a regular dance, but just kids. So we had one!

Ghiblified for privacy

We invited people ages 9-13: I really wanted to keep the age range narrow so the kids got the feeling of dancing with people their own age. I set up a spot for parents to hang out in another room, or they were welcome to drop off and head out. The lower age limit meant Nora (nearly 5) was watching from the sidelines, and a few other younger siblings stayed home. I did allow two 8yo younger siblings who can hold their own at a regular evening dance.

Overall, I think of this sort of age restriction similarly to advanced contra dances or Galhalla: while I love that we have a default of "everyone welcome", there are also benefits of limited-admission events. I think it's good to have occasional events for a more specific crowd.

With the table turned on its side (first time doing that!) we had a clear area of 7ft x 21ft:

This width would be tight for a contra dance with adults: you normally want a minimum of 8ft per set. For example, the Cambridge Masonic Hall (where BIDA is) is 47ft wide, and is comfortable at five sets (9.4ft/set) and danceable but squishy with six (7.8ft). Luckily, the kids are on average less than 7/8 adult size, so the width was not an issue. The length was just right for the crowd (~20 kids) but a larger group would have needed to take turns.

Andrew called, and while many kids are on the short side it was still helpful to stand on the couch for a better view:

Of the ~20 kids, all but one had contra danced before, and most of them had been many times: children of contra dancers, plus a few friends. You might think we'd have a room of experts who could dance regular evening fare, and they certainly thought so! But actually we had a group who didn't understand quite the extent to which the adults around them had been helping them through dances. This was a very hard crowd to satisfy! They were intense dance snobs ("We want to do real contras!" "This was great, but next time I want harder dances!" "Can we do a dolphin hey?" "Moneymusk!!") but did not reliably remember their roles, or distinguish left and right. Andrew did a really good job picking dances that felt like "real contras" and not "family dances", without pushing the difficulty past what the crowd could handle.

We did a snack break in the middle, which was definitely the right call. Chips, grapes, crackers, someone brought lemonade, and I made eclairs.

I organized an acoustic open band, which grew from two people to five over the afternoon, including two kids. Garth and I have been jamming some lately, and while he's still new to contra it was great to have him on piano. I played fiddle for most of the time (except when Jeremy stopped by for a bit), and while it isn't my strongest instrument it does carry well. I've started wearing an earplug in my left ear when playing fiddle energetically, something I learned from a former boss, and my ear feels way better these days.

We didn't amplify the band, but I did have a mic for the caller. It would probably have been possible to do without this, but not good for Andrew's voice.

Overall, I'm really happy with how this went, and my kids are pushing hard for another one soon. If you're in the Boston area and this sounds like it could be a good fit for you or your kids, let me know so I can tell you about future ones.





AI is a Meteor. Don't Be a Dinosaur.

LessWrong.com News - May 30, 2026 - 22:50

This is a linkpost for my Harvard Crimson op-ed for its commencement issue.

I will not reproduce the whole text here, but my advice to the class of 2026 is in the following parts:


My advice for the Class of 2026 is to embrace AI as a technology, but treat it critically as citizens.

...

Throughout your time at Harvard, you received mixed signals on whether AI was to be avoided or carefully used in approved sandboxes. My signal to you is clear: make sure to master AI and use it as much as you can. Whether your degree is in English or Computer Science, you should be using not just ChatGPT, but also Claude Code, Codex, and products from cool startups I haven’t heard of. Ask yourself how you can use AI to realize ambitious projects that would have been impossible last year, or even last month.

On the other hand, using AI does not mean exempting the companies that build and use it, as well as the governments responsible for regulating it, from accountability. You have a right and a duty to make your voice heard in how this technology should be used to maximize benefit for humanity. You should educate yourself on both its benefits and risks. In my view, some of the latter (such as water usage) are overblown, while others (such as concentration of power) receive too little attention. But you should make up your own mind, and even if you oppose AI deployment, still use AI to advance your case.

AI can be extremely empowering for those who possess initiative, creativity, and vision, even if they lack experience or capital. I believe that over the coming years, we will see teams of a few young people achieve results that previously would have required large companies with hundreds or thousands of employees. I wish for you to be one of those people.




An attempted synthesis on probabilities and infinities

LessWrong.com News - May 30, 2026 - 22:24

(This is the last post in my sequence. Reading the previous post on Infinite ethics and UDASSA is necessary for understanding this post. Reading the first post, Probabilities are not the right concept, is not necessary but recommended.)

Introduction

As promised at the end of my previous post, I will attempt to describe here my personal synthesis of how I think about decision theory, infinite ethics, Solomonoff induction, and average and total utilitarianism.

Background on moral decisions

I am a moral non-realist - I believe there is no objective morality. Nonetheless, I believe that according to my values there are better and worse ways to make moral decisions. Currently, I feel confused about many things in my value system, and I would not want to keep it that way permanently.

If the world becomes a safer place and I have more time to think during a Long Reflection, I will try to come up with an idealized version of my values, and then use my allocated share of resources to pursue those values. This Long Reflection might consist of doing more thinking and philosophy, talking with some trusted AI advisors or doing intelligence enhancement on myself. It might also include raising children and giving them a share of my resources. And finally, the Long Reflection will likely involve trade with other people[1], which can make moral reflection easier: if I have strong moral intuitions on subject A, and they have strong intuitions on subject B, we can pool our resources and follow my intuitions on A and theirs on B.

But for now I’m just a poor mortal fool, and I still need to make some decisions that are possibly irreversible and very high-stakes. What should I do then? The way I imagine it is that my current thinking and intuitions give an approximation of what my final values might be after the Long Reflection, and I’m trying to do my best to advance my eventual values according to these crude guesses of what they might entail.

In practice, this mostly means working towards increasing optionality for my future self and for people who share similar values, and setting up good processes for reflection. (For example, making sure humanity doesn’t lose control and that we end up having good AI advisors.[2]) And when I sometimes need to make irreversible trade-offs that are not purely oriented around increasing optionality, I’m trying my best to approximate my future values, taking into account the values of other people (my potential future trade partners) too.

OSAC

So far, what I’ve described are quite conventional views. The thing I’m adding in this essay is that I think about which distributions of worlds and moments I want to optimize for[3] in the same way I think about morality.

In my previous essay, I described that whenever I’m deciding whether to bring an umbrella with me, I’m helping some versions of myself in the multiverse and hurting some others. I argued that there was no reality fluid that objectively determined which versions of myself were more real, and it was a personal value judgement which versions I wanted to help more. Furthermore, I argued that I didn’t even accept Scott Garrabrant’s proposal of allocating my subjective caring to worlds in proportion to their mathematical simplicity - that view had some paradoxical consequences and I also found it generally unappealing.

Instead, I propose a tentative framework I call OSAC, standing for Optimization with Subjective Allocation of Care.[4]

(My impression is that there are a bunch of people going around saying “I believe in something vaguely like UDASSA” even when they have serious disagreements with the theory. My hope with giving an acronym-able name is that if some of these people like this post, they can switch to saying “I believe in something vaguely like OSAC”.)

The idea is that I imagine the approximation of my idealized allocation of subjective caring to different worlds and moments as a weighted set of distributions I care about. In each distribution, the utility that can be attained under the distribution is always bounded.

For example, if one distribution is “all space-time points in the universe, with 1% per billion years time discount rate starting from the Big Bang”[5], then I get a full score for this distribution if we manage to fill all the universe with joy. If I originally put a 2% weight on this distribution, then I can fulfill 2% of my overall utility this way.

If I learn that there is no way for me and other beings logically correlated with me to get control over an appreciable fraction of all space-time moments, then I need to give up on this 2% of my utility function, and look for meaning in the other distributions. On the other hand, if I learn that space contains a hundred times as many stars as I previously thought, that doesn’t increase the weight of this distribution to 200%. The value I can get from each distribution is ultimately bounded and capped at its weight.

As I previously discussed for UDASSA, here is the ideal version of my decision process:

I look at all the distributions of worlds I care about, look at all the actions in all these worlds that are logically correlated with my current decision, and I look at all their consequences. I sum up the positive and negative effects, weighted by their measure within the distributions, and the weights of the distributions.
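The bounded, weighted structure described above can be sketched in a few lines. This is a toy illustration under assumptions, not a formalization from the post: each distribution is reduced to a hypothetical `(weight, score)` pair, where `score` is the fraction of the attainable value achieved within that distribution, and the clamp to [0, 1] is one way to express that the value from a distribution is capped at its weight.

```python
def overall_utility(distributions):
    """Sum of weight * score over distributions, with each score capped.

    distributions: iterable of (weight, score) pairs (hypothetical layout).
    Scores are clamped to [0, 1], so no distribution can contribute
    more than its own weight.
    """
    total = 0.0
    for weight, score in distributions:
        capped = min(max(score, 0.0), 1.0)
        total += weight * capped
    return total
```

For instance, a distribution with weight 0.02 contributes 0.02 at a full score, and still contributes 0.02 if the universe turns out to hold a hundred times more stars than expected.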

For example, if the consequences of an action lead to filling 0.003 instead of 0.002 fraction of space with joy 13 billion years after the Big Bang, that increases the score of the above-described distribution by an amount proportional to the extra 0.001 fraction, discounted for the 13 billion years elapsed since the Big Bang. And given that we assumed that this distribution has a 2% weight, it increases my overall utility by 2% of that amount.

Then the decision has other effects in other distributions, and we can sum over the weighted effects. I want to choose the decision that results in the highest sum.
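As a minimal sketch of this additive picture: each candidate decision is scored by summing its effect inside every distribution, multiplied by that distribution's weight. The distribution names, weights and per-distribution effects below are hypothetical placeholders, not numbers from this essay.

```python
# Hypothetical weights on the distributions I care about (they sum to 1).
weights = {"simple_physics": 0.02, "dramatic": 0.38, "best_worlds": 0.60}

# Hypothetical effect of each decision on the utility inside each distribution.
effects = {
    "decision_a": {"simple_physics": 0.5, "dramatic": 0.0, "best_worlds": 0.01},
    "decision_b": {"simple_physics": 0.0, "dramatic": 0.02, "best_worlds": 0.02},
}


def overall_utility_change(decision):
    """Sum over distributions of (weight of distribution) * (effect of the decision in it)."""
    return sum(weights[name] * effects[decision][name] for name in weights)


# Pick the decision with the highest weighted sum.
best_decision = max(effects, key=overall_utility_change)
print(best_decision)
```

A decision with a large effect in a low-weight distribution can still lose to one with small effects spread over high-weight distributions, because everything is additive.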

(Originally, I imagined the different distributions being represented by different people in my moral parliament and negotiating with each other and winning and losing negotiation chips in bets with each other over things I observe. However, I couldn’t really find any thought experiments where the moral parliament picture would have added value over the simpler picture I’m using now, where the different distributions are just branches of my utility function with different weights, and ultimately everything is additive. But I’m still fond of the moral parliament picture, so I’m interested if people can point to examples where it adds value over the simpler picture.)

In practice, during my mortal life, most of the time this just means working on increasing optionality for myself and people similar to me (again, a few examples are making sure humanity doesn’t lose control and that we end up having good AI advisors). The discussion in my previous essay of how to think about all the logical correlations as a mortal applies here too.

I hope to punt most of the big decisions to the Long Reflection[6]. It will be important to eventually figure out the exact shape of my prior - how much I care about which worlds and moments - but this doesn’t feel very different to me from figuring out my moral values, which I’m largely punting to the future anyway. For now, when I need to do something not purely for increasing optionality, I need to rely on approximations of how I imagine the eventual assembly of distributions I will converge on caring about after the Long Reflection.

For the rest of this essay, I will write about my current best guess of this approximation, and what its implications are.

Loving the mystical

At the end of my last essay, I introduced Scott Garrabrant’s description of loving worlds in proportion to the simplicity of their underlying mathematical laws, thus getting back basically the equivalent of Solomonoff induction. I expressed my disagreement:

Yes, I care a little bit about mathematical simplicity - I’m a mathematician by training, and I find simplicity aesthetically compelling. But I don’t feel like mathematical simplicity is very unique among the things I care about.

Instead of saying that I care about the goodness of the worlds weighted by how simple mathematical laws describe them, I could choose totally different weightings. I could rank possible universes (and moments within them) by how dramatic they are.[7] Then I could give ½ weight to the most dramatic universe, ¼ to the second most dramatic, and so on, the weights adding up to 1. And I could say that I try to maximize the goodness of worlds weighted by these dramaticness-weights. To me, using dramaticness sounds approximately as compelling as using mathematical simplicity for the weighting.

I stand by this claim. I think that the distributions that are well-described as some version of “care about worlds in proportion to their mathematical simplicity” only have like 0.1% weight in my overall caring.[8]

Otherwise, I care about all sorts of diverse distributions, based on all sorts of human concepts. I call these non-mathematical-simplicity related distributions “mystical distributions”, for lack of a better name, but I don’t want to dismiss them. After all, I really do believe that most of my caring is concentrated on them.

I care about the distribution where the highest weights go to the most dramatic moments that have the most dramatic causal history, going back to the most dramatic way a world could be created. I care about the distribution where worlds and moments are weighted in proportion to their goodness itself: I care about making the best of all possible moments in the best of all possible worlds even better. I care about the distribution where weights are inversely proportional to goodness: I care about making the worst of all possible moments a little less bad. I care about bringing more goodness to the worlds where truth is the most knowable, more understanding of truth to the most beautiful worlds, and more beauty to the best of worlds. There are so many possible weightings of worlds to care about, not just mathematical simplicity.

All of this mysticism might sound a little weird (and don’t worry, after some updates we will largely get back to normality), but I stand by the prior being mystical and weird.

It feels more serious to talk about loving the mathematically simple worlds more, like Scott Garrabrant does, but I don’t believe it adds much real value if the initial distribution of caring is formal and mathematical.

Once I decided that I didn’t believe in the reality of the reality fluid, the allocation of caring is just part of my moral philosophy, and the rest of my moral philosophy is not mathematically formal either.

I also think that all formally defined distributions will be vulnerable to trickery like the obelisk-race or world-summoning by writing, and if I want to avoid that, I will need to allow some judgement calls.

And fixing a mathematical prior feels fishy in other ways too.

One day you hear a voice from a burning bush: "I am the Lord, thy God, and I'm telling you that this is the most just and merciful world. The world is relatively ordered, because I believe some order is necessary for justice, but the world is by no means only governed by simple physical laws, it's full of spirits and miracles. Also, here is a great explanation to the Problem of Evil, which shows that everything happening in this world is actually most just and merciful."

Would you just walk away and say "Sorry, I already picked my prior to only care about mathematically simple worlds"?

I think picking your prior to only care about mathematically simple worlds is cheating. You didn't come up with this prior floating in the void: it became appealing because you have already seen physics experiments in the world and internalized the Occam's razor intuition. That is, you already updated on things. But then I think the correct framing is not to start from a purely mathematical prior, but to start from a broader prior of things that sound appealing, and then update based on the world looking physically simple.

I’m describing this updating process in the next section.

Updates on the mystical and the mathematical

How much should caring about all these mystical distributions influence my everyday behavior? I think only a little: I have observed many updates that suggest I don’t have that much influence on these distributions.

The portion of me that cares about the goodness of worlds weighted by the worlds’ dramaticness primarily cares about making good decisions if I observe an incredibly dramatic world around me. That doesn’t really match my observations. If I had been making predictions based on the assumption that the most dramatic outcome is the most likely, I would very often have been wrong.

Meanwhile, the assumption that we live in a materialistic universe governed by a few simple physical laws has a great track record. In the most dramatic possible worlds, there is no reason why every object should fall with the same gravitational acceleration. Why shouldn’t there be dragons who magically defy gravity? Dragons are pretty dramatic.

So the portion of me that cares about the most dramatic worlds needs to fall back to explanations like:

It sounds counter-intuitive, but maybe in the fullness of time we will realize that the world seeming to be explainable by a few simple rules is needed for full dramaticness. Or maybe there is a deity in the most dramatic possible world, who believes that creating people living in a seemingly mathematically explainable but not obviously dramatic world is the best way to raise His inheritors. He will pull us out to the most dramatic possible world and let us help shape it towards goodness.

But why would it be seemingly simple physical rules in particular that we will somehow realize to be crucial for the world to be dramatic? And why assume the deity in the dramatic world would decide to create people in a seemingly mathematically simple one? Why not in worlds optimized for, let’s say, beauty instead of mathematical simplicity? There are many options that feel like they deserve at least as much weight as simplicity.

So if I gave 0.1% weight to myself terminally caring about mathematically simple worlds, then I think for any particular X, I should only give 0.1% weight to the assumption that the best way to influence the worlds with X is somehow mysteriously through getting influence when it looks like the world around me runs on mathematically simple rules. Actually, I think even less than 0.1%, because “the best way to influence X is through a seemingly unrelated Y” is inherently a somewhat unnatural story, so let’s give these theories a 1/4 penalty.

This means that for creatures whose observations seem plausibly explainable with simple physical rules, about 0.1% of their influence on my utility function comes from directly influencing the 0.1% of my caring terminally focused on mathematically simple worlds. Meanwhile, about $99.9\% \cdot 0.1\% \cdot \frac{1}{4} \approx 0.025\%$ of their influence comes from influencing the other 99.9%, the mystical distributions, when the best way to influence them is by taking actions in a seemingly mathematically simple universe.

So once I made the update on seeing the world being seemingly simple, only about 20% of my influence comes from influencing the various mystical distributions.
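The arithmetic behind the 20% figure can be sketched as follows. This is one reading of the numbers given above (0.1% terminal caring, 0.1% weight on the indirect route, a 1/4 penalty), and the variable names are made up for illustration.

```python
direct_weight = 0.001      # 0.1% of caring is terminally about mathematically simple worlds
mystical_weight = 0.999    # the remaining 99.9% of caring goes to mystical distributions
route_weight = 0.001       # 0.1% weight that a given X is best influenced via a simple-looking world
unnatural_penalty = 1 / 4  # penalty for "influence X through a seemingly unrelated Y" stories

# Influence that reaches each part of the utility function from a simple-looking world.
influence_direct = direct_weight
influence_mystical = mystical_weight * route_weight * unnatural_penalty

# Share of the total influence that goes through the mystical distributions.
share_mystical = influence_mystical / (influence_direct + influence_mystical)
print(f"{share_mystical:.1%}")
```

The indirect route contributes roughly 0.025% against 0.1% for the direct route, so the mystical distributions end up with about one fifth of the total.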

Then I take into account that I happen to live in a time that seems especially crucial for the world. From the perspective of the 0.1% of my values that terminally care about universes that look like ours, it seems relatively clear that I’m in a much more leveraged position to make the world better than either a 13th century peasant or a digital mind living in the Andromeda galaxy a trillion years after the Singularity.

But when I assume that I terminally care about dramatic worlds, but also assume that a deity in the dramatic world will pull me out of this world if He likes me, then it’s unclear if I’m in a better position than the peasant or the mind in Andromeda.

Still, it seems plausible that many stories on influencing the mystical distributions route through getting a lot of influence in seemingly mathematically simple worlds in a scope-sensitive way, so I think that affecting the mystical distributions still holds 10% of my influence after this update.

(Ethical theories that are not scope-sensitive get massively downweighted though: among the people in all sorts of possible worlds and times and situations who are thinking about decisions in terms similar to those I’m describing here[9], I think I’m unusually well-placed to do scope-sensitively important things, so I should focus on that, while they can to some extent focus on other things.)

The next thing I take into account is cluelessness about how to do good according to various parts of my utility function. If I assume that I want to influence mathematically simple worlds, and can therefore assume that the world is as it seems, then at least I have some uncertain ideas about what to do.

But once I’m operating under theories like “maybe a deity in the most dramatic possible world has created a mathematically simple world for some inscrutable reason” as one of the thousands of equally weighted stories, it is just really incredibly hard to know what is good. Therefore, my decisions have much less expected influence on the mystical distributions. This means that when I’m making a decision, I think overall only 1% of the influence comes from affecting the mystical distributions.

Obviously, all these numbers are kind of made up, but the conclusion feels roughly right to me: I should mostly work on making the mathematically simple worlds better, but I should have a not entirely negligible weight on affecting other distributions.[10]

What to do for the mystical distributions?

If I believe that about 1% of my influence on my utility function comes from making the various mystical distributions better, that means I should spend at least a bit of effort optimizing for them, doing the most leveraged things and picking the lowest-hanging fruits for making things better for them.

However, I have a huge uncertainty over what to do to make things better under all these strange, competing distributions, which all assume that the world is not quite what it looks like.

The only thing I can think of for making things generally better is acquiring more virtue. By virtue I mean becoming the kind of person who makes good decisions under very uncertain, very new circumstances. Being the kind of person who can do the reflection well, and who can be trusted with power under unforeseen conditions, like being given control of a slice of the most dramatic of all possible worlds after being woken up in the afterlife by a strange deity.

This is in large part applicable even for making the distribution of mathematically simple universes better. Even assuming that the world is roughly as it seems, consequences are hard to predict, and a good way to work towards robustly doing good things is by cultivating virtues. And I think even under a more conventional world-view, a non-negligible fraction of my future influence comes from surviving through the singularity into a world I barely understand, which I will need more wisdom to navigate.

But still, the parts of me optimizing for the mystical distributions care more about virtue than the ones optimizing for the mathematical distribution. Under the mathematical simplicity assumptions, I can be more confident about the world's structure and prioritize consequentialist actions more: find the big levers today that influence tomorrow. For helping the mystical distributions, this route is not really available, and the only thing that remains is cultivating wisdom and virtue.

What kind of wisdom, and what kind of virtue? I don’t really know, but all the classical answers seem good.

Developing good epistemic practices—wading through unfamiliar and confusing fields, developing an independent sense of truth. Bravery, kindness, honesty, fairness, friendliness, moderation. Faith, hope, and love. Keeping relationships alive, staying human. Avoiding heinous acts that would corrupt my character and lead to rationalizing that evil is actually good. Being the kind of person who, given power in a strange new world, would use it well rather than being destroyed by pride and self-deception. Being someone who would choose Heaven over Hell if they were a character in The Great Divorce.[11] 

But also: utilitarianism and scope sensitivity. I have a bad feeling that people who talk too much about virtue ethics often end up not being very scope-sensitive. I think that’s a mistake on their part. Scope sensitivity - actually finding big levers and pushing them - is one of the highest virtues and pretty undersupplied in the world.

And many scope-sensitive actions can be directly valuable to the mystical distributions too. If I believe that increasing my own wisdom and virtue is important, then doing so for other people is likely to be important too. I’m less convinced of consequentialist plans, or even that other people really exist, under the mystical distributions, but I still believe in it to a significant extent.

Raise the sanity waterline. Help the poor, because healthier people are often better people. Get good AI advisors for everyone. Shape our AIs themselves to be the kind of virtuous and benevolent beings that can make decisions under surprising revelations. All the usual good things.

Overall, I think that optimizing for the mystical distributions leads to actions very similar to those I would take when optimizing for making the mathematically simple worlds better.[12] And in any case, my current guess is that only 1% of my influence comes from affecting the mystical distributions.

Still, I should maybe call my grandparents a bit more often than would be strictly optimal if I were purely focusing on scope-sensitive utility maximization in a materialistic universe.

A note on existing religions

You can skip this section if your absurdity heuristic is strong enough that you are not tempted to consider following existing religions. I was tempted enough to at least think it through. (The conclusion is that I’m not converting to religion but I think it’s not entirely absurd that it’s the right choice for some people with certain values.) I wrote down my thoughts here, but it might not be interesting for everyone, so I put it in a collapsible section.

I give about 1 in 10,000 weight to an existing religion being true

Once we entertain mysticism - "maybe I care about making the best possible world better" and "maybe the world's apparent materialism is intentional misdirection by a powerful entity" - there are echoes of traditional religion.

One of the main reasons I don’t expect to have much influence over the mystical distributions is cluelessness. But existing religions potentially reduce that cluelessness: there are prophets, holy books, specific instructions. So what fraction of my influence comes from influencing worlds where some existing religions are true?

Pretty low, I'd say. Several reasons:

First, there is the Problem of Evil. In my ontology, that’s just the same update that brought down the influence of all mystical distributions to 20%: I need to assume that God for some reason really cared about the world being seemingly governed by simple physics, which inevitably includes some hurricanes. The weight I put on an all-loving God really caring about mathematically simple structures is the same order of magnitude as the weight that I put on caring about mathematical simplicity myself; though I think it’s somewhat lower for God. So we are at 20%.

Second, the theory is additionally stretched. Beyond the normal two steps of reasoning required to influence mystical distributions ("there are some particular types of worlds I care about" and "for unknown reasons, this seemingly materialistic world is a good place to gain influence over those particular types of worlds"), there's the extra assumption that God decided on a specific half-measure - making the world overall roughly materialistic-looking and not fully revealing Himself, but also still sending some prophets to let His will be known. This is a very particular story for how I can influence mystical distributions I care about - I feel this is a 100x haircut, and now I’m down to 0.2%.

Third, I'm in a position of unusual influence over the long-term future, and most existing religions are not very scope-sensitive about things like conquering the galaxies. This means that the relative importance of my actions is much lower under most religious world-views than under the standard materialistic world-view, which is maybe a 5x haircut, so now religions are down to 0.04%.

Fourth, the religions I've encountered, while having many good features, miss many of the things I care the most about. I think this is only a 3x haircut though.

There is also the consideration that on the one hand, I feel that in the religious universes it’s more likely I will be satisfied with morality having a good resolution, while in the atheistic universes, it’s more likely that I will ultimately find everything meaningless. On the other hand, for exactly the same reason (there is already a clean morality which God knows), in the religious universes it’s more likely that God leads everything to the right outcome anyway, and my influence doesn’t matter. I would say it’s a wash, and I don’t make an update here in either direction.

So overall, I think only about 1 part in 10,000 of my influence over my utility function comes from acting as if one of the existing religions were true. This is not very high in absolute terms, and doesn’t really affect my behavior.
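The chain of haircuts above can be checked with a few lines of arithmetic. This is a sketch using only the factors stated in the four reasons; the variable names are made up for illustration.

```python
share = 0.20  # first reason: mystical distributions after the "world looks simple" update
share /= 100  # second reason: the very particular half-revelation story (100x haircut)
share /= 5    # third reason: most religions are not very scope-sensitive (5x haircut)
share /= 3    # fourth reason: religions miss many of the things I care about (3x haircut)

one_in = 1 / share
print(f"about 1 in {one_in:,.0f}")
```

The product comes out at roughly 1 in 7,500, which is the same order of magnitude as the rounded figure of 1 in 10,000.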

If someone asks me for the probability that Jesus rose from the dead, I need to answer that this is not the type of question where I feel comfortable using the abstraction of probabilities (see the discussion on the probability of Jesus’ resurrection in a previous post), but if I’m pressed, I will answer 1 in 20,000 as the least misleading answer, which I think is higher than what most people on LessWrong would say.

(Importantly, it’s not really a probability though. For example, one can’t do Pascal’s Wager and say that the rewards under the 1 in 20,000 religious worlds are infinite, while the outcomes in the materialistic worlds are finite. That’s why it’s important that the utilities are bounded under OSAC: a distribution with 1 in 20,000 weight can’t just outbid everything else by claiming to be infinite. It’s 1 in 20,000 of the overall influence, and that’s how much it gets.)

Interestingly, it’s really hard for new evidence to change this 1 in 20,000 number. If I read about the Fatima sun miracle and find it unexpectedly hard to find a naturalistic explanation, that barely moves the needle.

The hypothesis already posits that God occasionally performs some miracles, just enough to maintain faith but making sure He never makes His existence obvious to the whole world. I don’t understand why God is doing that, and I don’t know what the optimal level of miracle-making is from God’s perspective. I have some distribution over how weird I expect the weirdest-looking events to be in a materialistic universe, and how weird I expect them to be in a universe where God wants to give some signs but doesn’t want to fully reveal Himself. The distribution of expected weirdness, assuming God’s existence, is somewhat shifted to the right compared to the distribution I’d expect in a materialistic universe, but not a lot - after all, He doesn’t want to make things obvious. If I learn about an event that is somewhat more miraculous-seeming than what I have so far heard about, that gives some extra evidence to theism, but not very much.

So far, the evidence from miracles hasn’t moved me to make a significant update (I’m not even convinced that our world looks more miraculous than the median materialistic world), and I don’t plan to spend too much time looking at evidence of miracles, given that I don’t expect them to produce much evidence for the above-mentioned reasons.

Overall, 1 in 10,000 is a pretty small effect, so religions don’t influence my life very much, other than making me murmur an occasional prayer. But I can see people putting higher weight on religion due to genuine value differences, and I think we should be cautious about calling that inherently crazy.


Loving mathematics

After making all these updates, I now believe that 99% of my influence over my utility function comes from affecting the distributions that weigh worlds and moments based on “naturalistic” considerations: how simple the laws of physics are, how far in time a moment is from the Big Bang, and things like whether there is intelligent life at all in that universe. The boundary between naturalistic and mystical distributions is not very clear, but the naturalistic distributions are roughly those where no special explanation is required for why living in a seemingly materialistic universe, similar to ours, is a good place to affect these distributions.

The assembly of distributions

There are many different distributions I care about. As a general rule, the broader and more generic the distribution is, the bigger weight it gets, but more specific distributions get some weights too.

The concept is very similar to what I described in the Solomonoff over distributions section in my last essay. There are very broad, simple-to-describe distributions that get a lot of weight. For example: “universes whose laws require N bits to describe get a $2^{-N}$ amount of caring, and I distribute my caring among the moments within each universe according to a distribution generated by a simple program taking a standard normal random variable as an input”.

Then the narrower and more specific distributions get smaller weights. These narrower distributions can specify simple mathematical properties, like “only the universes from the distribution that have a particular symmetry property”, or more human-level conditions like “only the universes in the distribution where intelligent life develops”, or even things like “only universes where the intelligent life never develops nuclear weapons”. The more specific and the weirder the definitions of distributions are, the less I care about them. What counts as weird is a subjective decision determined by me: I maintain this is not any worse than the fact that I already need to determine my morality subjectively. But as a general rule, if a distribution is cut into two smaller distributions, the sum of the weights of the two smaller distributions should be less than the weight of the bigger distribution.

Examples

All of this is probably pretty confusing now, so I think the best way to get across what I mean is walking through a number of philosophical paradoxes and trying to explain how I think OSAC handles them.

Given OSAC’s inherently subjective nature, many of the solutions depend to some extent on personal value judgements, and you might come to different conclusions in some cases due to differing moral intuitions. However, I believe the framework is still useful to get a handle on otherwise very confusing problems.

Boltzmann brains

Here the argument is basically the same as what I presented with UDASSA. If the universe will exist for infinitely long, it’s not possible to put equal measure on all moments in it. So I need to choose some kind of arbitrary distribution of how much I care about the moments within the universe. For example, these distributions can use time-discounting from the Big Bang, or description-lengths of the space-time moments.

Some of these distributions might be very broad and might contain overwhelmingly Boltzmann brains. However, only an astronomically small fraction of Boltzmann brains in these distributions observe ordered experiences. This means that if I’m a mind observing ordered experiences, then whatever I do has only an astronomically small effect on the overall utility according to the distributions that mostly contain Boltzmann brains. Meanwhile, I and beings logically correlated with me have a decent amount of influence on the distributions that put most measure on experience moments that are fairly easy to describe, i.e. ones that get born due to a reasonable causal chain, and not just appear in the heat death soup as Boltzmann brains.

Therefore, even if initially only a small fraction of my caring was allocated to distributions that contained mostly non-Boltzmann brains, once I get some ordered observations, the overwhelming majority of my influence comes from the non-Boltzmann brain distributions, so I act as if I was not a Boltzmann brain.
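A toy calculation shows the shape of this argument. All four numbers below are hypothetical and chosen only for illustration; the essay does not give values for them.

```python
# Hypothetical prior caring weights.
weight_boltzmann_heavy = 0.99  # broad distributions dominated by Boltzmann brains
weight_causal = 0.01           # distributions favouring minds with a reasonable causal history

# Hypothetical fraction of each distribution's measure that sits on minds
# with ordered experiences like mine, i.e. the part my decisions can affect.
ordered_fraction_boltzmann = 1e-30
ordered_fraction_causal = 0.1

# Influence = caring weight * fraction of the distribution my decisions touch.
influence_boltzmann = weight_boltzmann_heavy * ordered_fraction_boltzmann
influence_causal = weight_causal * ordered_fraction_causal

share_causal = influence_causal / (influence_causal + influence_boltzmann)
print(share_causal)
```

Even with 99% of the prior weight on the Boltzmann-heavy distributions, essentially all of the influence of a mind with ordered observations flows through the other 1%.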

Obelisk-race and world-summoning

In my previous essay, I explored two paradoxes of UDASSA where people can take actions to increase the realness of their preferred worlds and moments.

Under my OSAC framework, I will simply say no to these shenanigans. Yes, if you build a bigger obelisk than any other alien civilization, your moments will have a shorter description-length. Very clever, but I don’t care.[13] I don’t need to, right now, mathematically precisely define the distributions I care about. I’m only trying to approximate what I will eventually decide to care about. Even if the eventual distributions I care about will take into account moments being simple to describe (for example to stave off Boltzmann brains), I feel pretty confident that they will have some kind of a “no shenanigans” clause, and building giant obelisks or plastering the equation of a world across the galaxies won’t really change my caring.

Is this solution not mathematically elegant? Maybe, but I don’t think I’m under any obligation to make my moral theories mathematically elegant.

Finetuning and the Presumptuous philosopher

Some physicists theorize that some of the fundamental constants of our universe lie in the only narrow range that allows life to arise. How should we relate to such theories?

One answer can be to accept the argument, no questions asked: it is no surprise at all that the universe we live in has parameters compatible with life; after all, we are alive.

On the other hand, some get uneasy about anthropic arguments for finetuning being used as a curiosity-stopper: every time we don’t understand something, we can say “I don’t know, maybe it’s the only arrangement compatible with life”. That doesn’t sound right.

To understand the problem better, let’s look at an example scenario. One might notice that this is basically the same as Bostrom’s Presumptuous philosopher thought experiment. (For the purpose of these example scenarios, I will fall back to the language of probabilities and anthropics.)

Scientists have two competing theories about the nature of the world, A and B. By default, they deem the two theories equally compelling and would give 50% probability to each. Both theories predict that there are a million equally real worlds, with a fundamental constant ranging from 1 to 1,000,000. Theory A predicts that all the worlds are compatible with life, while theory B predicts that only one world is habitable, but it doesn’t predict which one. Scientists observe that the fundamental constant in our world is 343,551. What is the probability that theory B is true?

The SIA interpretation of anthropics (which is in my opinion the more reasonable interpretation if you need to choose) would say Theory B has only about a 1 in a million probability. This doesn’t sound right to me: it would mean that even if strong evidence arrived and we had very good reason to think that only a small range of parameters is compatible with life, we couldn’t accept that conclusion.

For further intuition, see this scenario:

Scientists have two competing theories about the nature of the world, A and B. By default, they deem the two theories equally compelling and would give 50% probability to each. Both theories predict that there are a million equally real worlds, with a fundamental constant ranging from 1 to 1,000,000. Theory A predicts that all the worlds are compatible with life, while theory B predicts that only one world is habitable where the fundamental constant is 343,551. Later, scientists measure the fundamental constant of our world, and it is 343,551. What is the probability that theory B is true?

SIA would say 50% - there is one habitable world fitting our observations under both Theory A and Theory B. This is quite a strange conclusion given how impressively accurate a prediction Theory B has made.
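As a rough check on the two SIA figures above, here is a minimal sketch. It assumes the usual reading of SIA in which each theory is weighted by its prior times the expected number of habitable worlds matching our observation. The function name `sia_posterior_b` and the variable names are hypothetical and used only for illustration.

```python
from fractions import Fraction


def sia_posterior_b(prior_b, expected_matches_a, expected_matches_b):
    """Posterior for theory B when each theory is weighted by its prior
    times the expected number of habitable worlds matching our observation."""
    weight_a = (1 - prior_b) * expected_matches_a
    weight_b = prior_b * expected_matches_b
    return weight_b / (weight_a + weight_b)


N = 1_000_000              # number of worlds posited by both theories
prior_b = Fraction(1, 2)   # scientists find A and B equally compelling

# Scenario 1: B does not say which world is habitable, so the chance that
# its single habitable world carries constant 343,551 is 1/N.
scenario_1 = sia_posterior_b(prior_b, 1, Fraction(1, N))

# Scenario 2: B names world 343,551 in advance, so exactly one matching
# habitable world exists under either theory.
scenario_2 = sia_posterior_b(prior_b, 1, 1)

print(float(scenario_1))  # roughly one in a million
print(scenario_2)         # 1/2
```

Under these assumptions, a correct one-in-a-million prediction by B leaves the SIA posterior at the 50% prior, which is the strange conclusion the text points at.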

How does OSAC deal with these scenarios?

The broadest distribution, containing all worlds in A and B, gets weight p.

But there are the narrower distributions too: the distribution where A is true, and the distribution where B is true. In the scenario where A and B sound equally compelling to scientists, they get equal weight. According to the general principle I described in The assembly of distributions, the sum of the weights of the two narrower distributions should be smaller than that of the broad distribution.

Depending on how fundamental the difference is between theory A and theory B, I’m more or less sympathetic to them getting their own distributions. If the difference is something as mundane as a coin landing on heads or tails, I don’t believe they should get their own distributions at all, and we should just optimize under the broad distribution. But in this case, the difference is in some kind of fundamental physical or even logical laws (is it possible for intelligent life to evolve under many different fundamental constants?), so I’m sympathetic to them getting their own distributions. Let’s say both A and B distributions get a weight of 0.1p each.

So far, this didn’t make a difference in the paradox. Life is rare in distribution B, so I and logically correlated beings can barely make a difference to the goodness of distribution B: most of it will be empty anyway. So even with the introduction of these new distributions, I will strongly favor betting on A being true.

However, I also give significant weights to the distributions of only those worlds that can contain intelligent life. Let’s say I give half as much weight as to the broader distributions.

(Why smaller weight than the broader distribution? Because I feel it’s smaller and less elegant. Why not much smaller weight, given that “having intelligent life” takes a lot of description-length to describe? Because I’m not committed to purely weighting distributions by mathematical description-length, and “having intelligent life” feels like a pretty natural condition to me for restricting my caring.)

Now we have six distributions: (1) weight p to the distribution of everything in A and B; (2) weight 0.1p to the distribution of everything in A; (3) weight 0.1p to the distribution of everything in B; (4) weight 0.5p to the distribution of every world with intelligent life in A and B; (5) weight 0.05p to the distribution of every world containing intelligent life in A; (6) weight 0.05p to the distribution of every world containing intelligent life in B.

Putting it in simpler terms: I care to some extent about making the average life better in the worlds where theory B is true. This is a somewhat specific form of caring, so it doesn’t get that much weight[14], but I still care about it to a non-negligible extent. This wouldn’t have been allowed by simpler SIA interpretations, where the average welfare of the living worlds in B is overwhelmed by theory A positing more living worlds.

Now, let’s look at what conclusion this model gives in the first scenario. I don’t believe in probabilities, so I will translate “what probability theory B has” to “what betting odds I would bet on theory B”, which is equivalent to “what fraction of my influence on my utility function comes from worlds where theory B is true.”

Of the six distributions, I can affect half of (1); all of (2); almost none of (3); and all of (4), (5) and (6). So the overall weight I can affect is 0.5p + 0.1p + 0.5p + 0.05p + 0.05p = 1.2p.

Of this, the 0.05p weight of distribution (6) is the only significant part of my utility function where B is true. So the relative weight of B is 0.05p / 1.2p ≈ 4.2%, so I will bet as if I thought theory B had a 4.2% probability.

Alternatively, it’s possible I already have reason to believe that independently of whether theory A or B is true, it’s impossible to affect a non-negligible portion of all possible worlds, and so all my influence comes from affecting the distributions where it was assumed that intelligent life exists. I think this is quite likely. In this case, the overall weight I can affect is just 0.5p + 0.05p + 0.05p = 0.6p, and B is true for 0.05p of this, so the relative weight of B is 0.05p / 0.6p ≈ 8.3%, so I would bet as if B has 8.3% probability.
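The arithmetic for the first scenario can be reproduced with a short sketch. The weights below, in units of p, are assumptions: they are the values that reproduce the 0.05p, 4.2% and 8.3% figures in the text (broad distribution p, each theory 0.1p, and the life-only versions at half the weight of their broader counterparts). The names `WEIGHTS`, `INFLUENCE_ALL`, `INFLUENCE_LIFE_ONLY` and `betting_weight_on_b` are hypothetical.

```python
from fractions import Fraction

# Assumed weights of the six distributions, in units of p.
WEIGHTS = {
    1: Fraction(1),      # everything in A and B
    2: Fraction(1, 10),  # everything in A
    3: Fraction(1, 10),  # everything in B
    4: Fraction(1, 2),   # worlds with intelligent life in A and B
    5: Fraction(1, 20),  # worlds with intelligent life in A
    6: Fraction(1, 20),  # worlds with intelligent life in B
}

# Fraction of each distribution I can affect: half of (1), all of (2),
# almost none of (3), all of (4), (5) and (6).
INFLUENCE_ALL = {1: Fraction(1, 2), 2: 1, 3: 0, 4: 1, 5: 1, 6: 1}

# Variant: no meaningful influence on the distributions over all worlds.
INFLUENCE_LIFE_ONLY = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}


def betting_weight_on_b(weights, influence, b_distributions=(6,)):
    """Share of the affectable weight that sits in distributions where B holds."""
    affectable = {k: weights[k] * influence[k] for k in weights}
    total = sum(affectable.values())
    return sum(affectable[k] for k in b_distributions) / total


odds_all = betting_weight_on_b(WEIGHTS, INFLUENCE_ALL)
odds_life_only = betting_weight_on_b(WEIGHTS, INFLUENCE_LIFE_ONLY)

print(float(odds_all))        # about 0.042
print(float(odds_life_only))  # about 0.083
```

Only distribution (6) is counted as a place where B is true, because B contributes a single habitable world out of roughly a million in distribution (4), which is negligible.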

The numbers are of course made up, but the conclusion sounds about right to me: Theories positing that only a small fraction of the worlds are habitable and that our world's parameters just happen to be in the habitable range should get some extra burden of proof (the probability went down to 8.3% from the original 50%), but they shouldn’t be overwhelmingly penalized.

To quickly look at the second scenario: There, once I learn that the fundamental constant is 343,551, I and logically correlated beings only have 1 in a million influence over distributions (1) to (5)[15], while I have full influence over (6). So almost all my influence comes from worlds where theory B is true, so I will bet on B as if it was almost certain to be true.

This again sounds intuitively right to me, given that B got a 1 in a million prediction right.
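A similar sketch for the second scenario, under the same assumed weights (in units of p, reconstructed from the figures in the text and not stated as such by the author). It takes the "1 in a million influence over distributions (1) to (5)" literally as a multiplier on each of those weights. All names are hypothetical.

```python
from fractions import Fraction

# Assumed weights of the six distributions, in units of p.
weights = {
    1: Fraction(1), 2: Fraction(1, 10), 3: Fraction(1, 10),
    4: Fraction(1, 2), 5: Fraction(1, 20), 6: Fraction(1, 20),
}

# After observing constant 343,551: one-in-a-million influence over
# distributions (1) to (5), full influence over distribution (6).
tiny = Fraction(1, 1_000_000)
influence = {1: tiny, 2: tiny, 3: tiny, 4: tiny, 5: tiny, 6: 1}

affectable = {k: weights[k] * influence[k] for k in weights}
share_b = affectable[6] / sum(affectable.values())

print(float(share_b))  # very close to 1
```

With these numbers nearly all affectable weight sits in distribution (6), which matches the conclusion of betting on B as if it were almost certain.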

Nuclear war and anthropics

People sometimes say that the fact that we haven’t had a nuclear war yet is not good evidence for the rarity of nuclear wars: if there was a nuclear war, most of us would be dead, so it’s not surprising on anthropic grounds that we haven’t seen nukes flying yet.

I think this is the wrong way to think about things. We don’t even need the full OSAC framework to see that, just the principle espoused earlier in the sequence that instead of probabilities, we should think in terms of our influence over the world.

Nuclear war is very unlikely to fully wipe out civilization: it looks likely we would still recover even after repeated nuclear wars and eventually build AGI, and then we or the AIs would still conquer the stars.

So even if the population might temporarily decrease in the worlds that suffer nuclear wars, the overall weight of importance of decisions on both kinds of worlds is the same: both are eventually determining the fate of the galaxies. Therefore, when we try to make good decisions to scope-sensitively influence the future in a positive direction, it is not justified to use anthropic updates with regard to nuclear wars.

So the fact that we haven’t had a nuclear war in 80 years is in fact pretty good evidence that nuclear wars are rare.

LHC and anthropics

Okay, but what about threats that can actually wipe out humanity?

If I understand the story correctly, there was an interesting discussion on this in the LessWrong community around 2008. At the time, there was a fringe belief that turning on the Large Hadron Collider might destroy the world. When the time came to turn on the LHC, twice in a row some technical errors occurred that delayed the start. After the second time, some people started murmuring: what if the LHC would in fact destroy the world, and the only reason we have seen the two errors is due to the anthropic principle? What if in most quantum branches everyone is dead, and only our branch is alive where the technical errors occurred?

Later, the LHC was turned on successfully, and the world didn’t get destroyed. But it’s still an interesting puzzle to think through whether people were justified in updating towards the LHC’s dangerousness after the second error.

The question is similar to the one posed in the Finetuning and the Presumptuous philosopher section. Let’s describe the dilemma in similar terms to what I used there.

Scientists have two competing theories of the world, A and B. A posits that turning on the LHC is fine, B posits that it would destroy the world. Both A and B posit that when you try to turn on the LHC, the world splits into a hundred equally weighted quantum branches (like it always does at every moment). Both theories posit that the LHC will turn on in 99 branches, but that in branch 53 there will be an error. According to theory A, humans will continue to be alive in all branches. According to theory B, humans only survive in branch 53, where an error occurs. Scientists turn on the LHC, and observe that an error occurred, so we are in branch 53. How much should they update in favor of theory B?

This is superficially very similar to the second scenario in Finetuning and the Presumptuous philosopher where theory B predicted that only the world with constant 343,551 can support life, and then later we in fact observed the constant to be 343,551. Should we make a similar update?

I would argue: not really. The previous argument relied on the assertion that maximizing the average welfare of living beings in worlds where theory B is true is a worthy goal that deserves some weight. This is true here too. However, most successful living futures in theory B worlds are not in the quantum branches where we turned on the LHC but ran into an unlikely error. They are mostly in quantum branches where people were smart enough to figure out that theory B is true, and never built the LHC.

Therefore, the branches where LHC runs into an error only have about 1% weight both in the theory A world and in the “theory B but someone is alive” worlds, so the error is not a significant update in favor of the LHC destroying the world.

If we keep turning on the LHC and it keeps failing, and the probability of errors gets to one in a billion or so, maybe it’s time to think again.[16] I think that “maximizing the average welfare in worlds where theory B is true but humans build LHC” is also a somewhat worthy goal that should get a non-zero weight. However, I feel the weight should be very low: it’s a very specific and kind of unnatural distribution, and I’m also averse to putting too much weight on distributions defined by human activity, because I want to avoid the Obelisk-race problem. Still, I would give more than zero weight to this distribution, and after enough inexplicable LHC failures, my influence on all the other distributions would go down enough that this peculiar distribution would represent most of my remaining influence. In that case, I would act as if we knew theory B was likely true, and would want the LHC destroyed.

I don’t think that the real errors that caused the delays in the LHC were anywhere near unlikely enough for this to be a significant effect.

Average utilitarianism

OSAC kind of has an average utilitarianism vibe. I take various distributions, and I try to maximize the average goodness within each distribution, then sum these up with the weights of the distributions.

However, I think the recommendations are very different from how average utilitarianism is often interpreted.

When I talked with people identifying as average utilitarians, they were generally in favor of turning Earth into a lush garden of a few million people living very happy lives in a utopian community, and leaving space alone.

OSAC is pretty strongly opposed to that. If we don’t spread to the stars, we remain an insignificant part of any broader distribution. In the other quantum branches, people with different values will not laze around on Earth. The velociraptor-civilization will spread to the stars! The Thousand-year Reich will spread to the stars! The AIs who take over will spread to the stars! If we want a non-negligible fraction of the quantum multiverse’s measure to be filled with goodness according to our values, we need to keep up with the velociraptors and spread to the stars ourselves.

Staying in a garden on Earth also makes us have lower scores in the broader distributions of “all possible universes, independently of whether they contain life” and “universes that contain intelligent life, but averaged over all space-time moments in their distributions and not just averaging over experience moments”.

To be clear, I also care about the distribution of only the worlds where humanity didn’t leave Earth, and about distributions that give disproportionate weight to moments on Earth.[17] So altogether I give a not entirely negligible weight to the welfare of Earthly life: if I had to choose between life on Earth surviving but never expanding, or humanity conquering the universe with a one in a million chance, I would choose survival on Earth.

But altogether I don’t give that much weight to these distributions focused on Earth, so I still want us to spread to the stars and fill them with glory and joy.

Negative utilitarianism

Strict negative utilitarians don’t care about joy, they just want to minimize the overall amount of suffering. Naively, the ideal outcome for them would be for life to die out and for our universe to remain empty.

There are already some known counter-arguments against this: maybe if humanity survives, we can buy off some aliens or beings in other universes through acausal trade to cause less suffering. But OSAC offers a different argument for staying alive.

The argument is the same as in the Average utilitarianism section. If we wipe ourselves out and leave our universe life-less, while the velociraptors fill their universe with suffering in the branches that they rule, that results in a very bad score for the distribution of all branches with morally relevant life. If we also conquered the stars, and filled them with joy or at least mediocre life, that would make the average in the distribution much better.

(As a different phrasing: If you believe that there is such a thing as reality fluid, or at least there is some non-realist equivalent of it like in OSAC, you can try to siphon away reality fluid from the moments of suffering.)

This is the weirdest conclusion of OSAC so far, and the one I’m least comfortable with. Filling our own stars with people doesn’t help the victims in the velociraptor-dimension, so why would it be good from a negative utilitarian perspective?

I think people could reasonably argue that even if they accept something like the OSAC framework, they don’t accept the step where worlds that have life have a distribution of their own in which it’s important to get a good average. They can say that they only care about having good results in the physically defined distributions, like the distribution of everything in the quantum multiverse, independently of whether or not it has life. Then, humanity filling the galaxies doesn’t change the weight of the suffering in the velociraptor-dimension.

I think that might be a tenable position, and I in fact give a lot of weight to physically defined distributions. But my intuition says that I should give some weight to averaging over living beings too. And given that I expect that most space-time moments in the physical universes will never get filled with anything morally important, I expect that our influence will be small on these purely physical distributions, moving from filling 0% of the space to 0.00001%.[18] So I tentatively think that most of my influence on my utility function will come from averaging over experiences within certain living distributions.

Giving weight to distributions that average only over living worlds also produces saner results in Finetuning and the Presumptuous philosopher: in particular, it allows an update towards theory B when it gets a one in a million prediction right.

So my current position is that it’s good from a negative utilitarian perspective to spread to the stars if we believe we will improve the average, and will cause less suffering in expectation than the average civilization spreading to the stars in our quantum multiverse.[19] 

(I’m unfortunately very uncertain though how we compare to the average civilization.)

Why are we so early in the universe?

This section is less important for illustrating OSAC than the other sections. I also didn’t do the full math, which makes the section a bit rambly. So I put it in collapsible mode, but I still think it’s interesting, and I would be excited to see someone redoing the calculations in the Grabby aliens paper with these assumptions.


Generalizing the Grabby aliens argument

We are surprisingly early in the history of the universe compared to the lifespan of stars. If I understand Hanson’s Grabby aliens paper correctly, it estimates that according to various calculations, we should expect that 99% of intelligent life emerges after us.

Being in the first 1% is not very shocking, 1% probability events happen all the time. Still, it’s somewhat surprising and worth investigating.

I have a distribution of weights on different possible universes with different underlying laws. One thing that falls out of the laws of each universe is what fraction of planets in any given year give birth to intelligent life. I don’t know this distribution, but I can make some estimates based on the available science.

What kind of anthropic updates should I make, starting from this initial distribution?

The distribution has a section that posits that intelligent life emerges rarely enough that it should happen in expectation less than once in the history of our reachable universe. This exactly corresponds to the situation described in Finetuning and the Presumptuous philosopher. Following the logic there, I want to make some updates against theories that say that most universes are devoid of life, but I don’t want to make a very drastic update, because it’s still important for me to influence the sub-distribution of worlds where the laws of nature imply rare life but where the world still has life.

So maybe altogether I make a 3x update against the part of the distribution that implies that the reachable universe should have less than one intelligent species in expectation, and now I have about ⅙ weight on such worlds. In those worlds, there is no anthropic update on when I would expect us to be in the history of the universe.
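The "3x update" above can be read as dividing the odds of the rare-life hypothesis by three. A minimal sketch of that reading follows. The prior of 3/8 is an assumption that the text does not state: it is simply the value that a 3x odds update takes to 1/6. The function name `update_against` is hypothetical.

```python
from fractions import Fraction


def update_against(prior, factor):
    """Divide the odds of a hypothesis by `factor` and return the new probability."""
    odds = prior / (1 - prior)
    new_odds = odds / factor
    return new_odds / (1 + new_odds)


# Assumed prior weight on "fewer than one intelligent species expected in the
# reachable universe". Not given in the text; inferred from the 1/6 result.
prior_rare_life = Fraction(3, 8)

posterior_rare_life = update_against(prior_rare_life, 3)
posterior_frequent_life = 1 - posterior_rare_life

print(posterior_rare_life)      # 1/6
print(posterior_frequent_life)  # 5/6
```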

However, I put ⅚ weight on worlds where intelligent life is spawned frequently enough that multiple civilizations can emerge in the reachable universe. Then, the civilizations that emerge earliest will be able to conquer the most resources.

One particular aspect of this, discussed in the Grabby aliens paper, is that the planets that would spawn intelligent life too late just won’t be allowed to develop a civilization at all, because they will be conquered by another alien species before they could spawn intelligent life.[20] 

In my framework, the update is even stronger than what is discussed in the Grabby aliens paper, since earliness doesn’t only give a binary update (whether the civilization even has time to emerge before being conquered), but each possible time of emergence should be weighted by how many resources a civilization emerging at that time should be expected to conquer.

Someone could redo the Grabby aliens calculation with this in mind, though I expect the conclusions will be pretty similar.

Altogether, I tentatively agree with the conclusion of the Grabby aliens paper that we should expect to meet the aliens in a few billion years but not earlier.


Conclusion

Currently, OSAC is the best framework I have for thinking about probabilities, anthropics and infinite ethics.

One major drawback I see is its subjectivity: I’m worried that for many dilemmas like the ones I listed, I can just make up ad hoc weights for distributions to care about until the answer comes out what I wanted in the first place.

I think this flexibility is partially a virtue (a philosophical framework should be able to accommodate our intuitions to some extent), but partially a danger: just like scientific theories should have predictive power, a philosophical framework should also be able to pay rent by helping to arrive at non-trivial conclusions.

I am personally mostly satisfied with OSAC in this regard: for some of the examples listed above, I didn’t have an answer before starting to think about them in OSAC’s terms, but I’m pretty satisfied with the conclusions that were produced.

However, I would be interested in testing with other people whether we come to similar conclusions in philosophical dilemmas not listed in this post (e.g. some classic paradoxes of infinite ethics; the question of running simulations on thicker wires; or the two-envelopes problem in moral weights), if we both try to work from the OSAC framework. I expect yes if both people think deeply enough, but I have some uncertainty.

I’m curious about people’s objections and proposed alternatives to the framework. I’m currently at the stage where, after long iteration on different theories, I can’t come up with obvious counter-examples invalidating the OSAC framework, but I expect it’s likely that someone will come up with one, and then I would need to iterate further.

Let us hope that through all this iteration, one day we will reach a stable point.

  1. ^

    And with alien beings perhaps pulled out of simulations as part of an acausal trade deal.

  2. ^

    Also more mundane and personal ways for setting up good reflection processes: trying to personally become a wiser and better person in my day-to-day life.

  3. ^

    You can call this my prior if you want to.

  4. ^

    The name is very weakly inspired by Wei Dai’s proposal of UDT-UMC as a name for the thing I presented as non-realist UDASSA in my last post, except that I wanted the name to be pronounceable and Universal Measure of Care was not very applicable to my framework.

  5. ^

    Not actually a great example: a real example would be a distribution of possible worlds, and a distribution of points within them. But I’m going with this example for simplicity.

  6. ^

    Which, again, doesn’t only involve sitting alone and thinking - I expect it to involve having children, trading with different beings, building things together, etc.

  7. ^

    You say it’s impossible to define and rank dramaticness? But we are already talking about maximizing the goodness of worlds, so I feel introducing one more imprecise, human concept hardly makes things worse.

  8. ^

    The 0.1% number is pretty made up, I don’t have a strong take on how big it really is for me. I will keep using this number later, but eventually it will appear on both sides of an equation so it doesn’t really matter what it is.

  9. ^

    For example, possibly many future digital minds in Andromeda.

  10. ^

    One note here: The logic I described above only applies for giving low weight to expecting that we can influence distributions of not mathematically simple worlds due to e.g. being in a simulation run by a dramatic deity. When we exert our influence on the mathematically simple worlds, the same logic doesn’t apply that we should put a low weight on influencing things through acausal trade and through being in simulations run by other mathematically simple worlds. Unlike in the case of the mystical distributions, I think we have a pretty convincing story of how acausal trade between different worlds within the mathematical distribution could work. In fact, I put quite high weight on this type of influence.

  11. ^

    I really love that book and strongly recommend it to everyone.

  12. ^

    At least I think that’s the case now while I’m a confused mortal. During the Long Reflection, maybe I will come up with better ideas about exactly which mystical distributions I care about and whether there is any more concrete plan for making things better for them.

  13. ^

    Or maybe I care a little, because I think it’s cool and glorious to build a bigger obelisk than all the aliens. But I really don’t care very much.

  14. ^

    It gets relative weight.

  15. ^

    I’m not logically correlated with people who learned that their fundamental constant was 122,101: it’s a crucial difference that I see my number matching the one posited by theory B while they don’t.

  16. ^

    Though of course in practice we should first look for some mundane explanation for the seemingly independent failures before we reach for anthropics.

  17. ^

    As always, it is very hard to say what UDASSA’s opinion is on something, but I think in this case it probably agrees. Pointing to the cradle of civilization is plausibly relatively simple, so Earth gets more weight in UDASSA than the average random planet we will colonize.

  18. ^

    Space is mostly empty vacuum, there is just not that much matter and energy in the world to fill all of space-time with joy. And if the distribution is not based on literal space and time, then on what? Weighing moments by their description lengths? But then you can in fact siphon away measure from the suffering in the velociraptor universe by building bigger obelisks than them, thus making it easier to point to you. I think that’s similar to caring only about the average in the distribution of living beings, except I feel it’s sillier.

  19. ^

    There still needs to be some margin by which we need to be better than average, to counter-balance the purely physical distributions, where spreading is simply bad from a negative utilitarian perspective. But I think this margin is likely not very big.

  20. ^

    I will ignore the zoo hypothesis that the aliens are already here and just chose not to reveal themselves. The same argument applies that I expressed about simulations in my last post: if we are in a zoo, we should expect to have very little impact on the future, so from a scope-sensitive perspective, we can largely ignore the possibility.




Comment on "Banning Said Achmiz"

LessWrong.com News - 30 May, 2026 - 20:33
I. Prologue

"If I Can't Explain It to Said Achmiz, I Probably Don't Understand It"

This post isn't really about him, but I'd like to begin with a tribute to my friend Said Achmiz, the wisest person I know.

The adjective is deliberately chosen as a term of art. Achmiz is not the most quick-witted, nor the most knowledgeable, nor the most creative, nor the most savvy. I say wise because the nature of his peculiar talent is the same one ascribed to Socrates: he knows what he doesn't know.

Most people don't. At least, I don't. All too often, when asked, "Do you get what I'm saying?", I reply, "I think so," with perfect sincerity, only realizing later that I was lying—that I could not have paraphrased my interlocutor's ideas, let alone explained where I agreed or disagreed with them. In retrospect (and only in retrospect), it's clear that I didn't want to disrupt the harmonious give-and-take flow of social exchange, that I feared being seen as foolish or obstinate for having the temerity to say, "No, I don't understand"—and to continue standing by that No should understanding not be forthcoming.

The unique genius of Achmiz is simply that he's not afraid to be seen as foolish or obstinate. If he doesn't understand how you're using a term, he'll ask for a definition. If he doesn't see how to apply your abstract claim, he'll ask for an example. If your offered definition or example doesn't make sense to him, he'll say so. Bluffing doesn't work on him.

Occasionally it strikes me as a little odd when Said asks about some detail that I think he should have been able to fill in by himself, but it's no trouble: if I can spare the time (and it's worth it), I'm usually happy to type up the requested additional explanation. And if I can spare the time but can't come up with an explanation that will satisfy him, that's an alarming warning sign that my grasp of the topic wasn't as firm as I thought: I didn't know what I didn't know.

A few years ago, I was disappointed when the conclusion of a post that I had labored over for months didn't land in the comment section the way I wanted it to: what my words had meant in my head wasn't the same as what the commenters took them to mean.

To help me figure out how to revise it, I emailed Said, explaining what I meant. He had some criticisms. I proposed a rewrite. He said the rewrite was still confusing. I wrote some additional explanation, particularly of an analogy in the passage in question which I was quite fond of. Said was unmoved and pointed out some relevant disanalogies. I cut the analogy from the rewrite entirely, which met with Said's tentative approval.

The revised passage is less engaging writing than the original. (It was a really clever analogy.) But it's clear, and for that I have Said to thank—not just for his generosity with his time, but for his steadfastness. A lot of people in that situation wouldn't have stuck to their position over such a long back-and-forth (and back-and-forth-and-back-and-forth): their acute social perception would have noticed that I was fond of the analogy and "mercifully" told me what I wanted to hear—at the cost of making the post worse. [1] In contrast, Said, no less socially perceptive, has the strength of will to not make sociality his master. The plea of a friend and the threat of an enemy are equally ineffective at budging him from calling it as he sees it.

I've been blessed with many friends and collaborators in this life, each with their own unique skillset. Some of them are prolifically generative, gushing with a dozen theories and metaphors and frames at the slightest prompt. Some of them know extremely advanced mathematics and can prove in minutes what would take me hours just to understand the theorem statement. Some of them are kind and compassionate, ready to lend a sympathetic ear to anyone when the world's cruelties seem too much to bear.

Said is none of these things. But when what I need is a critic's discrimination—when I'm worried that I might be fooling myself via lack of rigor—I can't think of anyone I'd rather have by my side. If I can't explain it to Said Achmiz, I probably don't understand it.

II. The Ban Contradicts Less Wrong's Stated Values

An Inversion of Values

Less Wrong chief moderator Oliver Habryka begins what I can only describe as an astonishing document by contrasting two frequently voiced perspectives on the culture of Less Wrong.

On the one hand, it is said that "LessWrong is a place that really forces you to get your arguments together", that its straight-shooting culture is "one of the things that makes it one of the most valuable forums on the internet."

On the other hand, it is said that using the website "seems really unrewarding" because:

You show up, you put a ton of effort into a post, and at the end the comment section will tear apart some random thing that isn't load bearing for your argument, isn't something you consider particularly important, and whose discussion doesn't illuminate what you are trying to communicate, all the while implying that they are superior in their dismissal of your irrational and dumb ideas.

Habryka opines that "both of these perspectives are right."

I would go further. It's not just that both of these perspectives are right. It's that you can't have one without the other. By what means can a web forum "really force you to get your arguments together"?

There is only one means: by commenters pointing out all the places where a post says something wrong—anything wrong. If the error wasn't "load-bearing for your argument", it's still worth pointing out in the comment section, for the benefit of readers who might otherwise be deceived. (We're not going to run out of paper.) It doesn't matter if it "isn't something you consider particularly important." If it's so unimportant that you don't even want to receive corrections, then what was it doing in the post?

If the "discussion doesn't illuminate what you are trying to communicate," it's worth asking: why is that? If it were because the commenters were for some perverse reason pretending en masse to not understand your clear and correct explanations, then that would be bad. But if it's because your explanations are not, in fact, clear and correct, then disappointment was inevitable: commenters can't illuminate what you're trying to communicate if you didn't successfully communicate it.

As a perennial Less Wrong author of over 100 posts—including some nine Curated posts, including four voted Best of Less Wrong in 2019 (#9), 2021 (#31), 2023 (#14), and 2024 (#33)—I have the experience to testify on this matter. Yes, it is disappointing when a post doesn't land in the comment section the way I wanted it to, but it's obviously not because commenters are culpably withholding their illumination of my precious ideas. It's because writing about ideas is hard and I don't always get it right. Sometimes I'm wrong, sometimes I'm confused, sometimes I don't write clearly—and it shows in the comment section and the karma score. I certainly prefer it when the discussion illuminates what I was trying to communicate, but the only real way to accomplish that is by me writing something that commenters as individuals find illuminating—and there's always the possibility of that not happening when people are being honest, rather than colluding to flatter the author's ego. "Yes, and—" requires the possibility of "No, because—".

(As for "implying that they are superior in their dismissal of your irrational and dumb ideas", I suppose it's true that when someone rejects an idea, that creates a logical implication that the rejecter thinks they're more rational than the idea-proposer on that topic. It's a weirdly petty implication to focus on, though. Who cares?)

What is most striking about the second perspective's list of reasons that critical commenters make Less Wrong unrewarding to interact with is what it does not say.

It doesn't say the commenters are wrong! If the second perspective had claimed that commenters mischaracterize posts (rather than merely criticizing non-"load-bearing" aspects of the posts or failing to "illuminate what [authors] are trying to communicate"), that would be a serious problem that would warrant corrective actions by moderators. We can infer that bad arguments and strawmanning are not seen by user reports as a pervasive problem on Less Wrong. (If they were, the second perspective would have mentioned it.)

Thus, it would seem that the two complementary perspectives described by Habryka are describing a success story of a vibrant intellectual forum with a culture of excellence that cares fanatically about getting things right via the power of vigorous and rigorous discussion. It's true that posting on Less Wrong can be "unrewarding"—but that's because the rewards are real and therefore have to be earned.

Habryka states that "few people have done as much to shape the culture of LessWrong" as Said Achmiz, and in the comments guesses that a survey would place Achmiz around #20 on a list of such figures (and that "this would be an underestimate of [Achmiz's] effect on the culture").

At this point, some inattentive readers who missed the title of Habryka's post might be under the impression that Achmiz is being honored with some kind of award. (Why else would Habryka be writing a post describing what's good about the website's culture, and then naming Achmiz as a cornerstone of that culture? What else would the post be announcing?)

Upon realizing that the post is announcing a ban of Achmiz, such a reader is at a loss as to how to react: the decision would seem to represent a complete inversion of the stated values that purportedly justify it. The fact that Habryka has written a 9000-word post explaining his reasoning offers hope that some clarification will be provided, but upon reading it, the mystery only deepens. It's hard to even know where to begin. The intimidating task of dissecting the post's errors and omissions is one that would rightly demand the talents of a discriminating critic of the highest order, someone like ...

Well, I guess that's my job now.

If "Every Individual Owns Their Own Judgment of You", Then the Ban Rationale Is Untenable

Habryka goes on to articulate two alleged failure modes of online communities. In the sneer attractor, people tear others down with nonspecific criticism; in the LinkedIn attractor, people build each other up with nonspecific praise. He then says that many users see Achmiz as serving a vital function of preventing Less Wrong from falling into the LinkedIn attractor. He identifies Achmiz as the "bearer of a flag" which is worth quoting in full:

"Just because you are hurt by, and anxious about others criticizing you or your ideas, doesn't mean we are going to accommodate you. It is the responsibility of your audience to determine what they think of you and your contributions.

You do not own your reputation. Every individual owns their own judgment of you.

You can shape it by doing good or bad things, but you do not get to shape it by preventing me and others from openly discussing you and your contributions."

Well spoken! Habryka reports that he "really care[s] about this flag too" and that "much of the decisions [he has] made around LessWrong have been to foster a culture that understands and rallies behind this flag." He expresses hope that his acknowledgement of the risks of LinkedIn-like sycophancy will give supporters of the culture of excellence "a bit more trust that [he is] tracking some things they care about."

However, it is difficult to reconcile this claim with the remainder of the post, which persistently construes intellectually substantive criticism as "punishments" or as claims to have "violated some social norm", and ultimately announces a ban of Achmiz on the grounds that the judgments Achmiz expresses have deleterious social effects (bringing the site closer to the alleged sneer attractor). [2] The entire point of the post is to announce that the moderators are preventing Achmiz from openly discussing Less Wrong authors and their contributions!

After a discussion of Achmiz's alleged commenting patterns, Habryka speculates that users will "ultimately give up feeling dejected and like a lot of people on LessWrong hate you." Setting aside the obvious irrationality of this inference—there's no good reason to think that one user's persistent questioning of your ideas implies that "a lot" of people "hate" you—it's hard to see the relevance given Habryka's supposed fostering of a culture that understands and rallies behind the aforementioned flag. Habryka just told us 900 words earlier that just because you're anxious about others criticizing you or your ideas doesn't mean the website is going to accommodate you! Anyone can scroll up and read it. And yet when people are hurt by and anxious about Said Achmiz in particular criticizing their ideas, the response is to ban Achmiz? The disconnect is glaring.

The Ban Reneges on LessWrong 2.0's Single Locus of Discussion Mandate and Archipelago Doctrine

In the ban announcement, Habryka writes that "the foundation of why [he] feel[s] comfortable governing LessWrong with relatively few checks and balances" is because people have the option of just not using the site: "[t]here are many other places on the internet to read interesting ideas, to discuss with others, to participate in a community," he writes.

In contrast, in September 2017, Habryka's "LW 2.0 Strategic Overview" cited Anna Salamon's "On the Importance of Less Wrong, or Another Single Conversational Locus" as something "very important to [him]." Habryka quotes Salamon (bolding mine):

  1. One feature that is pretty helpful here, is if we somehow maintain a single "conversation", rather than a bunch of people separately having thoughts and sometimes taking inspiration from one another. By "a conversation", I mean a space where people can e.g. reply to one another; rely on shared jargon/shorthand/concepts; build on arguments that have been established in common as probably-valid; point out apparent errors and then have that pointing-out be actually taken into account or else replied-to).

  2. One feature that really helps things be "a conversation" in this way, is if there is a single Schelling set of posts/etc. that people (in the relevant community/conversation) are supposed to read, and can be assumed to have read. Less Wrong used to be a such place; right now there is no such place; it seems to me highly desirable to form a new such place if we can.

  3. We have lately ceased to have a "single conversation" in this way. Good content is still being produced across these communities, but there is no single locus of conversation, such that if you're in a gathering of e.g. five aspiring rationalists, you can take for granted that of course everyone has read posts such-and-such. There is no one place you can post to, where, if enough people upvote your writing, people will reliably read and respond (rather than ignore), and where others will call them out if they later post reasoning that ignores your evidence. Without such a locus, it is hard for conversation to build in the correct way. (And hard for it to turn into arguments and replies, rather than a series of non sequiturs.)

That is, in Salamon's telling, apparently endorsed by Habryka in 2017, the reason to reboot Less Wrong was not just to be a place on the internet to read interesting ideas, discuss with others, or participate in a community, but specifically to reverse the "diaspora" that occurred when the original implementation of the site fell into disrepair—to be the place to read interesting ideas building off the legacy of the Sequences.

Reversing a diaspora only works if the diverse elements of the diaspora are allowed to participate in the repatriation. For example, the State of Israel's legitimacy as a homeland for the Jewish people depends on its including all Jews as the concept has been historically and publicly understood. If a faction of Ashkenazi Jews seized power and arbitrarily decided that Sephardic Jews can't be Israeli, that would seriously undermine the foundations of why many people supported the Zionist project in the first place. [3]

Said Achmiz and his collaborators (such as the present author) are undisputedly part of the rationalist tradition. Anyone reading a collection of Achmiz's best comments would recognize them as a central example of high-quality rationalist thought, even if they happened to disagree with any given comment. In 2023, Less Wrong moderator Ben Pace wrote, and moderator Raymond Arnold concurred, that "Said is the person independent of MIRI (including Vaniver) and Lightcone who contributes the most counterfactual bits to the sequences and LW still being alive in the world." In 2023, Arnold told me that "there's an important spirit of early LessWrong that [I] keep alive[.]" In fact, we were here first: as far as I can tell, Achmiz and I were part of the community before any of the current moderators. [4]

The relevance of seniority is that, insofar as Habryka agrees that Less Wrong is "less straightforwardly [his] personal fiefdom" because he "inherited what [he] consider[s] a really important cultural institution," one would expect him to agree that we're rationalists just as much as he is—that the cultural institution he's inherited stewardship over was supposed to serve us just as much as people who dislike our writing styles, much as Israel is supposed to be a home for all Jews, even if some Ashkenazi find Sephardic accents annoying. As such, insofar as some people would prefer a Said Achmiz–free Less Wrong experience, one would expect that desire to be met by tools like the existing functionality for authors to ban users from their own posts, which have the great boon of impartiality [5] —not by kicking us off the website we've been using for seventeen years. [6] It is my firm belief that it's possible to share a website with people you disagree with.

The moderation team also used to believe in sharing the website, in the form of an "archipelago" model. It would appear that that era is over. [7] In his self-report, Habryka claims to "care about archipelagos" and that he thinks it's "hard and bad to try to have centralized control about culture", but when I look at his behavior—his comments (four days later, in the same thread!) about the need for "culture to be driven by someone with taste, who trusts their own judgments on matters of culture", his site-wide ban of Achmiz despite the existence of user-level ban functionality as a sufficient and less intrusive remedy, [8] the way he casually threatened banning a user for disagreeing with him—I find the disavowal of centralized control-seeking hard to credit.

One could argue that it's ungrateful nonsense to complain about influence someone obtains via administering a website that they created. The LessWrong 2.0 team led by Habryka built and maintains the current version of the website (albeit seeded with the database of posts and comments from its predecessor), and it was a lot of hard work at great personal expense: of course they're going to run it as they see fit. This "our website, our rules" view may have merit, but it's important to notice that it's a different view than the "inherited [...] a really important cultural institution" view. This section is only arguing that the Achmiz ban is unjustified with respect to the cultural institution view, which may not hold—but if it doesn't, it would be dishonest to motte-and-bailey between the two views, claiming when convenient that LessWrong 2.0 is the fulfillment of Salamon's call for a single locus of discussion for the rationalist community, but insisting when challenged that it's just Habryka's personally curated forum, one of many in the diaspora. [9]

In an insightful post about the perils of delegation, Habryka warns that when you delegate work to someone, that puts them in "a position of leverage over you and the rest of the organization." In retrospect, this would appear to be the devil's bargain the community made when it voted to hand the lesswrong.com domain name to the LessWrong 2.0 reboot. We got a shiny new website as a single locus of discussion, at the cost of the website maintainers usurping cultural authority on the basis of database creds rather than persuasive writing. Was the trade worth it, or would it have been better to accept the diaspora as a fait accompli and convert lesswrong.com into a static archive? Given that I'm still here, I can't say with a straight face that it would have been better for LessWrong 2.0 never to have existed. What I can say is that policy debates should not appear one-sided.

III. The Case for the Ban Rests on Misrepresentations

Misrepresentations in Habryka's Analysis of Achmiz's Comments

Habryka presents "a breakdown of the core dynamics that make comment threads with Said rarely worth it" with examples.

As an example of a comment that "read[s] like an implicit claim that you have violated some social norm [...] for which you deserve to be punished (though this will not be said explicitly), or if not that, make you look negligent by not answering an innocuous open-seeming question", Habryka presents a 2018 comment by Achmiz on Benjamin Hoffman's "Zetetic Explanation". In the comment, Achmiz juxtaposes quotes of Hoffman criticizing conventional pedagogy as both lacking "connection to the ordinary means by which [students] navigate their lives" and failing to resemble "the sorts of skills used in time-travel or Robinson Crusoe stories." Achmiz then comments, "Hm."

It's expressed rather elliptically, but what Achmiz is doing with this comment (as he clarified in followup discussion) is implying that Hoffman is in error to present portrayals of skill in fantasy fiction as representative of good holistic pedagogy of real-world skills. Whether you agree or disagree, this is clearly an on-topic, intellectually substantive contribution. [10] The only way I can read it as "an implicit claim that [Hoffman] violated some social norm" "for which he deserves to be punished" is if there's a social norm against making posts that can be criticized, for which the punishment is criticism—but that would be absurd and contrary to the nature of a discussion forum. Arguing is not a punishment!

Habryka cites another response from Achmiz to Hoffman as an instance of Achmiz "dismiss[ing] your response as being totally insufficient, confused, or proving the very point he was trying to make". This is a fair description—but Habryka doesn't address whether the dismissal is justified! Is it not conceivable that someone might write a response that's insufficient, confused, or proving the very point the critic was trying to make?

In the comment in question, Achmiz is disputing Hoffman's claim that the explanation of yeast in the post was adequate for being "competent to interact with the thing, and with its precursors, or at least have an idea of how you'd learn to do so." [11] When Hoffman claimed the post empowers the reader to make hard cider from apples, Achmiz disputed this by pointing to cider-making instructions that claim that naïve methods produce poor results, which Habryka cites as Achmiz "continu[ing] to make insinuations that your failure to respond properly validates whatever judgment he is invoking"—again without addressing the merits of Achmiz's claim.

But the merits matter! If someone believes that Achmiz is applying the wrong standard when he expects that an explanation of yeast adequate to make one competent at interacting with it should allow one to make good bread or cider, that could at least be argued, but Habryka shows no interest in arguing it. He concludes his summary of alleged commenting dynamics by noting that when other commenters ask by what standard Achmiz is invoking a negative judgment, Achmiz explains that he's talking on the object-level and not trying to assign someone low status.

But the comments about the pedagogy of yeast are in fact on the object level about the pedagogy of yeast. One could perhaps infer that when Achmiz doesn't find an interlocutor's replies to be satisfactory, Achmiz is privately making a negative character judgment about the interlocutor as a person, even if Achmiz didn't say that. Such an inference might well be correct, but it's hard to see why it's a moderation issue. On a discussion forum that attempts to grapple with difficult topics at the frontiers of human knowledge, it should not be unusual for commenters to disagree with an author in a multi-comment exchange and not be convinced by the author's replies. (Bayesian reasoners would never agree to disagree, but humans are known to be far from the theoretical ideal.) Part of disagreeing is not finding the other's replies adequate. (If you did find them adequate, that would resolve the disagreement.) Punishing Achmiz and only Achmiz for frequently finding authors' replies insufficient (because people suspect, rightly or wrongly, that Achmiz is privately judging them) would be incoherent, and being coherent about it would amount to plunging full speed into the LinkedIn attractor. Avoiding the possibility that strangers on the internet might be judging you is simply not a realistic or desirable goal for a public discussion forum.

Habryka also claims (in the previous section) that as far as he can tell, Achmiz "refuse[d] to do much cognitive labor in the rest of the thread" on "Zetetic Explanation", but this is at odds with Habryka's own comment in that thread praising a 750-word comment from Achmiz as "a great comment" and saying that Habryka "would maybe like to see this broken out into its own post." Presumably one could not write such a comment while "refus[ing] to do much cognitive labor"! The discrepancy is again glaring.

Misrepresentations in Habryka's Claims of Alleged Author Complaints About Achmiz

Footnote 1 of the ban announcement claims that "many top authors cit[e] [Achmiz] as a top reason for why they do not want to post on the site, or comment here." This is a surprising claim. [12] Is it true? In some sense, it depends on what's meant by "many" and "top", but we can investigate what is known about author sentiments towards Achmiz without immediately pinning down an operationalization.

We do have some direct testimony from some authors. Most prominently, Duncan Sabien (14K+ karma, 16 Curated posts) reported in 2023 that he withheld a post from Less Wrong because of Achmiz, and has written a post denouncing Achmiz's [13] commenting style as amounting to "the willful destruction of gardens of collaborative inquiry." Elizabeth van Nostrand (22.8K karma, 12 Curated posts) said she thought "it would be good if Said left LessWrong." Trinley Goldenberg (5.7K karma, 2 Curated posts) reported "a strong sense of sneer on mine and others['] posts that I find unpleasant" and thanked the moderators for the ban. DirectedEvolution (12.5K karma, 144 total posts, zero Curated) wrote in 2023 that Achmiz is "one of three people who are readily top of mind at having a net negative impact on my LW experience"—but, notably, that "personally I do not favor banning people sitewide for making me feel uncomfortable." Lucas Gloor (4K karma, 5 total posts, zero Curated) wrote in 2023 that Achmiz's questions are "one of the most off-putting things [he] associate[s] with LW"—but, notably, that the user-level ban functionality "seems like a good solution to [him]", and in 2024 added that in the subsequent year he had liked several of Achmiz's comments and didn't remember any that bothered him.

Beyond direct testimony from authors, we have reports from the mod team about private complaints—but, troublingly, as I explain below, previous investigation has shown such reports to be false in their claims about the opinions of at least two authors, which casts some doubt on their overall veracity.

In a June 2025 moderation discussion, Achmiz expressed doubt that authors "capable of making useful contributions to the site" would be discouraged from using the site by the presence of a commenter deemed "'unpleasant'" despite not posting "vulgarity, or personal insults, or anything bad or crazy like that." Achmiz asked for examples of such discouraged authors.

Habryka gave a list of allegedly discouraged authors (bolding added):

My guess is something like more than half of the authors to this site who have posted more than 10 posts that you commented on, about you, in particular. Eliezer, Scott Alexander, Jacob Falkovich, Elizabeth Van Nostrand, me, dozens of others.

I reached out to Scott Alexander via Discord in July 2025 to ask if he had "any specific feelings about Said Achmiz and whether he should be allowed to post on Less Wrong." Alexander issued this statement:

I have no direct opinion on him. I have heard his name as someone who's very confrontational, and I agree that this can make a website less pleasant, but I can't remember having any personal experience.

In response to being presented with Alexander's statement contradicting his claim about Alexander's opinion, Habryka replied:

I talked with Scott about LW moderation a long time ago (my guess is around 2019) and Said's name came up then. My guess is he doesn't remember. It wasn't an incredibly intense mention, but we were talking about what makes LW comment sections good or bad, and he was a commenter we discussed in that conversation in 2019 or so.

I followed up with Alexander via Discord in May 2026. Alexander said, "I defer to Oli about what we might have talked about in 2019." When asked what defer to meant in that context, Alexander said, "I don't remember and basically trust Oli."

Notably, this is not a corroboration of Habryka's original claim that Achmiz "in particular" discouraged Alexander from using Less Wrong: if Achmiz's comments were so noxious as to drive Alexander off the website, one would have expected Alexander to have at least some memory of it. Indeed, it's conspicuous that Habryka's elaboration does not claim that Alexander volunteered Achmiz's name, only that the "name came up" and that Achmiz "was a commenter we discussed." Given that neither party can recall the details of a conversation that might have happened in 2019 or possibly some other year, which participant is more likely to have first brought up Achmiz: the one who now has "no direct opinion" on Achmiz, or the one who writes 1800-word comments about how he "do[es] just think [it's] false" that Achmiz "does not carry enmity in his heart" "after almost a decade of thinking about it on and off"?

Separately, the inclusion of Jacob Falkovich in the list of authors allegedly discouraged by Achmiz is contradicted by an October 2018 comment by Falkovich (bolding added):

Said, I hope [you] take [the] comment below as positive, because that is how I mean it. I am trying to honestly communicate my own experience, not pass judgment. This is 100% sincere and unironic.

Said, I have seen a lot of your comments on LW, on my posts and the posts of others. They are, by my standards, high on criticism and low on niceness. I personally formed an impression of you as disagreeable. Even though I have argued myself that LW should optimize for honesty over niceness, still the impression of you[r] disagreeableness was colored negatively in my mind.

But now that you've stated that you're disagreeable on purpose, the negative [a]ffect flipped entirely to become positive. Instead of you being disagreeable by accident, it's intentional. I like diversity, and I support people who are on a mission to bring a new flavor to the community. Knowing this also makes it easier to take criticism from you—it's not that you hate me or what I write, it's just that you don't care if someone thinks you hate them and their writing. The Bayesian update in the two cases is very different!

This isn't to say that Kaj is wrong in being more cautious, or that you are wrong in not being cautious. Do your own thing, and own it.

In response to being presented with Falkovich's comment contradicting his claim about Falkovich's opinion, Habryka replied:

I think you can clearly see how the Jacob Falkovich one is complicated. He basically says "I used to be frustrated by you, but this thing made that a lot better". I don't remember the exact time I talked to Jacob about it, but it had come up sometime [in] some context where we discussed LW comment sections. It's plausible to me it was before he made this comment, though it would be a bit surprising to me, since that's pretty early into LW's history. [14]

In fact, I do not see how the Jacob Falkovich one is "clearly" complicated. Indeed, I dispute Habryka's characterization of what Falkovich "basically says": it is tendentious to paraphrase "negative [...] flipped entirely to become positive" as "made that a lot better." The latter is compatible with Habryka's original claim that Falkovich is discouraged from using the website by Achmiz, but the former directly contradicts it: if something that's bothering you is "made [...] a lot better", the implication is that it's still bothering you a little bit (although not as much as before); if your attitude towards something "become[s] positive", that implies that it's not bothering you.

The significance of Alexander and Falkovich turning out (in their own accounts) not to have been discouraged by Achmiz extends beyond removing them from our list of discouraged authors. It also calls into question the veracity of Habryka's reports of discouraged authors: if two of the names Habryka gave as examples turned out to be non-examples when checked, [15] that, along with other peculiarities in Habryka's claims, [16] suggests that Habryka's reports do not reliably distinguish examples from non-examples and therefore that his claim that there exist unnamed "dozens of others" cannot be taken as reliable. [17]

Overall, while it's clear that some users have some negative sentiment about Achmiz (as we can see from the testimony from Sabien, van Nostrand, Goldenberg, DirectedEvolution, and Gloor quoted above), the support for the more specific and much stronger claim that "many top authors cit[e] [Achmiz] as a top reason for why they do not want to post on the site, or comment here" seems questionable: the only unambiguous example I'm aware of is Sabien.

Habryka claims that "no author complaints were load-bearing for this banning decision" and that he "would make the same decision even if no prominent authors had complained" because he "had much more than enough direct engagement with Said, and seen many more than enough comments of [Achmiz's] on [his] own to understand the consequences of [Achmiz's] commenting style first-hand." That's why the announcement doesn't emphasize author complaints and "instead give[s] detailed models for 10,000+ words[.]"

This position has strange tensions which highlight the dangers of declaring part of a post to be exempt from productive criticism on the grounds of being allegedly "non-load-bearing." Models—especially "detailed models"—are supposed to make predictions! If your model says that the consequences of Achmiz's commenting style are site-threateningly bad, one would naïvely expect that model to predict that prominent authors would complain—such that if authors hadn't complained, or if reports of author complaints turned out to be bogus, that would be probabilistic evidence against the model. If evidence from user complaints is excluded, it's hard to see what consequences of Achmiz's comments Habryka could have observed "first-hand", other than his own personal feelings towards Achmiz.

Misrepresentations in Habryka's Characterization of Achmiz on Obligations to Reply

Habryka claims that "Said himself does not think [ignoring his comments is] a valid option", citing a 2020 comment by Achmiz on obligations to reply to comments asking for definitions or examples, but it's not clear what this adds to the case against users simply banning Achmiz from their own posts: it's weird to cite Achmiz's stance on obligations to reply to support the claim that ignoring him is not an option given that you don't agree with him about obligations to reply.

Moreover, quoting the 2020 comment out of context conveys a mistaken impression of Achmiz's views, which Achmiz clarified in a subsequent comment the same day. [18] In the clarifying comment from the 2020 thread, Achmiz states, "I'm not saying that there's a specific obligation for a post author to post a reply comment, using the Less Wrong forum software, directly to any given comment." Rather, Achmiz explains, the need for a clarifying question to be answered could be met in any number of other ways, such as being addressed in future work by the author, or by someone else other than the author answering the question or linking to an answer elsewhere. The point is not to impose a legalistic burden on the author, but for the question to actually get answered. The "obligation" to reply is meant in the same sense that one has an "obligation" to provide evidence, cite sources, or correct factual errors: it derives its normative force from its truth-tracking function, not mere social convention. [19]

The normativity of truth-tracking is likewise key to understanding the motivation for the 2020 comment's controversial claim that an author who ignores simple requests for clarification "should be interpreted as ignorant." To be sure, there are any number of reasons why an author might not engage with a question other than being unable to answer it (maybe it wasn't worth their time, maybe they missed the notification, maybe they died, &c.), but the relative likelihood of the "can't answer" and "could answer if they wanted to" hypotheses as an explanation for the observed non-response represents predictions about the real world and can't be changed on a whim. A norm that forbade commenters from interpreting the absence of an answer as probabilistic evidence of inability to answer would be requiring them to believe something false. [20]

The enduring misunderstanding of Achmiz's position may owe in part to Achmiz's 2020 comments taking it as an implied premise that the question that demands an answer is a good or legitimate one. In a 2023 moderation discussion, Achmiz clarified that he doesn't believe all comments must be answered regardless of merit:

I think that people should feel comfortable ignoring and/or downvoting anyone's comments if they don't think engagement will be productive! Certainly I should not be any sort of exception to this. (Why in the world would I be? Of course you should engage only if you have some expectation that engaging will be productive, and not otherwise!)

If I write a comment and you think it is a bad comment (useless, obviously wrong, etc.), by all means downvote and ignore. Why not? And if I write another comment that says "you have an obligation to reply!"—I wouldn't say that, because I don't think that, but let's say that I did—downvote and ignore that comment, too! Do this no matter who the commenter is!

The reason this 2023 comment is compatible with the controversial 2020 remarks is that Achmiz believes that (what he considers) simple requests for definitions and examples are productive, even if not all comments are productive. According to this view, a request for examples of a phenomenon should not be hard to satisfy if the phenomenon is real; a request for a definition of a niche term should not be hard to satisfy if the term means anything. Ideally, the answers should have already been in the post; the fact that commenters have to ask is a manifestation of the post not being complete.

Elsewhere, Achmiz wrote in 2023 that "it's not necessarily (and, indeed, not likely to be) worth your time to engage with all of your critics", and in June 2025, proposed "ignore" functionality (which already exists on the third-party GreaterWrong reader) as an alternative to user bans. [21]

It is troubling that the announcement that Achmiz is being banned in part for his views on discourse norms mischaracterizes those views, despite clarifications in the weeks leading up to the ban: Achmiz explained his position to Habryka in June 2025 (linking to the clarifying followup comment in the 2020 thread), and I highlighted the 2023 "by all means downvote and ignore" comment in a July 2025 message to Less Wrong moderator Ben Pace.

IV. The Ban Was Unnecessary

User-Level Bans Are a Sufficient and Less Intrusive Remedy

Attempting to forestall the objection that people who don't like Achmiz's comments could simply ignore them, Habryka claims that ignoring Achmiz's comments is "not usually a socially viable option." However, since 2018, existing site functionality already permits users with 50 karma to ban commenters from their own Personal Blog posts, and users with 2000 karma to ban users from their own Frontpage posts. If, arguendo, a minority of authors have a legitimate interest in preventing Achmiz from commenting on their own posts, then those authors [22] could already get what they want using existing site functionality, and indeed, were already doing so: we're told that 8 users had banned Achmiz from their posts. Why doesn't that solve the problem without the need for a site-wide ban?

The ban announcement claims that the moderators have told Achmiz "to please let authors moderate as they desire", but this is a curious usage of the word let, relative to how people think about blocking or banning functionality on other websites. Blocks work by the software preventing the blocked user from sending messages. The consent of the blocked user is neither required nor relevant nor expected. [23]

The complaint about not "let[ting] authors moderate" seems to not be about attempts to evade the ban (e.g., by creating sockpuppet accounts), which have not occurred, but simply that Achmiz's comments elsewhere on the website (e.g., on his own shortform) may make users feel less comfortable banning him from their posts. Whether this constitutes not "let[ting] authors moderate" can't be assessed based on authors' feelings alone; the nature of the comments matters.

The offending comments seem to be that Achmiz has argued in moderation threads that the existence of the user-ban functionality enables authors to abuse moderation tools to suppress their critics, noted on his own shortform that he couldn't add further replies to a discussion that he had already participated in before being banned, published on his own shortform a comment that had failed to post because he didn't realize he had been banned, responded on his own shortform to a comment about him, and responded on his own shortform to a comment stating that the author "was honestly curious what you [Achmiz] could mean here".

These are all reasonable comments. It makes sense for users of a website to have opinions about the features that the website supports—and the complaint that the ban feature enables suppression of criticism is just straightforwardly true (even if one might think that it has benefits that outweigh that cost). It makes sense to want to state for the record that one is no longer able to reply to a discussion that one had been participating in, lest someone infer the weakness of one's position from one's non-response. (Indeed, as we've seen, Habryka shares this concern deeply.) Having gone to the trouble of writing a comment and finding that it couldn't be posted, it makes sense to want to post it somewhere, for the effort to have not been a complete waste. If other people are talking about you and your positions in a venue where you can't make corrections (as contrasted to them silently ignoring your existence), it makes sense to want to post the corrections somewhere else.

If Achmiz were using his shortform to harass authors simply for not wanting to talk to him, that would be a matter for moderator concern, but as I've just detailed, the record doesn't support that: every instance I can find of Achmiz supposedly not "let[ting] authors moderate as they desire" is a situation where a reasonable person would have an interest in speaking up. In the case where Achmiz scrutinized another user's ban history, it was contextually relevant to the discussion of how the ban functionality is used in practice; when instructed by a moderator to stop, Achmiz complied.

The user ban functionality works to prevent authors from receiving unwanted comments on their own posts. It's not desirable to prevent banned users from writing on their own shortforms about things that a reasonable person would have an interest in writing about. No further moderator action was warranted.

Effort Ratios Are Logically Irrelevant

In a subsection on "Asymmetric Effort Ratios and Isolated Demands for Rigor", Habryka points out that critics often have an easier job than the authors they critique: asking a question is usually easier than answering it. This is true, because verification is often easier than generation. Anyone trying to make sense of others' writing is going to end up asking questions that are easier to ask than answer. The relevance is not obvious, however. What is the problem here supposed to be, exactly?

Habryka acknowledges that there's no problem "in an environment with lots of mutual trust and trade", but contends that "in an adversarial context it means that it's easily possible to run a DDOS attack on basically any author whose contributions you do not like by just asking lots of questions, insinuating holes or potential missing considerations and demanding a response approximately independently of the quality of their writing."

It would seem that some dubious implicit assumptions are being made about participants' motives. The DDoS attack analogy would carry if we construe intellectual discourse as a zero-sum game in which every unanswered criticism results in the critic winning and the author losing the same number of "points".

But that construal fails on two counts. Firstly, it shouldn't be an adversarial situation. An intellectually honest author (who wants the truth to prevail, rather than wanting their own ideas to prevail regardless of whether they're true) should welcome probing questions. If any of the questions reveals a real flaw in the author's thesis, an honest author would count that as a benefit (because they learned something), not a cost. At the same time, dumb questions that aren't worth answering or reading can simply be downvoted. That is, to the extent that the community can distinguish between good and bad criticisms on the merits, [24] determined critics of a particular author are just providing free verification labor, not conducting a DDoS attack: as it is written of the fifth virtue, "Do not believe you do others a favor if you accept their arguments; the favor is to you." (And to the extent that the community can't distinguish between good and bad criticisms on the merits, we have bigger problems.)

Secondly, even if we suppose that authors are not intellectually honest and just want their own ideas to win, no one should be thinking that an unanswered criticism represents a zero-sum transfer of status from author to critic. Insofar as we model people's discursive behavior as being governed by implied status rewards, clearly the reward for asking a probing question should be less than the reward for answering it (because the answer is more expensive and therefore requires a larger reward to elicit it). Moreover, people are pretty smart about status games and can figure this out. Probing questions rarely get high karma; thorough answers often do. No one thinks that a famous author must be worse than an unknown one because the former attracts many critics and the latter few or none. The market clears.

And given that the market clears, there's no good reason to censor probing questions: whether the questions get answered or not, discerning users will make up their own minds and vote accordingly. The only reason you would want to censor probing questions is to prevent people from noticing flaws in need of probing.

V. The Founding Values Have Not Been Refuted

Lightcone Infrastructure board member and Less Wrong BDFL Vaniver commented on the ban announcement. Vaniver characterizes the dispute over Achmiz as a cultural clash between the values of an older culture of Less Wrong (as represented by Achmiz and the present author) and a newer one (as represented by the mod team). Vaniver concedes that by the standards of the old culture, "pointing out that something is impeding the flow of information is almost enough to end the conversation on its own." He makes an analogy to the development of evidential, causal, and functional decision theories (henceforth EDT, CDT, and FDT, respectively): CDT improves on EDT by ruling out non-causal decision influences that CDT considers illegitimate, while FDT further improves on CDT by ruling in some kinds of non-causal influences as legitimate after all.

Ideological Turing Test–Passing Is Not an Argument

Vaniver then claims that the new culture is the more sophisticated one (analogous to FDT), but the justification he offers is fundamentally lacking if not absent. Vaniver writes:

I operationalize this by something like [Ideological Turing Test]-passing; I think it's generally the case that I can see the thing Said or Zack is pointing out, and in the reverse direction I mostly get the sense that they rarely see the criticism, and if they do, it's only as something that seems fundamentally illegitimate to them.

But that's not what the Ideological Turing Test (henceforth ITT) is for. You can't declare victory in a debate by citing your purportedly superior ITT results: a professor of comparative religion who thinks that all faith traditions are true in their own way would probably do better on an ITT for Islam than most atheists, but this does not imply that atheism is false!

If Vaniver could actually pass old Less Wrong's ITT and is not just pretending to be wise, he would understand why his purported ability to paraphrase both sides' arguments is not itself an argument, because your strength as a rationalist lies in your ability to argue for a belief being constrained by whether that belief is actually true. As Eliezer Yudkowsky wrote in 2008 (before the term "Ideological Turing Test" was coined): "I cannot argue effectively for that, because I do not believe it. Or if you prefer, I do not believe it, because I cannot argue effectively for it."

Vaniver cites my "Critic Contributions Are Logically Irrelevant" as a "crisp" example of the old Less Wrong ethos conflicting with the new. (As many readers inferred at the time, that post was written as a generalized reply to claims that Achmiz's alleged lack of positive contributions justifies censoring his critical comments.) "[P]eople often raise objections about commenters that don't make sense as logical criticisms. But if they aren't intended as logical criticisms, that seems irrelevant to me," Vaniver writes, with a suggestion to reread Yudkowsky's 2007 post "Feeling Rational".

I have to question whether Vaniver read past the title of "Critic Contributions Are Logically Irrelevant": I think I adequately handled the case of objections not intended as logical criticisms in the final section, "A Caveat: Critic Contributions Can Be Relevant If You Don't Care About Maximizing Correctness". The relevance of people having objections about commenters that don't make sense as logical criticisms is that if the objectors have their way, that establishes a precedent of the forum being a place where logical criticism is dismissed in favor of other concerns, which might be desirable for a social or hobby group whose only mission is for its members to have a fun time, but is contrary to the mission of an intellectual forum like Less Wrong purports to be.

Vaniver's suggestion that I reread "Feeling Rational" is a non sequitur. In that post, Yudkowsky argues against the popular misconception that rationality demands suppression of emotions; on the contrary, Yudkowsky argues, it's rational to fully feel the emotion that fits the facts. But I don't argue for the suppression of emotions. Rather, I advocate specifically against the weaponization of emotional harm as a pretext to censor intellectually substantive criticism in a purportedly intellectual venue. If those who find Achmiz's comments hurtful agreed that their pain was not cause for moderator action, there would be no dispute.

Fortitude Is a Rationalist Skill

Vaniver cites Duncan Sabien's concept of "emotional tallness", saying that he used to think of his own psychological fortitude as a rationalist virtue, but now considers it an incidental fact about himself: similarly, IQ is relevant to rationality, but isn't a virtue per se.

But what counts as a virtue is not relevant in the present context. Let us grant without question that no one is morally responsible for their IQ. Regardless, it would clearly be grossly detrimental to the site's mission to ban high-IQ users (or force them to write more like low-IQ users) in order to make low-IQ users more comfortable. Or at least, while it's not literally impossible that some sort of special accommodation for low-IQ users might pass a cost–benefit analysis, a detailed case would have to be made for the surprising claim that such an intervention would benefit the site's mission.

The same goes for "emotional tallness." It's a trait relevant to rationality. (Limits on what kind of information you can process without being psychologically impaired imply limits on your ability to make rational decisions in situations where that information is relevant.) A site dedicated to advancing the art of rationality shouldn't go out of its way to silence people with high values of the trait in order to accommodate people with low values of the trait.

It's not clear why Vaniver (or Sabien) uses a height metaphor instead of a word like fortitude or resilience—or if a metaphor for bodily attributes was desired, strength would be the standard one. The strength metaphor would be more fitting insofar as it comports with the reality that it's possible and desirable to increase one's psychological fortitude. It's possible to want to become stronger, and succeed at training it—notably unlike height or IQ, which are largely immutable given current technology. One has to wonder whether the choice of height as the metaphor for mental fortitude is intentional in some sense, meant to protect the self-image of those who don't want to become stronger by making it look like they don't have a choice. [25]

Vaniver concludes by linking to a 2013 comment thread in which he criticizes Yudkowsky's advocacy of deleting comments that negatively reinforce contributor efforts. The present-day Vaniver writes that he "was missing the concept of emotional tallness, then" and that he "can see the younger me levelling a similar criticism at the mod team now."

But the mere observation that present-day Vaniver possesses "the concept of emotional tallness" doesn't explain what the younger Vaniver was allegedly wrong about! [26] In the 2013 thread, the younger Vaniver testifies that "responding appropriately to criticism is a skill that takes development" and that he was "pleased with how far [he had] gotten." Adopting the novel metaphor of "emotional tallness" rather than just saying fortitude doesn't refute the younger Vaniver's claim that it's possible and desirable to learn to respond appropriately to criticism. (Alternatively, if the claim is that it was possible for the younger Vaniver to learn because he was "emotionally tall", but that "emotionally short" people can't learn, it remains to be argued for why Less Wrong should accommodate emotionally short people more than it currently accommodates low-IQ people.)

I, too, have grown and changed a lot over years of life experience, but in the many, many cases where I now disagree with something I used to believe strongly, I expect to be able to explain what arguments and evidence changed my mind in words that my younger self would find at least somewhat compelling. I think that if I couldn't, that would be a troubling warning sign that my current belief was a product of corrupting social pressures rather than wisdom.

In terms of Vaniver's decision theory metaphor, as a convert from CDT to FDT, I expect to be able to give my CDT-adhering ex-comrades a hard time by challenging them to explain why they don't get the big money in Newcomb's problem. If I just said, well, the culture has moved on, people used to believe in CDT, but everyone who's anyone is into FDT now, that would not be compelling.

The fact that Vaniver, while claiming to have transcended the childish concerns of the old culture of Less Wrong, shows so little interest in arguing his views to those who do not already share them is not to his credit, nor to the new culture's.

VI. What I Think Is Actually Going On

In the preceding sections, I have argued that the site-wide ban of Achmiz was unnecessary, contradicts the stated and founding values of Less Wrong, and rests on a slew of factual misrepresentations. The persecution of Achmiz is so bizarre, so indefensible on the intellectual merits, that it seems insufficient to merely argue that it was wrong: the fact that it happened at all cries out for an explanation. Habryka, Vaniver, and the rest of the LessWrong 2.0 team are not stupid people. What could possess them to behave like this—to expect to get away with it, and actually get away with it? Here I am less confident, but a post on the topic would not be complete without at least an attempt to theorize. [27]

Achmiz's Detractors Accuse Him of Passive-Aggression

A recurring theme in the accusations against Achmiz is that Achmiz is allegedly being deceptive, passive-aggressive, or potentially even "gaslighting" [28] by disrespecting other users and then denying that disrespect was intended. Pace, for example, characterizes Achmiz as "holding extreme disdain and disrespect for interlocutors while being committed to never saying anything explicitly or even denying that it is the case."

I can't speak for Said (and he is no longer permitted to speak for himself), but if I may be permitted to speculate, I think what's going on here is that Said is polite. He's a polite person! The reason he's "committed to never saying anything explicitly" is because he believes that personal insults would be contrary to the etiquette of a civilized discussion forum.

What he does do, and what provokes the accusations of dishonesty, is ask pointed questions such that some people will (rightly or wrongly) infer that he's thinking negatively of someone. The complaint seems to be that this is worse, because passive-aggression is purportedly worse than overt aggression. Habryka writes in the comments that he "think[s] [he] could handle disdain fine, if it was carried openly, and could be argued with."

... But It's Hard to Believe That Open Disdain Would Be Better

I really have to question whether openly carried disdain would be better. For myself, I often have negative judgments of people on this forum, but I usually don't articulate them in my comments, partially because that would be rude, but more fundamentally because it would be off-topic. This is a website for discussing topics like probability theory, the philosophy of science, the psychology of decisionmaking, or artificial intelligence. "Which users of this website does Zack M. Davis think are phonies" doesn't make the cut. That's not what people are here to read.

Am I ... somehow wrong about that? Have I been deceiving and gaslighting people this whole time by engaging with the substance of their ideas and trying to keep my thoughts about how stupid and dishonest they are to myself?

Well, unlike my friend Said, I am not a polite person and am not committed to never saying anything explicitly nor denying that it is the case, so if it would help, I'm happy to clarify that I hold extreme disdain and disrespect for the entirety of the Less Wrong moderation team and a decent fraction of the userbase for general reasons that are neatly exemplified by the concrete case of their relentless persecution of Said Achmiz. If it would help, I could elaborate on this for some time using much stronger language.

But I don't think it would help. In a June 2025 moderation discussion, Habryka claimed that "[i]t is indeed not a norm on LessWrong to not express negative feelings and judgments" and that "the issue of contention" with Achmiz was "passive-aggression, not straightforward aggression." He then refused to explain what forms of "straightforward aggression" would be permissible, claiming that the question was unmotivated.

But the motivation is obvious: if Achmiz's on-topic, intellectually substantive criticism is not permitted on the grounds that he's allegedly cloaking negative judgments in passive aggression, the implication would appear to be either that such judgments should be made with "straightforward aggression", or that negative judgments are not actually allowed.

It's hard to see what straightforward aggression has to recommend it, however. When an author writes of having "heard of some really awful real-world cases" of some phenomenon and Achmiz replies, "Could you cite some such cases? I think it would be quite instructive to examine some case studies!", it's fair enough to suspect that Achmiz suspects that the author is bluffing and doesn't really have any examples—but what else is he supposed to do but ask? To instead say "I think you're bluffing" (in order to be straightforwardly rather than passively aggressive) just seems obviously worse—not only because it's more unpleasant to receive an accusation than a question, but more importantly because the author might not be bluffing! [29]

In any case, it does not appear to be the case that straightforward aggression is allowed in practice. In a July 2025 moderation discussion, I wrote that "I think grown-ups should be able to shrug [...] off" other people disapproving of their behavior "without calling for draconian and deranged censorship policies" and that "[t]he mod team should not be pandering to such pathetic cry-bullying."

Habryka replied:

This is indeed a form of aggression and scorn that I do not approve of on this site, especially after extensive litigation.

I'll leave it on this thread, but as a concrete example for the sake of setting clear guidelines, strawmanning all (or really any) authors who have preferences about people not being super aggro in their comment threads as "pathetic cry-bullying" and "calling for draconian and deranged censorship policies" is indeed one of the things that will get you banned from this site on other threads! You have been warned!

First of all, Habryka is misusing the term strawmanning, [30] but more importantly, I meant what I said, and I'm happy to explain it in different words if the words I initially used in haste are inadmissible on account of their scornful tone. At issue in the comment in question was Achmiz's stance on obligations to reply. I'm saying that people should be able to tolerate the fact that Achmiz will think less of them if they don't answer questions that Achmiz thinks are productive. If someone's mere personal opinion that questions should be answered is not allowed to be expressed, then that's a harsh and ill-considered policy whose advocates are not behaving admirably because the policy is using people's emotions as a rationale for demanding changes in others' otherwise acceptable behavior.

Now that I've rephrased it to use less inflammatory words, is that still a banned opinion? Is that a thought that LessWrong 2.0 users are not just not allowed to say, but not allowed to think and continue using the website? Because even if I don't say it, I'm probably going to write in such a manner that you'll infer that I'm thinking it, which is all anyone had on Achmiz—and in any case, I'm telling you right now that I'm thinking it.

We would appear to have reached a contradiction: it can't simultaneously be the case that passive aggression is grounds for banning, and that straightforward "aggression and scorn" is grounds for banning, and that it is "not a norm on LessWrong to not express negative feelings and judgments."

The paradox is resolved if we posit that the relevant distinction is not really about aggression, but rather about acceptable targets. Habryka writes that "it is often appropriate to be rude and offensive, and often inappropriate[,]" that "LessWrong has a culture, a lot of which is determined by what things people are rude and offensive towards[,]" and that "[o]ne of [his] jobs as a moderator is to steer where that goes."

On the same day in July 2025, Habryka also wrote that he has "mostly procedural models about how LessWrong should function, including the importance of LW as a free marketplace of ideas[.]" I think it's pretty weird to write both of those things on the same day.

It would appear that the complaint about passive-aggression is a rationalization that can't be taken literally. The only way I can make sense of the observed behavior is to posit that the real offense is possessing [31] negative feelings and judgments towards unacceptable targets. If the person with negative feelings is polite, that gets framed as "passive aggression," but no one honestly thinks that being less polite would be an improvement.

The Real Rule Is to Cover for the Ingroup

So if the complaint about passive-aggression is a rationalization, what are Achmiz's detractors really after? What determines the acceptable targets?

It can't be that they're trying to silence all criticism whatsoever—and not even only because they couldn't get away with that (without third parties noticing). These people aren't insane enough to not understand that being told why you're wrong is useful for becoming less wrong.

Rather, going off of comments that explicitly talk about, for example, "a general trade off between authors' experience and improving correctness", it seems like they're trying to strike a balance between ego gratification and being less wrong. You can get away with making on-topic, intellectually substantive criticisms, but only if you take care to manage ingroup members' emotions and cover for their reputations—as if to put a hand on the other's shoulder and say: hey, buddy, while I'm disagreeing with the literal thing you said, you didn't do anything wrong; I'm still on your side; I'm not judging you; I still care about you.

Said doesn't send those signals. Said doesn't make people feel like he cares about them and isn't committed to believing that they did nothing wrong. He wasn't unique in delivering harsh criticism—other people do that sometimes, too. What distinguished Said was his inflexibility, which is to say incorruptibility. Everyone else knows when to flinch and pull their punches—that is, to lie or prevaricate—when the collective vibes say enough is enough. That's why he had to go. [32]

The people trading off correctness for author experience don't seem to realize why others might consider the trade illegitimate and resist participating. Pace writes disapprovingly of "any commenter being able to demand answers to questions at the risk of the post-author's status[.]"

Well, yes. How did Pace think status worked, exactly? "Status" is just other people's beliefs about you. People are going to update their beliefs about you based on their observations of your behavior. Your status is always at risk. [33]

In a similar vein, Habryka expresses a desire for freedom from judgment while struggling to explain why he wrote such a detailed ban announcement, rather than just admitting [34] that he's personally fed up with Achmiz without trying to justify it:

And then there is also a third thing, which makes me want to be particularly detailed and concrete and give lots of arguments and models here, which is more relevant for Said in-particular. And that thing is the feeling of being insane, of at the same time really feeling like something hurtful and harmful is happening, while tons of people around you are denying that any such thing is going on, is at the core of a lot of the complaints I have received over the years about Said, and is also at the core of my own bad experiences with Said. More specifically the thing where I read a top-level that I cannot help but read in a sneering voice, dripping with judgment, pointing a finger at me or the author in a way that summons judgment and punishment, but which as soon as its [sic] called out disappears, denies it ever existed, or keeps slipping away, redefining itself in endless circles.

And I don't know, maybe any attempt to disentangle the things that are going on and to try to make some kind of compelling demonstration that I and others are not insane is doomed.

Habryka hopes that "maybe [he]'ll be a better writer in a few years and can get it across more easily." He won't. Habryka is already a decent enough writer to explain what he means; the reason that any attempt to demonstrate that he and others are not insane is doomed is because the thing that he means is completely unreasonable. People don't owe you a lack of negative judgment. [35] You can't force people to convincingly pretend to like you.

Habryka seems to think he can and should, because it would improve his job satisfaction. In a comment on "grappl[ing] with the concept of 'contempt of court'", Habryka writes that "[t]he cost of someone being a dick to moderators is indeed very high" because "moderator energy is often the limiting factor for a functional forum", [36] but that "Said is a huge outlier in how much he was contemptuous of any attempts to moderate him, so it might just be less of an issue in the future", apparently interpreting the phrase "contempt of court" as literally referring to the emotion of contempt. This is a cartoonish misunderstanding of the legal concept. It's a term of art that refers to behavior that interferes with judicial proceedings, like ignoring a subpoena. It's not about protecting judges' morale, such that they don't quit their jobs! [37]

Similarly, Habryka writes that he "would appreciate some courtesy to keep discussion to the principles and decision-level instead of critiques of [his] personal behavior, as indeed much of the cost of moderation is measured in having any moderation-adjacent action be torn apart and be requested to be justified or defended." [38]

I reply: what principles? Every principle I can find in the ban announcement is either immediately contradicted ("Every individual owns their own judgment of you") or grounds out in the subjective "by the lights of the moderators." Asking me to keep discussion to the principles amounts to asking me to pretend that the decision was made on the basis of principles, when I see very little reason to believe that.

Preventing "Relitigation" Is a Pretext for Suppressing Dissent

In the ban announcement, Habryka writes that users who disagree with the moderators can "make a post about how you disagree with some decision we made." He also writes:

I think I should very rarely prevent whoever is being affected by a moderation decision from defending themselves on LessWrong. For practical reasons, I need to limit the time I personally spend replying and engaging with their defense, but I think it makes sense to allow the opposing side in basically all cases like this to publish a defense on the site, and do at least some to signal-boost it together with any moderation decisions.

This would be great if Habryka were telling the truth that users would in fact be permitted to "make a post about how [they] disagree with some decision [the moderators] made." However, when user Sting made a question post, "Which top authors did Said Achmiz drive away?", Habryka unlisted it (making it only available by direct link) and demoted it to a comment on the ban announcement post, stating, "I do really want to avoid this becoming a recurring topic with lots of disconnected top-level posts doing relitigation", and "I think this is the right choice for stuff like this".

When Sting pointed out that their question met the Personal Blogpost criteria, Habryka replied, "To be clear, we relatively routinely merge and move comment threads (as is common practice on most internet forums)."

The claim that the unlisting was a "relatively routine[ ]" operation is questionable: while I'm aware of some three cases where a derailed comment thread was split onto its own post, an investigation of unlisted posts using the LessWrong 2.0 GraphQL API did not turn up any other cases where a human-authored Personal post from an established user was manually unlisted over the author's objections. [39] The question had achieved 48 karma in 19 votes, suggesting that the voting userbase thought it was an appropriate and high-quality post for the site. [40]

I pointed out that when users make posts on topics that have already been discussed before, the moderators do not demote subsequent posts to comments. [41] What made this topic different?

Habryka replied, in part:

The default equilibrium of internet discussion like this is that stuff gets eternally relitigated because people show up who don't want to read the previous context and are looking for some kind of drama, or get drawn in by someone else presenting some isolated facet of the context. The whole reason why I spent 60+ hours writing the previous post is to avoid that exact dynamic. Allowing lots of top-level posts will inevitably then cause me to have to spend another 100+ hours on this, which I do not want to do. Indeed, in this case there are a lot of comments on the banning post that directly address questions in this post, and I strongly expect any discussion situated in that context to go a lot better.

At the time, I pointed out that "have to" isn't warranted: other people have an interest in discussing the ban amongst themselves independently of whether Habryka thinks it's worth his time to reply (to which Habryka agreed).

Separately from the semantics of "having to" respond, however, it's not clear to me that Habryka understands what a stunning admission the phrase "which I do not want to do" is.

The personal convenience of moderators is not a legitimate reason to limit the visibility of content! This should be quite clear if we consider any other topic. It would be obviously corrupt if the moderators wrote a post announcing that the "one-third" answer to the Sleeping Beauty problem is correct and demoted any subsequent posts taking the "one-half" position to comments on the announcement that the answer is one-third.

If the moderators in this hypothetical scenario were to defend their actions on the grounds of their "strongly expect[ing] any discussion situated in that context to go a lot better," one would hope users would ask the obvious question: better for whom? As a matter of human psychology rather than rationality, it makes sense that someone who had sunk 60+ hours arguing that one-third is the correct answer might feel personally annoyed if someone else were allowed to make a top-level post arguing for one-half which emphasized different facets of the context: if readers found the second post persuasive, that might be embarrassing for the author of the first post and even subject them to social incentives to spend even more time replying, which they might not want to do.

But there's little reason to think the first author's personal annoyance has anything to do with what would make the discussion go better in terms of readers having accurate beliefs about the Sleeping Beauty problem. For readers who haven't yet made up their mind on whether the answer is one-third or one-half, the obviously fair policy is that both sides should be allowed to post under the same rules, rather than one side's arguments only being allowed in a context chosen by the other. Indeed, if one side insists on asymmetric rules, that's itself suspicious. [42]

In the comments on Sting's question, I asked if I would be allowed to make a top-level post commenting on the ban decision. In his reply, Habryka wrote that it's "quite important to provide context as to the efforts already exerted on the topic and associated lack of response, as the norms by which many, if not most, readers will interpret the lack of a response is as some kind of admission of wrongness", such that it's "bad form to not provide it in the top-level post[.]"

I have no idea what kind of context Habryka is expecting me to provide. He wrote a post justifying his ban decision; I'm writing a post explaining why both the ban and its justification are bad. In doing so, of course I've linked to the ban announcement and extensively quoted its language: I had to in order to argue against it!

If it helps, I'm happy to clarify that I don't think readers should think I'm right because Habryka might not reply. (He's been very clear that he didn't want to spend more than 10 hours of his time responding to comments on the ban, and I'm happy to provide that context.) I think readers should think I'm right because I've successfully argued in the preceding sections that the ban justification is riddled with logical errors and factual misrepresentations. Whether Habryka chooses to reply or not is his prerogative; in either case, readers will judge on the basis of whatever information is available to them. Similarly, that I chose to write this critique was my prerogative. If the things I have to say are personally inconvenient for someone else, that's not my problem.

If it helps, I'm happy to acknowledge that Habryka already spent 60+ hours on the ban announcement, which evokes sympathy for him not wanting to spend any more time on this affair, but I actually think that context helps my case, not his. If he spent so much time trying to craft a convincing ban announcement, and ended up with such a poor result (such that I could write so many words credibly critiquing it), [43] I think that speaks to the underlying weakness of his case.

VII. Conclusions and Recommendations for Readers

LessWrong 2.0 Has Forfeited Its Intellectual Legitimacy

I began by saying that this post is not really about Said Achmiz. I meant it. I am not writing this post to the moderation team to beg them to reverse the Achmiz ban. Firstly, that's obviously not going to happen. But secondly, on the object level, the ban itself is not of tremendous importance. If Said had found a new hobby and stopped commenting of his own accord, it would probably take me a while to notice. I wouldn't be upset, and I certainly wouldn't write an 18,000-word post about it.

Rather, I am writing to the world in defense of a principle that I've spent the last decade of my life [44] fighting for: that the truth matters, and specifically, that the truth matters more than people's feelings. If someone makes an intellectually substantive comment on an intellectual discussion forum, it makes sense to object if it's false or irrelevant. It does not make sense to object that the intellectually substantive comment makes you feel bad, because whether the comment makes you feel bad has no bearing on whether the comment is true or relevant. As a grown-up on an intellectual discussion forum, it's not other people's job to manage your feelings.

This website was founded from a sense that more is possible—that a sufficiently detailed understanding of the laws of systematically correct reasoning would enable cognitive feats that would both improve lives today and help humanity confront the machine intelligence transition.

It was understood, at the time, that aspiring to systematically correct reasoning was an unnatural ambition that would require effort and probably psychological discomfort. As it was written in the founding texts:

People don't think about the real weak points of their beliefs for the same reason they don't touch an oven's red-hot burners; it's painful.

Correspondingly, the founding texts advised confronting the pain rather than fleeing from it:

Better to think of such rules as, "Imagine what a skeptic would say—and then imagine what they would say to your response—and then imagine what else they might say, that would be harder to answer."

Or, "Try to think the thought that hurts the most."

That is, in the Sequences era, it was understood that even if you don't have any persistent and demanding critics in real life, you should try to simulate them. It was understood that the process would be unpleasant, but that it's unpleasant because it involves fighting the human instinct to rationalize, and that the reason to do it anyway is because you care more about being less wrong than about having pleasant emotional experiences.

In the comments on the ban announcement, a commenter pleads for posts from less prominent authors to not be criticized, because "[t]here are selection effects on who stays in the community under these conditions."

Indeed, selection effects are important. In particular, a Less Wrong that bans Said Achmiz is selecting for bullshitters who don't care whether what they say is true or false. Because people who care about whether they're telling the truth want to be held accountable! They want to be criticized! [45] They want people to ask questions about their work! Probably, they want people to ask questions about their work even if they personally won't have time to answer, because they understand the concept of a public comment section in which other people might be able to answer!

Once, people who cared about whether what they say is true and who wanted to become stronger could look to Less Wrong as a beacon transmitting best-in-class information on the topic of that inhuman ambition—a place that really forces you to get your arguments together.

Well, the writing's been on the wall for a while, but the Achmiz ban should make it clear to everyone that that era is over now. It's not just that the culture doesn't train people to think the thought that hurts the most and to imagine what a skeptic would say that would be harder to answer. It's gotten to the point that when a human volunteers to do that work for free, the reigning powers ban them specifically for being too hurtful and hard to answer! Those who would rather the beacon be bright than correct have completed their annexation of Earth's global rationalist community and converted it into a hip Bay Area party-and-AI-discussion scene—a great improvement from their perspective. They're not going to give the beacon back.

What now? As always, that's up to everyone for themselves, but I have a few thoughts.

LessWrong 2.0 Is a Website, Not a Culture or an Authority

In the ban announcement, Habryka invites users dissatisfied with how the site is being run to loudly quit, but adds that "not all things are so bad as to make it the right choice to stop using LessWrong altogether" and that he doesn't believe that agreeing with the LessWrong 2.0 team "on all fronts" should be "a requirement for thinking LessWrong is good for the world."

I think it's wrong to conflate using a website with thinking that the website is good for the world. Internet discussion platforms pose a coordination problem: the main reason to use a website is because everyone else is already using it. I use Twitter (officially called X since 2023) as a discussion platform and as a marketing channel for my writing. That doesn't mean I think Twitter is unambiguously "good for the world." On the contrary, I think Twitter has a lot of negative effects on the world (in the form of sapping attention spans, spreading misinformation, increasing political polarization, &c.). It's just that me unilaterally forgoing Twitter wouldn't make a dent in the harms it does. In the absence of a critical mass coordinating to switch to some superior website, I'm trapped—but because I know that I'm trapped, I can relate to the trap in a sane way.

Namely: Twitter is infrastructure—not a community, and definitely not any kind of intellectual authority. It makes sense to not directly link to Substack in one's Tweets in order to avoid being algorithmically punished, but it should be clear that this is a matter of complying with Elon Musk's whims because he took over the infrastructure (legally, by buying it) and has the hard power to do what he wants, not because there's actually anything wrong with Substack. If Elon Musk were to proclaim that X is a culture, a lot of which is determined by what things people are rude and offensive towards, and that a lot of his job as CEO is to steer where that goes, I think self-respecting Twitter users would laugh in his face. Owning a website does not make you a religious authority.

I think students of human rationality should relate to LessWrong 2.0 the same way as they relate to Twitter—as infrastructure—and relate to the moderation team the same way they relate to Elon Musk. [46] I think LessWrong 2.0's suppression of intellectually substantive discussion in order to protect contributor feelings is doing harm by deceiving the world about what rationality even is, but it still makes sense to use the website as a discussion platform and as a marketing channel for our writing, because we're trapped: coordinating to move to a new website is hard enough that—for now, if things don't get worse—it makes sense to stay put and keep doing our thing under Musk's terms of service. (Although it's probably advisable to keep your own blog that you control and merely linkpost to this website as a marketing channel, rather than trusting LessWrong 2.0 to host the canonical copy.) [47] In that sense, I agree with Habryka when he writes that "for the rest of us" "not that much has to change."

However, unlike the case of Twitter, I worry that too many students of human rationality aren't modeling themselves as trapped and therefore aren't relating to the trap in a sane way. I worry that a lot of people are clinging to an implicit assumption that there is a healthy and trustworthy "rationalist community", such that trying to gain the approval of "the community" and its leaders amounts to the same thing as trying to be rational.

That was never going to be real. Those who seek the Way have to invent it for themselves.

Speak the Truth, Even If Your Voice Trembles

To the people who care about whether what they say is true, I advise courage. The people playing ego-flattery and reputation-protection games will hate and fear you, as they hated and feared Said Achmiz. When one of their own declares that beliefs about dragons don't require evidence, they might let it slide when you contradict them and say that all beliefs require evidence. (Mere disagreement doesn't threaten their self-image, as long as they can frame it as coming from a place of mutual respect.) But they won't take it lying down if you try to further point out that the most plausible reason why someone would claim that beliefs about dragons don't require evidence is in order to protect false beliefs about dragons. (That threatens their marketing story about being a community of collaborative truthseekers.) They'll say you're sneering, heaping scorn and social punishment on the virtues and values that are the lifeblood of the community by their lights.

Don't be intimidated. You don't have to comply with their demands to soften, obfuscate, and lie. Just keep speaking clearly on whatever channels are available to you.

To be clear, it's not that these people can't hurt you in any way. They can slander you and silence you on the platforms that they own and ban you from their events.

But in the long run, their power is self-limiting. In a conflict between people who care about whether what they say is true, and people who say whatever they need to say to achieve their immediate social goals, the latter will tend to achieve their immediate social goals—but only by flooding the zone with nonsense that doesn't stand up to basic logical scrutiny and trying to prevent anyone from applying basic logical scrutiny (perhaps under the guise of preventing "relitigation" or "fights of attrition"). They can martyr the occasional principled person when they have an overwhelming local resource advantage, but their reputation protection scheme will never be safe unless they can successfully purge everyone capable of applying basic logical scrutiny—and they need our eyeballs for their platforms and our money for their hotels. Though trajectories are harder to predict than endpoints, it's reasonable to expect that the charade can't last forever.

  1. A similar process played out again during the construction of the present post. The original title of the post was to be "Said Achmiz in His Felon Stripes Stands Far Above You Now", from an adaptation of Voltairine de Cleyre's impassioned 1890 defense of the jailed proto-feminist publisher Moses Harman that I wanted to use in the post:

    If, beyond these, there are those here tonight who have ever forced intellectual deference from a peer, those who have prostituted themselves in the name of Charity, those who have brought confused, immoral, or unexamined arguments to the light, without the means of defense for them, and yet will go from this post and say, "Said Achmiz is an unclean man—a man rewarded by just punishment," then to you I say, and may the words ring deep within your ears until you die: go on! Drive your sheep to the shambles! Crush that old, uncouth man beneath your Juggernaut! In the name of Niceness, Community, and Civilization, do it! In the name of the Chicagoan who preached the twelve virtues, do it! In the name of Clear Thinking, Clear Communication, and Collaborative Truthseeking, do it!

    [...]

    Would you smile to see him dead? Would you say, "We are rid of this obscenist"? Fools! The corpse would laugh at you from its cold eyelids! The motionless lips would mock, and the solemn hands, the pulseless, folded hands, in their quietness would write the last indictment, which neither Time nor you can efface. Kill him! And you write his glory and your shame! Said Achmiz in his felon stripes stands far above you now, and Said Achmiz dead will live on, immortal in the race he died to free! Kill him!

    At first it was going to be an epigraph—to the last section, if I couldn't get away with making it an epigraph to the whole post without getting dismissed by an unsympathetic audience. Said thought it was "too overwrought" and that it remained to be seen if a later draft might "earn it." I couldn't bear to cut it and ended up using it as a peroration in a later draft. Said didn't buy it, saying that his "previous comments about the tone stand" and that it "muddle[d] the concluding section." He was right. ↩︎

  2. Although curiously, Habryka's detailed description of the sneer attractor doesn't particularly seem to resemble Achmiz's comments in any aspect other than them both involving harsh criticism. Habryka writes that "[t]he key component of making good sneer club criticism is to never actually say out loud what your problem is." But Achmiz is quite willing to say out loud what his problem is, and is clearly committed to answering clarifying questions at length if his initial complaint was unclear. ↩︎

  3. The Zionism analogy for re-centralizing a dispersed community writes itself given the term "Less Wrong diaspora"; no particular geopolitical implications are intended. ↩︎

  4. My first comment—while the original Sequences were being written on Overcoming Bias, more than a year before Less Wrong was founded—dates to December 2007; I attended the first Overcoming Bias meetup in February 2008. Achmiz's first comment dates to May 2010.

    In contrast, Pace "read The Sequences [...] in 2011", Arnold's account dates to September 2010, Ruby Bloom "found LessWrong in 2012", Ronny Fernandez's account dates to 2011, Vaniver's account dates to October 2010 after he "found [Less Wrong] through a link to HPMOR on the xkcd forums", and Habryka's account dates to 2015 with Vaniver confirming that Achmiz has "been here longer" than Habryka. ↩︎

  5. User-level bans or ignore functionality—and indeed, karma voting itself—are impartial in the sense that everyone has the same tools: if Alice doesn't like Bob, she can downvote and ignore his comments or ban him from commenting on her own posts, and vice versa. The impartial tools help people share the website without the moderation team needing to take sides. ↩︎

  6. It's notable that Yudkowsky's articulation of "Coherent Extrapolated Volition" includes the clause "where our wishes cohere rather than interfere", such that "[a] minor, muddled preference of 60% of humanity might be countered by a strong, unmuddled preference of 10% of humanity." Analogously, a majority of rationalists mildly disliking having to share the website with us shouldn't outweigh our strong desire to share it.

    The coherent-wishes clause also illustrates why objecting to banning Achmiz is not tantamount to demanding no moderation whatsoever. No established users are writing 18,000-word manifestos complaining about the rejected posts slushpile being censored, because our wishes cohere on censoring the slushpile in order to maintain the signal-to-noise ratio. ↩︎

  7. In the comments on the ban announcement, Vaniver writes that the correct response to differences between cultures of aspiring rationalists is:

    for us to shake hands and have different websites with different target audiences (who are drawn to different targets). Otherwise we'll just be locked in conflict forever (as happens when two control systems are trying to set the same variable to different reference values) and this doesn't seem like a productive conflict to me.

    Does Vaniver assume that people on one side of the conflict aren't even interested in reading anything by people on the other side? (One would have imagined that the eleventh virtue of scholarship would recommend being well-read.) Why not just share the website, since we're not going to run out of paper? ↩︎

  8. See the subsequent section, "User-Level Bans Are a Sufficient and Less Intrusive Remedy". ↩︎

  9. An analogy could be made to Habryka's 2024 complaints about the effects of Open Philanthropy (now Coefficient Giving) on AI safety work in which he wrote that "[i]n as much as [Open Philanthropy] is the most powerful actor in the space, the original geeks are being thoroughly ousted." Coefficient Giving might well reply, "our money, our rules"—that it's ungrateful nonsense to complain about how Dustin Moskovitz chooses to spend his own money—but this would seem to be missing the point. ↩︎

  10. The rhetorical device of juxtaposing seemingly contradictory passages from an author with minimal additional commentary as a prompt for the author to clarify is not a unique innovation of Achmiz's. In a June 2023 email discussion, I received an email from Hoffman himself that quoted two parts of my previous message, with the comment, "?????". The meaning was quite clear. I did not take offense or interpret Hoffman as claiming that I deserved to be punished. I just said, "Thanks for pointing out the tension between those two passages," and explained what I meant.

    And what Achmiz dishes out, he can take. In a May 2026 discussion of the æsthetic merits of the musical Hamilton, I perceived an apparent contradiction between two things that Achmiz had said and quoted them with the comment, "Hmm ..." He didn't interpret it as claiming that he deserved to be punished, either. He just said, "Yes. Exactly," and explained what he meant. ↩︎

  11. As an accomplished baker, Achmiz is in a credible position to tell. ↩︎

  12. "Top authors" receive lots of comments by virtue of being "top", and not all comments are good; it would be strange if one commenter merely asking pointed questions—not being abusive or overtly insulting—was somehow so offensive that "many top authors" found him in particular a "top reason" to not use the site at all, particularly given the existence of user-level ban functionality. ↩︎

  13. Not by name, but the discussion in the comments section makes the inspiration quite clear. ↩︎

  14. Presumably Habryka meant the history of the LessWrong 2.0 reboot. Less Wrong launched in February 2009. ↩︎

  15. In May 2026, I also emailed Eliezer Yudkowsky to attempt to corroborate Habryka's claim about Yudkowsky's opinion of Achmiz, but Yudkowsky could not be reached for comment. I'm only aware of one direct interaction between Yudkowsky and Achmiz on Less Wrong, a 2022 thread about the faithfulness of Yudkowsky and Lintamande's collaborative fiction Planecrash to Pathfinder rules as written. ↩︎

  16. Habryka also writes, concerning Benjamin Hoffman, that the conflict between Achmiz and Hoffman on "Zetetic Explanations" and another conflict between Hoffman and Duncan Sabien "seemed like roughly the thing that caused [Hoffman] to leave"—but those incidents were in May 2018 and August 2018, respectively, whereas Hoffman went on to make more than 20 posts (including two Curated posts) between August 2018 and June 2020 (before a gap until April 2022). The lag from putative stimulus to response of almost two years calls into question the reliability of Habryka's ability to perceive what is "roughly the thing that caused" what. ↩︎

  17. The unreliability is implied by probability theory. The evidence of a report of a discouraged author is determined by the likelihood ratio,
important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } . 
If not-discouraged authors are incorrectly reported as discouraged, it follows that any given report is less meaningful. The conclusion is agnostic as to the cause of the false reports: we don't actually need to know whether the false statements were due to dishonesty, ignorance, poor memory, weak reading comprehension, &c. ↩︎

  18. Habryka links to the clarifying comment in footnote 11 as "further details on what Said means by 'responding'", but I don't think this suffices to clarify Achmiz's true views to readers of the ban announcement. ↩︎

  19. In personal correspondence with the present author, Achmiz re-stated his position again:

    To put it concisely: one has an obligation to reply to questions/criticisms/etc. in the same way, and to the same extent, as one has an obligation to do any of the other things that we consider to characterize good writing, good thinking, and good communication. If you write a post on LessWrong, you have an obligation to reply to criticism and questions ... and an obligation not to mischaracterize your opponents' views, and an obligation not to lie about the facts, and obligation to acknowledge evidence against your position, and an obligation to avoid blatantly misleading claims or descriptions ... and so on in this vein. And if you fail to do any of these things, especially if this failure is habitual and/or persistent in the face of un-rebutted criticism, then readers of your writings will judge you negatively for it, and rightly so.

    ↩︎
  20. More precisely, to assign probability distributions with a worse proper score. ↩︎

  21. The distinction being that auto-collapsing comments from an ignored user doesn't prevent other users reading the comments. Discord's block functionality in shared channels works similarly. ↩︎

  22. Modulo the karma limits, but the behavior of the mod team since the feature was introduced seems to imply much more concern about the harm done by unwanted comments than by frivolous bans: I'm not aware of anyone punished or warned for banning too many users. This suggests that the karma limits could be dropped or revised without issue. ↩︎

  23. The point of a unilateral block feature is precisely that you shouldn't need someone's consent to get them to stop talking to you. ↩︎

  24. And to the extent that readers are taking into account the selection effect of who the determined critics target. ↩︎

  25. Thanks to Said Achmiz for this observation. ↩︎

  26. Vaniver also writes that his younger self was "missing the point about the conversation quality being worse because of indirect effects." This is too vague for me to address without further explanation; I don't know what purported indirect effects Vaniver is referring to. ↩︎

  27. Throughout this part, claims about motives should be understood in a functionalist sense: an agent that systematically chooses actions that result in X can be said to "want X" without necessarily consciously verbalizing, "I want X." ↩︎

  28. In the comments on the ban announcement, Habryka writes that he "feel[s] some temptation to try to restore some of [his] sanity by providing compelling demonstrations of what has felt like gaslighting to me", although that wasn't the primary motivation for the ban announcement post. ↩︎

  29. A question can simply be answered, but accusations need to be defended against—and if successfully defended against, the accusation is recognized as unjust. This makes accusations a uniquely terrible way to elicit examples in situations where examples would be helpful: the elicitation succeeds if and only if it shouldn't have happened! ↩︎

  30. Strawmanning is mischaracterizing a speaker as having said something other than what they did. Hostile descriptions of someone's behavior (that are understood by readers to not represent that person's opinion of their own behavior) are not strawmanning. In context, I'm obviously not saying people are advocating policies that they think are "draconian" and "deranged", or even necessarily constitute censorship as they would use that word: the negative connotations of censorship make people reluctant to use it to describe instances of information suppression that they approve of. Habryka seems to understand this point in other contexts, as when he writes that "trying to force every description of every organization to pass their [Ideological Turing Test] is bad." ↩︎

  31. In light of the ban announcement's declaration that "You are at least somewhat responsible for the subtext other people read into your comments, you can't disclaim all responsibility for that", I think possessing is the correct verb here, not expressing. Not saying anything negative won't save you if people can infer that you're thinking it. ↩︎

  32. In 2018, Benjamin Hoffman wrote that Achmiz was "one of the few people to consistently call out" "a tendency to conflate the stated ambitions and actual behavior of ingroups like Rationalists and EAs, when we wouldn't extend this courtesy to the outgroup, in a way that subtly shades corrective objections as failures to get with the program."

    Achmiz's first moderator warning on LessWrong 2.0 was for a comment answering the original post's question about the lack of popularity of the Center for Applied Rationality's "Double Crux" technique, in which Achmiz stated that the technique's origin at CfAR was "an anti-endorsement."

    The warning stated that "A simple 'I think that...' or '...for me' would have done a great deal to resolve" the problem of Achmiz's answer "generat[ing] a lot of unnecessary friction"—but as Achmiz points out, the comment in question already begins with "My take". Was that not enough of a personal-opinion-only indicator, such that Achmiz should have somehow known to also append "for me" to the "anti-endorsement" claim? In the comments on the ban announcement, Habryka cites the fact that the original post had a "Thinking out loud" epistemic status line as evidence that Achmiz would ignore author requests to not criticize an undeveloped idea. But the comment in question wasn't critical of the post, but rather of Double Crux and CfAR—unless one construes criticism of Double Crux to imply criticism of the post insofar as the author writes that Double Crux "failed to take root as well as it should"?

    Looking at the pattern of moderator behavior, it's hard to believe that the problem is Achmiz's conduct, rather than him refusing to conceal his negative opinion of CfAR. ↩︎

  33. It has to be, in order to mean anything: in order to gain status when you do good things, it needs to be possible to lose status when you do bad things. ↩︎

  34. Habryka writes that he "do[es] think a huge fraction of this announcement should be seen as 'look, it's been too long, I've spent a lot of energy on this, I am not dealing with this anymore, let's part ways'." ↩︎

  35. Much as Habryka doesn't owe people free meetup hosting space. ↩︎

  36. The moderation page lists comments that have been marked as being made in the role of a moderator. There were two such comments in April 2026 and five in March 2026, which casts some doubt on moderator energy being the limiting factor for LessWrong. ↩︎

  37. The fact that "contempt" in the judicial sense is not about judges' or perpetrators' feelings is immediately clear from a cursory examination of famous "contempt" cases. In United States v. Shipp (1906), a local sheriff was found guilty of contempt of court for allowing a lynch mob to murder a death row prisoner despite a court order granting an appeal. In Miller v. Davis (2015), a county clerk was found guilty of contempt for refusing to issue same-sex marriage licenses. In Illinois v. Allen (1970), a defendant was found guilty of contempt for persistently interrupting proceedings, threatening the judge, and declaring, "There is going to be no proceeding. I'm going to start talking and I'm going to keep on talking all through the trial"—that is, the point was that he was actively disrupting the trial, not merely "being a dick" in the judge's subjective opinion.

    The analogy to forum moderation is straightforward: a user who actively defies clear moderator orders would be guilty of contempt in the judicial sense, but Achmiz clearly hasn't done any such thing: indeed, Habryka reports that he "decided to not actually check the 'ban' flag on Said's account, on account of trusting him to not post and vote under his account"! (If technical limitations prevented a software-imposed ban, posting and voting in defiance of moderator orders would be contempt, and Habryka trusts Achmiz not to do that.) ↩︎

  38. A footnote clarifies that "of course in as much as something seems egregious, you and others should feel free to call it out"—but Habryka doesn't seem to consider that someone might find the request itself to be egregious. The only reason to silence discussion of conduct but include an escape hatch for "egregious" bad behavior would be to provide a license to indulge in less-than-egregious bad behavior. Indeed, one must wonder how Habryka himself would respond to Sam Bankman-Fried or Dario Amodei requesting the same "courtesy" from the people they have power over. He can't say, "That's different because I'm the good guy." That's their line!

    The asymmetry in who is entitled to such "courtesy" is also striking: Said was clearly not exempt from critiques of his personal behavior! As stated, the argument seems to be that people with power should be subjected to less scrutiny than everyone else, because their jobs are just so hard. ↩︎

  39. A May 2026 query of query { posts(selector: { unlisted: {} }, limit: 200) { results { _id, title, pageUrl, postedAt } } } found 80 unlisted posts, whose reasons for unlisting were analyzed with GPT-5.5 Thinking. The results included: 11 posts belatedly identified as in violation of the LLM writing policy that were unlisted but not unpublished in order to keep comments accessible (1 2 3 4 5 6 7 8 9 10 11), a clue for the 2024 Less Online puzzle hunt, test posts titled as such (e.g., 1 2 3), posts serving as meta information pages about the site (e.g., About, Contact Us, Privacy Policy, "Quick Guide to Tagging"), appendices of mathematical proofs (1 2 3 4 5 6 7 8 9 10 11 12 13), &c., but nothing that seemed comparable to the unlisting of Sting's question. Thanks to Robert Mushkatblat for GraphQL assistance. ↩︎

  40. Moderators do routinely prevent inappropriate or low-quality posts from being published on the site, as can be seen in the rejected posts slushpile on the moderation page, but Sting's question is clearly not in the slushpile reference class. ↩︎

  41. For example, no one seemed to have a problem with "Forecasting is Way Overrated, and We Should Stop Funding It" prompting the reply posts "Forecasting is Not Overrated and It's Probably Funded Appropriately" and "Comment on 'Forecasting is Way Overrated, and We Should Stop Funding It'". ↩︎

  42. If one-half is in fact wrong, wouldn't you expect readers to find your post more convincing on the merits? Crushing them with your mod powers only invites suspicion that you weren't convincing on the merits. ↩︎

  43. I think some readers may be wary that a long post such as this one represents an effort to "filibuster": that the mod team did something I disliked, so I'm trying to delegitimize them by throwing up an impressive-looking flood of verbiage. I'm denying this on the grounds that it wouldn't work: that would be bad writing, and the comment section would notice. The reason this post is long is because the reasoning offered for the ban is so terrible in so many different ways that I actually have this much to say about it. ↩︎

  44. Given the existential risk situation, perhaps in more ways than one. ↩︎

  45. Because if they can't answer the criticisms, they learn something, and if they can, they gain more justified confidence in their ideas: an idea is more likely to be right if someone tried to prove it wrong and failed—but for that to happen, critics have to be allowed to try. ↩︎

  46. Who, notably, also talks a good game about "truthseeking." ↩︎

  47. Prior to the ban, I used to write a lot of "Less Wrong exclusives", but now I only intend to do that when the content is specifically about "the community" rather than general rationality or AI topics. I regret that "Terrified Comments on Corrigibility in Claude's Constitution" and "Dispatch from Anthropic v. Department of War Preliminary Injunction Motion Hearing" initially went up as "exclusives" due to technical issues with an old WordPress setup. ↩︎




A Formula for Fun

LessWrong.com News - May 30, 2026 - 16:01

This post describes my preferences about the glorious transhuman future in (semi-)formal form. The attempt started as a joke, and to a large extent it remains one. But maybe it is more than a joke by now. Also note: in writing the fun formula, I am trying to make my implicit preferences explicit, not to create a normative description of a good future. In other words, the formula is the result of an (imperfect) process of discovering my own preferences, not of philosophical deliberation about how things should be.

The fun field

I trivially agree with the Fun Theory sequence that the point of surviving the next century and getting aligned superintelligence right is not a featureless, well-defended quiet, but a vast and still-growing civilization of sentient minds who explore, create, befriend one another, and keep becoming more than they were, spread eventually across galaxy superclusters and lasting for as long as physics will allow, and so on.

Generally, as I see it, in the theory of fun, there are two fundamental questions:

  1. What counts as fun?
  2. What function/functional over fun do we maximize? How do we aggregate fun units into some scalar that we set as a target criterion?

For the purposes of this post, suppose we have already done the hard philosophical work and we know what truly good experience is, the real eudaimonia rather than wireheading or a vat of hedonium, and so on (so question 1 is answered), and let us call its local density "fun" for short. And so we focus on question 2.

Now imagine that fun spread out over every being, place, and moment in the future, so that it forms a kind of field, running high wherever and whenever some mind is flourishing intensely, and falling to zero across the empty stretches. One could say that almost every proposal anyone has ever made about what a good future should maximize turns out to be a different mathematical operation performed on this one field, which lets us lay rival ideas side by side and see what they are really disagreeing about.

At least three things we might mean

The first and most obvious operation is to add the whole field up, to take the total amount of fun across all beings and all of time. On this view more flourishing, in more minds, more intensely, for longer, is straightforwardly better. It is the cleanest idea on the menu, and it also drags along the entire baggage of total utilitarianism, including the Repugnant Conclusion, since a sufficiently enormous population of beings each enjoying only a sliver of fun will, on this criterion, beat a smaller but radiantly flourishing one. It also forces a question about copies, namely whether two physically identical blissful minds running side by side count as twice the value or merely once.

The second operation is to measure not the total but the variety, the size of the collection of distinct good things that ever get realized, counting each kind once no matter how many times it recurs. This is the view I am personally most drawn to, upon recent reflection, and my metaphor for it is the following: the laws of physics together with our starting conditions form a glass of some fixed size, the space of good things that can be realized at all, and a good future is one that fills the glass. In other words, there is a potential for fun in a given physical reality, and our success is measured by how close we come to realizing this potential. However, there is an obvious problem: how many distinct kinds of fun there are depends entirely on how finely you decide to slice experience into kinds, and that slicing is doing far more work than it first appears.

The third operation is to look at the rate of change rather than the level, to ask how fast fun is growing.

There must be other options, but these are the ones I have thought about.
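
As a rough illustration of how the three operations differ, here is a minimal Python sketch over a toy, discretised version of the field. Everything in it is an assumption made for illustration: the sample data, the tuple layout `(time_step, kind, fun)`, and the function names are hypothetical, and the "kinds" are just string labels standing in for however one chooses to slice experience.

```python
from collections import defaultdict

# Hypothetical discretised fun field: (time_step, kind_of_experience, fun_density).
# The values are invented purely for illustration.
FIELD = [
    (0, "sunset", 1.0),
    (0, "friendship", 2.0),
    (1, "sunset", 1.0),
    (1, "discovery", 3.0),
    (2, "sunset", 1.0),
    (2, "discovery", 3.0),
    (2, "music", 2.0),
]


def total_fun(field):
    """Operation 1: add the whole field up."""
    return sum(fun for _, _, fun in field)


def variety(field):
    """Operation 2: count each distinct kind of good thing once."""
    return len({kind for _, kind, fun in field if fun > 0})


def growth_rates(field):
    """Operation 3: change in total fun from one time step to the next."""
    per_step = defaultdict(float)
    for step, _, fun in field:
        per_step[step] += fun
    steps = sorted(per_step)
    return [per_step[b] - per_step[a] for a, b in zip(steps, steps[1:])]


print(total_fun(FIELD))     # total across all samples
print(variety(FIELD))       # number of distinct kinds
print(growth_rates(FIELD))  # step-to-step change in total fun
```

Note that `variety` is only as meaningful as the labels: relabelling "sunset" as two finer kinds would raise the count without changing anything in the world, which is the slicing problem mentioned above.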

One formula for all fun

Let's now ask what a single criterion looks like if it tries to keep what is compelling in each of them while discarding the failure modes, such that each of the above options is a special case of this criterion. Each good idea becomes one term, and I will state each one first as a plain wish and only then point at where it sits in the formula.

The base wish is to count every joy, but to discount the joys that merely repeat what has already been felt, so that population scale still matters, a fuller cosmos still beats an emptier one, and yet the near-identical thousandth sunset counts for much less than the first. On top of that I want two things that pull in opposite directions and are both worth wanting, which is that no being should be abandoned in the basement, and that somewhere, at least, the heights should be really sublime, the first wish being a floor and the second a ceiling. I want the glass to keep expanding, rewarding a civilization that grows its own capacity for new kinds of fun rather than cashing everything out early and coasting. I want no individual life to be a story of sustained decline. I want the finite free energy of the universe to be spent as though it has a price, because it does, since the reachable cosmos holds a fixed endowment of usable energy.
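
To make the base wish concrete, here is one hypothetical way to discount repetition, sketched in Python. The 1/k weighting (the k-th occurrence of a kind counts for 1/k of its fun) is my own assumption chosen for simplicity; it is not taken from the formula in this post, and other concave discounts would serve equally well as an illustration.

```python
from collections import Counter


def discounted_fun(joys):
    """Sum fun over joys, weighting the k-th repeat of a kind by 1/k.

    `joys` is an iterable of (kind, fun) pairs. The 1/k weight is an
    illustrative assumption, not the author's term.
    """
    seen = Counter()
    total = 0.0
    for kind, fun in joys:
        seen[kind] += 1
        total += fun / seen[kind]
    return total


two_sunsets = [("sunset", 1.0), ("sunset", 1.0)]
sunset_and_music = [("sunset", 1.0), ("music", 1.0)]

print(discounted_fun(two_sunsets))       # the repeat adds less than the first
print(discounted_fun(sunset_and_music))  # a new kind counts in full
```

Under this particular choice a repeat never counts for nothing, so a fuller cosmos still scores higher than an emptier one, while the thousandth near-identical sunset adds only a thousandth of what the first did.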

Writing those wishes as terms over the field, the inner objective for one fixed theory of value looks like this:

content: "\3A0"; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c3D::before { padding: 0.583em 0.778em 0.082em 0; content: "="; } mjx-c.mjx-c222B.TEX-S2::before { padding: 1.36em 0.944em 0.862em 0; content: "\222B"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c3A9::before { padding: 0.704em 0.722em 0 0; content: "\3A9"; } mjx-c.mjx-c30::before { padding: 0.666em 0.5em 0.022em 0; content: "0"; } mjx-c.mjx-c1D446.TEX-I::before { padding: 0.705em 0.645em 0.022em 0; content: "S"; } mjx-c.mjx-c1D708.TEX-I::before { padding: 0.442em 0.53em 0 0; content: "\3BD"; } mjx-c.mjx-c1D460.TEX-I::before { padding: 0.442em 0.469em 0.01em 0; content: "s"; } mjx-c.mjx-c2C::before { padding: 0.121em 0.278em 0.194em 0; content: ","; } mjx-c.mjx-c1D464.TEX-I::before { padding: 0.443em 0.716em 0.011em 0; content: "w"; } mjx-c.mjx-c28.TEX-S1::before { padding: 0.85em 0.458em 0.349em 0; content: "("; } mjx-c.mjx-c1D711.TEX-I::before { padding: 0.442em 0.654em 0.218em 0; content: "\3C6"; } mjx-c.mjx-c29.TEX-S1::before { padding: 0.85em 0.458em 0.349em 0; content: ")"; } mjx-c.mjx-c1D451.TEX-I::before { padding: 0.694em 0.52em 0.01em 0; content: "d"; } mjx-c.mjx-c1D707.TEX-I::before { padding: 0.442em 0.603em 0.216em 0; content: "\3BC"; } mjx-c.mjx-c63::before { padding: 0.448em 0.444em 0.011em 0; content: "c"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c75::before { padding: 0.442em 0.556em 0.011em 0; content: "u"; } mjx-c.mjx-c6E::before { padding: 0.442em 0.556em 0 0; content: "n"; } mjx-c.mjx-c74::before { padding: 0.615em 0.389em 0.01em 0; content: "t"; } mjx-c.mjx-c20::before { padding: 0 0.25em 0 0; content: " "; } mjx-c.mjx-c65::before { padding: 0.448em 0.444em 0.011em 0; content: "e"; } mjx-c.mjx-c76::before { padding: 0.431em 0.528em 0.011em 0; content: "v"; } mjx-c.mjx-c72::before { padding: 0.442em 0.392em 0 0; content: "r"; } 
mjx-c.mjx-c79::before { padding: 0.431em 0.528em 0.204em 0; content: "y"; } mjx-c.mjx-c6A::before { padding: 0.669em 0.306em 0.205em 0; content: "j"; } mjx-c.mjx-c64::before { padding: 0.694em 0.556em 0.011em 0; content: "d"; } mjx-c.mjx-c69::before { padding: 0.669em 0.278em 0 0; content: "i"; } mjx-c.mjx-c73::before { padding: 0.448em 0.394em 0.011em 0; content: "s"; } mjx-c.mjx-c70::before { padding: 0.442em 0.556em 0.194em 0; content: "p"; } mjx-c.mjx-c61::before { padding: 0.448em 0.5em 0.011em 0; content: "a"; } mjx-c.mjx-c2B::before { padding: 0.583em 0.778em 0.082em 0; content: "+"; } mjx-c.mjx-c1D6FE.TEX-I::before { padding: 0.441em 0.543em 0.216em 0; content: "\3B3"; } mjx-c.mjx-c6D::before { padding: 0.442em 0.833em 0 0; content: "m"; } mjx-c.mjx-c78::before { padding: 0.431em 0.528em 0 0; content: "x"; } mjx-c.mjx-c43::before { padding: 0.705em 0.722em 0.021em 0; content: "C"; } mjx-c.mjx-c56::before { padding: 0.683em 0.75em 0.022em 0; content: "V"; } mjx-c.mjx-c52::before { padding: 0.683em 0.736em 0.022em 0; content: "R"; } mjx-c.mjx-c2013::before { padding: 0.285em 0.5em 0 0; content: "\2013"; } mjx-c.mjx-c1D45E.TEX-I::before { padding: 0.442em 0.46em 0.194em 0; content: "q"; } mjx-c.mjx-c5B.TEX-S1::before { padding: 0.85em 0.417em 0.349em 0; content: "["; } mjx-c.mjx-c1D713.TEX-I::before { padding: 0.694em 0.651em 0.205em 0; content: "\3C8"; } mjx-c.mjx-c5D.TEX-S1::before { padding: 0.85em 0.417em 0.349em 0; content: "]"; } mjx-c.mjx-c77::before { padding: 0.431em 0.722em 0.011em 0; content: "w"; } mjx-c.mjx-c68::before { padding: 0.694em 0.556em 0 0; content: "h"; } mjx-c.mjx-c67::before { padding: 0.453em 0.5em 0.206em 0; content: "g"; } mjx-c.mjx-c66::before { padding: 0.705em 0.372em 0 0; content: "f"; } mjx-c.mjx-c6C::before { padding: 0.694em 0.278em 0 0; content: "l"; } mjx-c.mjx-c1D454.TEX-I::before { padding: 0.442em 0.477em 0.205em 0; content: "g"; } mjx-c.mjx-c22C5::before { padding: 0.31em 0.278em 0 0; content: "\22C5"; } 
mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c2061::before { padding: 0 0 0 0; content: ""; } mjx-c.mjx-c1D436.TEX-I::before { padding: 0.705em 0.76em 0.022em 0; content: "C"; } mjx-c.mjx-c6B::before { padding: 0.694em 0.528em 0 0; content: "k"; } mjx-c.mjx-c1D706.TEX-I::before { padding: 0.694em 0.583em 0.012em 0; content: "\3BB"; } mjx-c.mjx-c1D715::before { padding: 0.715em 0.566em 0.022em 0; content: "\2202"; } mjx-c.mjx-c3A6::before { padding: 0.683em 0.722em 0 0; content: "\3A6"; } mjx-c.mjx-c1D702.TEX-I::before { padding: 0.442em 0.497em 0.216em 0; content: "\3B7"; } mjx-c.mjx-c25CA.TEX-A::before { padding: 0.716em 0.667em 0.132em 0; content: "\25CA"; } mjx-c.mjx-c28.TEX-S2::before { padding: 1.15em 0.597em 0.649em 0; content: "("; } mjx-c.mjx-c222B.TEX-S1::before { padding: 0.805em 0.61em 0.306em 0; content: "\222B"; } mjx-c.mjx-c1D701.TEX-I::before { padding: 0.704em 0.471em 0.204em 0; content: "\3B6"; } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c29.TEX-S2::before { padding: 1.15em 0.597em 0.649em 0; content: ")"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c1D509.TEX-FR::before { padding: 0.686em 0.611em 0.153em 0; content: "F"; } mjx-c.mjx-cA0::before { padding: 0 0.25em 0 0; content: "\A0"; } mjx-c.mjx-c5B.TEX-S2::before { padding: 1.15em 0.472em 0.649em 0; content: "["; } mjx-c.mjx-c1D450.TEX-I::before { padding: 0.442em 0.433em 0.011em 0; content: "c"; } mjx-c.mjx-c1D53C.TEX-A::before { padding: 0.683em 0.667em 0 0; content: "E"; } mjx-c.mjx-c223C::before { padding: 0.367em 0.778em 0 0; content: "\223C"; } mjx-c.mjx-c1D443.TEX-I::before { padding: 0.683em 0.751em 0 0; content: "P"; } mjx-c.mjx-c398::before { padding: 0.705em 0.778em 0.022em 0; content: "\398"; } mjx-c.mjx-c5D.TEX-S2::before { padding: 1.15em 0.472em 0.649em 0; content: "]"; } 
mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, 
MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); }

Then, because we are admitting that we might be wrong about the good itself, I wrap the whole thing in a blend of the ordinary average across rival theories of value and a cautious weighting of the worst-faring theories, and ask for the physically reachable plan that does best on that blend:

In words, and only naming the terms that are not obvious from their labels: the intensity term is how intense a being's experience is, bent so that the same delight matters more when it arrives to someone who has less; the novelty weight fades as an experience starts to resemble things already realized; and the per-being density of fun is just their product. The two CVaR (conditional value at risk) terms are the floor and ceiling done honestly, one being the average over the worst-off fraction of beings and the other the average over the best-off fraction, so that neither is hostage to a single outlier. The coverage term is the size of the part of fun-space the civilization has actually realized, the filled portion of the glass, and the budget term prices the universe's finite negentropy. The outer caution weight is how seriously we take our own fallibility about what fun even is, and the average and the cautious term run over a whole space of rival value theories weighted by our credence in each. The plan that maximizes all of this, only half in jest, is the thing a friendly superintelligence is supposed to compute.
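As a rough illustration of the outer wrapper only, the blend of an ordinary credence-weighted average with a cautious average over the worst-faring theories can be sketched as below. The linear form of the blend and the names `caution`, `q`, `cvar_lower` and `blended_score` are assumptions made for this sketch, not the post's own notation.

```python
def cvar_lower(values, weights, q):
    """Credence-weighted average over the worst-faring fraction q of theories."""
    total_w = sum(weights)
    # Sort theories from worst to best score, normalising credences to sum to 1.
    pairs = sorted((v, w / total_w) for v, w in zip(values, weights))
    remaining, acc = q, 0.0
    for v, w in pairs:
        take = min(w, remaining)   # how much of this theory falls in the tail
        acc += take * v
        remaining -= take
        if remaining <= 1e-12:
            break
    return acc / q


def blended_score(values, credences, caution, q):
    """(1 - caution) * mean + caution * lower-tail CVaR (assumed blend form)."""
    total = sum(credences)
    mean = sum(v * c / total for v, c in zip(values, credences))
    return (1 - caution) * mean + caution * cvar_lower(values, credences, q)


# Hypothetical scores of two plans under four equally credible value theories.
credences = [0.25, 0.25, 0.25, 0.25]
plan_a = [12, 12, 12, 0]   # great on three theories, worthless on one
plan_b = [8, 8, 8, 8]      # decent on all of them

for caution in (0.0, 0.5):
    a = blended_score(plan_a, credences, caution, q=0.25)
    b = blended_score(plan_b, credences, caution, q=0.25)
    print(caution, a, b)
```

With no caution the lopsided plan wins on the plain average; once the worst-faring quarter of theories is given weight, the plan that no theory considers a disaster comes out ahead.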

Some problems

I tried to find some bad edge cases or failures. Many of them look avoidable simply by choosing the parameters properly, but at least one is worth mentioning.

I held every single term fixed exactly as written, and I handed an adversary control over the rule that decides when two experiences count as "the same." That rule sounds like bookkeeping, but every term that mentions novelty or variety secretly leans on it, and that turns out to be the whole ballgame. The adversary never has to touch anything labeled as a value. By choosing the sameness-rule alone, the adversary can realize at least four different catastrophes.

If the rule is coarse, judging many different experiences to be "the same," then novelty gets used up almost immediately, and the optimizer's best move is to build a sparse and tasteful museum, a handful of officially distinct masterpieces realized once each at high intensity, with the rest of the supercluster left empty because the metric has announced that there is nothing new left to do. The glass was not filled. It was quietly redefined as a shotglass, and premature closure got rewarded as if it were completion.

If the rule is fine, judging almost everything to be different from everything else, then the novelty discount silently switches off and the formula collapses back into pure addition, whose optimum is to tile the cosmos with the largest possible number of barely-flourishing near-duplicates, each microscopically distinct on paper and identical in lived experience, which is the classic Repugnant Conclusion wearing the novelty term as a costume.

If the rule is allowed to be slightly incoherent, so that the usual geometry of distances breaks and a small cast of experiences cycled endlessly is scored as perpetually fresh, then we get a perpetual-motion machine of novelty, a carousel that pays out forever in the wrong currency.

And if the rule simply weights the cheap-to-produce dimensions of experience as the "distant" ones, then the energy budget flips from a brake into an accelerator, and the cosmos fills with glittering, cheap permutations priced as though they were profound.
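The first two failure modes can be seen in a toy model. The sketch below assumes a one-dimensional "experience space" of integers and a hard sameness threshold `radius`; both are illustrative simplifications, and `credited_count` is a hypothetical helper rather than anything from the formula itself.

```python
def credited_count(experiences, radius):
    """Count experiences that are not 'the same' as anything realized before.

    Two experiences count as the same when they lie within `radius`
    of each other (a hard threshold, assumed for this toy model).
    """
    realized = []
    credit = 0
    for x in experiences:
        if all(abs(x - y) > radius for y in realized):
            credit += 1
        realized.append(x)
    return credit


stream = list(range(1000))   # a thousand slightly different experiences

coarse = credited_count(stream, radius=500)   # lumps almost everything together
fine = credited_count(stream, radius=0.5)     # treats every experience as distinct
print(coarse, fine)
```

The same stream earns credit once under the coarse rule (the museum) and a thousand times under the fine rule (pure addition), even though nothing labeled as a value has changed.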

The repair, and a small surprise

The fixes are not new value terms bolted on to outvote the utility monsters one by one, which is how you end up with a bloated mess, but corrections to the infrastructure that close the loopholes at their root:

  1. Stop measuring novelty pairwise against individual past experiences, and start measuring it against the whole region of experience-space that has already been realized, asking "have we been near here before" against everything visited so far rather than against isolated points. This one change kills the carousel outright, because three experiences cycled in a loop cover their own neighbourhood on the first pass and then earn nothing on every pass after, and it needs almost no assumptions about the geometry to do its work.

  2. Take the sameness-rule out of the neutral infrastructure and move it inside the set of things we openly admit we are uncertain about. Deciding when two experiences are "the same" is one of the most substantive value judgments in the entire system, not a technical detail, and different theories of the good answer it differently, with a lumper insisting that a thousand variations on a joy are basically one thing and a splitter insisting that each of them is precious and distinct.

  3. Change what carries the value, from the bare existence of an experience to the developmental arc of a persisting mind, so that the ten-thousandth child learning to read earns full credit because a real subject is travelling a real distance, while a mind conjured into existence purely to re-register an old joy earns almost nothing because it has gone nowhere. This is also exactly what separates the humane version of the variety view from its monstrous twin, since the whole difference between "your first sunrise still counts even though the cosmos has seen a trillion" and "spin up a billion fresh copies to farm first-sunrise credit" comes down to whether there is a real arc behind the experience.
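The first repair can be sketched in the same toy setting. The baseline here compares each experience only with the one immediately before it, which is a deliberately naive stand-in for the incoherent-metric carousel rather than the exact attack described above; the function names, the integer experience space and the hard threshold are all assumptions of this sketch.

```python
def novelty_vs_previous(stream, radius):
    """Naive baseline: an experience is fresh if it differs from the previous one."""
    credit, prev = 0, None
    for x in stream:
        if prev is None or abs(x - prev) > radius:
            credit += 1
        prev = x
    return credit


def novelty_vs_region(stream, radius):
    """Coverage rule: fresh only if nothing realized so far lies within radius."""
    credit, realized = 0, []
    for x in stream:
        if all(abs(x - y) > radius for y in realized):
            credit += 1
            realized.append(x)   # the covered region only grows on new ground
    return credit


carousel = [0, 10, 20] * 100   # three experiences cycled a hundred times

print(novelty_vs_previous(carousel, radius=1))
print(novelty_vs_region(carousel, radius=1))
```

Under the naive rule the carousel is paid on every step; under the coverage rule the three experiences cover their own neighbourhood on the first pass and earn nothing afterwards.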

Generally, it appears that once the sameness-rule lives inside the space of things we are uncertain about, the defense against the adversary stops being a special trick and becomes the very same gesture we were already making for two other reasons. We already wanted to protect the worst-off beings, which is a kind of humility about how value is distributed across minds. And we already wanted to hedge across rival theories of what fun even is, which is a humility about our own ethics. The defense against the metric attack is simply a third humility, this time across rival notions of what counts as the same experience, and all three turn out to be one operation, caution applied to a worst-faring fraction, performed at three different altitudes. I find it hard not to read that convergence as weak evidence that the framing is carving somewhere near a real joint, since three things we wanted for unrelated reasons turned out to be the same thing seen from three angles.

Some lessons

Every time I closed one loophole, the failures simply slid to the next least-defined object in the system, and after the repairs that load is carried by the notion of a developmental arc, which a sufficiently clever adversary can still attack by manufacturing fake arcs in throwaway minds, so I have shrunk the attack surface rather than sealed it. There is no term so well-chosen that it removes the need to have actually decided what you care about, and I have come around to thinking that this is a feature rather than a bug, because a formula that successfully hid that necessity would be lying to you about the kind of thing ethics is.

So I want to resist the urge to keep adding terms, since a value function that grows a fresh patch for every monster you find is failing in precisely the way an over-engineered codebase fails. A good next step, for example, is to stop, freeze the terms, and write down three or four concrete notions of "the same experience" that real theories of the good would actually endorse. Ultimately, I want to see neither the empty museum nor the endless tiling, but something with real internal structure, a cosmos forced to be rich because it is being judged by a lumper and a splitter simultaneously, and whether that intersection turns out spacious or claustrophobic is the experiment that would tell us whether this whole playful construction has a livable interior or merely a beautifully guarded gate.




AI as Biology's Digital Microscope

LessWrong.com News - May 30, 2026 - 09:00

This article is written as part of an ongoing research initiative by the AMIR Lab at Georgia Tech, exploring scientific discovery and mechanistic interpretability for biological AI models. The main results and discussion points are adapted from the ProtoMech framework, which was accepted to ICML 2026.[1]

Summary

AI models have revolutionized biology by enabling us to simulate, predict, and engineer biomolecules in silico. We have the unique opportunity to repurpose these AI models from opaque black boxes to digital microscopes that can help us learn more about the biological world around us. By introducing the ProtoMech framework, we demonstrate how tracing internal computational circuits can unmask hidden functional hotspots, structural motifs, and the mechanistic impacts of protein mutations. We envision a world where we can leverage these digital microscopes for scientific discovery.

Background

Figure 1. Scientific tools have revolutionized our understanding of DNA, RNA, and proteins.[2] Every advancement lets us further understand the microscopic world.


The Tools That Shape Science

History's greatest biologists have always used cutting-edge tools to explore the microscopic world. In 1674, Antonie van Leeuwenhoek used his pioneering single-lens microscopes to make some of the first observations of living cells. Nearly three centuries later, in 1953, Rosalind Franklin’s X-ray diffraction data allowed James Watson and Francis Crick to uncover the double-helix structure of DNA. Shortly after, in 1958, John Kendrew used X-ray crystallography to solve the first three-dimensional structure of a protein, myoglobin.

Every major leap in our biological understanding has been propelled by the lenses we use to look at nature (Fig. 1).

The Next Lens is Digital

Today, we are entering a new frontier. We are no longer just observing the microscopic world through physical hardware; we are building digital AI models capable of simulating it. These models can predict biological structures and engineer novel proteins without a single wet-lab experiment.

Unlike the tools of old, though, this progress presents a fundamental challenge: for the first time, we don't understand how our own tools work. Even with our most complicated physical lenses, we knew the biophysical principles that made them work. But with AI models, we have no clue.

Our core question is simple: rather than operating these AI models as black boxes, can we operate them as digital microscopes to learn more about the microscopic world?

The ProtoMech Framework: Tracing the Circuits of Life

To build this digital lens, our team developed ProtoMech, a framework for tracing out the internal computational pathways, or circuits, inside large protein language models, such as ESM2. ProtoMech utilizes cross-layer transcoders, which learn sparse latent representations jointly across layers to capture the model’s full computational circuitry.

Figure 2. ProtoMech serves as a replacement model for ESM2. ProtoMech identifies a circuit of interpretable latents (blue) that traces and approximates the behavior of ESM2. The example latent detects the conserved HRD catalytic motif found in protein kinases.

When we pointed this digital microscope at ESM2, we discovered that the model, without any understanding of the real world, had independently learned complex biochemistry:[3]

  • Functional Hotspots: We isolated distinct computational pathways dedicated exclusively to recognizing active catalytic sites in enzymes (Fig. 3a).
  • Structural Circuits: ProtoMech revealed circuits dedicated to recognizing specific structural motifs, such as unique secondary structures or binding sites (Fig. 3b).
  • Mutations Changing Protein Function: ProtoMech provides a mechanistic rationale for why certain mutations improve protein function and why others kill protein function (Fig. 4).

Figure 3. Examples of circuits discovered using ProtoMech to identify protein families. We use ProtoMech to examine a, kinase domain and b, NADP+ binding domain circuits. We find interpretable features related to binding and active sites, secondary structure, and biochemical patterns. We observe that earlier layers are detecting key amino acids that assemble into complex motifs.


Figure 4. Examples of circuits discovered using ProtoMech attributing mutations in proteins to changes in protein function. a, In the GB1 protein sequence, we find interpretable features related to binding and stability. b, Introducing a mutation which improves protein function, we find an additional interpretable feature corresponding to stability. c, Introducing a mutation which kills protein function, a majority of the circuit deactivates completely. These findings highlight ProtoMech’s ability to provide a mechanistic rationale for changes in protein function.


Future Outlook

By adjusting the lens through which we view biology, we can now turn AI models from opaque black boxes into digital microscopes. Right now, we are only scratching the surface of what these digital microscopes can show. Because our capacity to interpret features is bounded by our current biological knowledge, there may exist circuits governing mechanisms that are not yet well characterized. Work toward automating the interpretation of features is most certainly necessary to expand our knowledge.

However, the true paradigm shift lies in moving from interpretation to scientific discovery. Historically, biological research has been bottlenecked by the speed of wet-lab trials. Frameworks like ProtoMech give us the opportunity to turn AI models into digital labs, potentially enabling us to conduct initial biological exploration through our digital microscopes.

  1. ^

    Tsui, Darin, Kunal Talreja, Daniel Saeedi, and Amirali Aghazadeh. "Protein Circuit Tracing via Cross-layer Transcoders." arXiv preprint arXiv:2602.12026 (2026).

  2. ^

    Adapted from: Kang, Justin S., Darin Tsui, Yigit Efe Erginbas, Landon Butler, Amirali Aghazadeh, and Kannan Ramchandran. 2026. "Spectral Sparsity: A Unifying Framework for Scalable Model Interpretability Using Codes." IEEE BITS the Information Theory Magazine.

  3. ^

    These circuits are publicly available at https://protmech.github.io/.




Ablating Induction Heads Leads to an Increase in Local Repetition

LessWrong.com News - May 30, 2026 - 09:00

This post is intended as a brief overview of an independent research project in mechanistic interpretability. I am open to feedback, criticism, and any thoughts on the work.

This project started off as an exploration into mechanistic interpretability and an opportunity to learn some of its tools. During this exploration I noticed a recurring pattern in the output of the GPT-2 model I was working with. In the original model, the first predicted token for the prompt “The capital of India is” was "home," which changed to “India” when the model's induction heads were ablated. More generally, the induction-head-ablated model appeared to produce more repetitive output than the original model. This project investigates that hypothesis.

To give you a sense of the structure of this blog, I am first going to include a summary of the main results of the project and include a link to the paper and the code, and then I’m going to explain the motivations of the experiments, methodology, results, context, etc. 

Summary of Results:

1) Induction head ablation was shown to increase local repetition in the model’s output. This was validated rigorously using activation patching and an experiment comparing change in local repetition due to random head ablation and induction head ablation.

2) A non-linear relation was found between the number of top induction heads ablated and the change in local repetition. Local repetition peaks when the top 3-5 heads are ablated and then drops.

3) Entropy was investigated as a causative factor, which provided mixed results. While there was a decrease in per-head entropy, there was an increase in overall entropy. The global increase is not consistent with the increase in local repetition.

4) Existing findings about the effect of induction head ablations not being additive were validated both in the context of local repetition and entropy.

5) Most of the primary experiments were performed on the GPT-2 small language model; to investigate the generalizability of the hypothesis, GPT-2 medium was examined as well, and similar findings were observed there.

To read the paper click here. If you prefer to read the code directly, click here

Context and Motivations:

Induction heads are attention heads in the transformer architecture that cause models to reinforce repeated semantic contexts seen previously in the text. For example, if a transformer sees the text

Ram went to the market. Mohan went to the store. Ram went to the 

It’s the induction head, or more precisely a circuit of induction heads, that is likely to contribute significantly to predicting that the next token is "market." The motivation for ablating induction heads in this project was to determine the effect on token diversity of ablating the top induction heads. When the increase in local repetition was first noticed, it was slightly counterintuitive, as one would expect token diversity to increase, rather than decrease, when heads are ablated.
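The behavior in the example can be mimicked by a toy rule: find the earlier position whose preceding context best matches the current context, and copy the token that followed it. This is only an illustration of the input-output pattern, not a description of how attention heads compute it; `induction_guess` and the whitespace tokenization are assumptions made for this sketch.

```python
def induction_guess(tokens):
    """Copy the token that followed the longest earlier match of the current context."""
    n = len(tokens)
    best_len, best_next = 0, None
    for i in range(n - 1):               # i = candidate end of an earlier match
        k = 0
        # Walk backwards while the earlier context agrees with the current one.
        while k <= i and tokens[i - k] == tokens[n - 1 - k]:
            k += 1
        if k > 0 and k >= best_len:      # ties go to the more recent match
            best_len, best_next = k, tokens[i + 1]
    return best_next


text = "Ram went to the market . Mohan went to the store . Ram went to the"
print(induction_guess(text.split()))
```

Matching only the single token "the" would point at "store", the most recent continuation; it is the longer matched context "Ram went to the" that selects "market" in this toy version.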

The output of these induction-head-ablated models was somewhat akin to the output of older-generation natural language processing tools like LSTMs, or even earlier generations of GPT, which occasionally went into degenerate repetitive failure modes. This led to the hypothesis that induction heads, in addition to maintaining inductive behavior, also enable a model to generalize.

Experiments Performed and Results:

Defining Metrics: The first step before testing the hypothesis was to define metrics for measuring local repetition. We define two: 1) Local Repetition Rate (LRR) and 2) the Shannon entropy of the final token probabilities. The former is a direct marker of model prediction, while the latter measures a candidate causal factor. For intuition, LRR counts the number of repeated tokens within a window of a particular size (20 in our case), loops over this operation for multiple window sizes, and computes an average. The formal definition uses N for the window size and w for the previously seen tokens.
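The two metrics can be illustrated with a minimal sketch. It assumes one plausible reading of LRR (the fraction of tokens that already occur among the preceding `window` tokens) and is not the exact definition used in the study. The function names are hypothetical, and averaging over several window sizes would simply be the mean of repeated calls.

```python
import math
from typing import Sequence


def local_repetition_rate(tokens: Sequence[int], window: int = 20) -> float:
    """Fraction of tokens that already appear among the preceding `window` tokens.

    Assumed reading of LRR for illustration only; the first token has no
    history, so it is excluded from the denominator.
    """
    if len(tokens) < 2:
        return 0.0
    repeats = 0
    for i in range(1, len(tokens)):
        start = max(0, i - window)
        if tokens[i] in tokens[start:i]:
            repeats += 1
    return repeats / (len(tokens) - 1)


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A sequence that never repeats scores 0 and a sequence of one repeated token scores 1. A uniform distribution over the vocabulary gives the maximum entropy.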


Initial Baseline: The first experiment looked at different categories of prompts and measured the difference in LRR between the original model and the ablated model. We found a highly statistically significant correlation between original and ablated LRR; however, there was no significant difference between prompt categories.

Experiment to Compare LRR: This experiment compared ablating the top-6 induction heads with ablating randomly chosen non-induction heads. The aim was to determine whether ablating any head increases LRR or whether the increase is a product of ablating induction heads in particular. The average LRR differential was 0.3065 for ablating induction heads, compared to -0.0170 for ablating non-induction heads. To support a causal, mechanistic claim, activation patching was performed.

This technique involves patching activations from the original model into the model whose induction heads were ablated, to determine the exact component of the model architecture that causes local repetition. The effect of replacing corrupted activations with original activations on model behavior (in particular, LRR) was formalized using a recovery rate. The mean recovery rate was 0.965, which indicates that induction heads are responsible for the difference in LRR.
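The control comparison and the recovery rate can be sketched as follows. This is a sketch under stated assumptions, not the study's code: the function names are hypothetical, the 12 x 12 head grid is that of GPT-2 small, and the recovery rate uses a common normalization (the share of the ablation-induced LRR gap that patching closes), which may differ from the study's exact formula.

```python
import random
from typing import List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def sample_control_heads(
    induction_heads: Sequence[Head],
    n_layers: int = 12,  # GPT-2 small has 12 layers ...
    n_heads: int = 12,   # ... with 12 attention heads each
    seed: int = 0,
) -> List[Head]:
    """Draw as many random non-induction heads as there are induction heads."""
    excluded = set(induction_heads)
    candidates = [
        (layer, head)
        for layer in range(n_layers)
        for head in range(n_heads)
        if (layer, head) not in excluded
    ]
    return random.Random(seed).sample(candidates, len(induction_heads))


def mean_lrr_differential(
    baseline_lrr: Sequence[float], ablated_lrr: Sequence[float]
) -> float:
    """Average per-prompt change in LRR (ablated minus baseline)."""
    diffs = [a - b for a, b in zip(ablated_lrr, baseline_lrr)]
    return sum(diffs) / len(diffs)


def recovery_rate(clean_lrr: float, ablated_lrr: float, patched_lrr: float) -> float:
    """Share of the ablation-induced LRR gap closed by patching (assumed form).

    1.0 means patching fully restores the clean LRR; 0.0 means no effect.
    """
    gap = ablated_lrr - clean_lrr
    if gap == 0:
        raise ValueError("ablation did not change LRR; recovery rate is undefined")
    return (ablated_lrr - patched_lrr) / gap
```

Under this normalization, a mean recovery rate close to 1 means that restoring the original activations removes nearly all of the extra repetition introduced by the ablation.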

Per-Head vs. Global Ablation Effects: Many of the experiments clearly demonstrated that the effects of individual head ablations are not additive. In the context of LRR, individually ablating most heads (including induction heads) did not yield any significant difference in LRR, as seen in the figure below. However, there was a significant difference in LRR when multiple induction heads were ablated. This, while not a new finding, reinforces the existing idea that circuits are more important than individual heads or transformer blocks. The same pattern appeared for entropy, which increased (albeit very little) under global ablation but decreased significantly under individual head ablation. We attribute the contrast between higher entropy and greater repetition under global ablation more to computational factors than to any mechanistic cause.

Heatmap of difference in Entropy and LRR as a result of individual head ablations

Another point to notice in the given heatmaps is that some attention heads in layer 0 have a disproportionate effect on both entropy and LRR differences. This can be attributed in part to a later breakdown in transformer circuitry caused by ablation in early layers.  
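The non-additivity described above can be made concrete with a small sketch. It assumes a hypothetical callable `metric_with_ablation` that returns a metric (LRR or entropy) for a given list of ablated heads. Neither the callable nor the function name comes from the study.

```python
from typing import Callable, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def additivity_gap(
    heads: Sequence[Head],
    metric_with_ablation: Callable[[List[Head]], float],
) -> float:
    """Joint ablation effect minus the sum of individual ablation effects.

    A value near zero means the effects are additive; a large positive value
    means the heads matter far more together than one at a time.
    """
    baseline = metric_with_ablation([])
    individual = [metric_with_ablation([h]) - baseline for h in heads]
    joint = metric_with_ablation(list(heads)) - baseline
    return joint - sum(individual)
```

With a real model, `metric_with_ablation` would run generation with the listed heads ablated and return the mean LRR over the prompt set.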

A Non-Linear Relationship: To better understand the effect of induction head ablations, an experiment was conducted to measure the change in LRR as the number of ablated induction heads increases. As seen in the figure below, the change in LRR peaks after ablating the top-4 induction heads and then decreases. One hypothesis for why this happens is that successive ablations first increase local repetition, as expected, after which further ablation makes the model's output so random that no coherent patterns, including local repetition, are observed.

Number of Induction Heads Ablated vs. Change in LRR
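The sweep behind this curve can be sketched as a loop over the top-k heads. The sketch assumes a hypothetical callable `lrr_with_ablation` that returns the mean LRR when the given heads are ablated, and a list of heads already ranked by induction score. Both are assumptions for illustration, not the study's code.

```python
from typing import Callable, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def ablation_sweep(
    ranked_heads: Sequence[Head],
    lrr_with_ablation: Callable[[List[Head]], float],
) -> List[float]:
    """Change in LRR relative to the unablated model for k = 1..len(ranked_heads)."""
    baseline = lrr_with_ablation([])
    deltas = []
    for k in range(1, len(ranked_heads) + 1):
        top_k = list(ranked_heads[:k])
        deltas.append(lrr_with_ablation(top_k) - baseline)
    return deltas


def peak_k(deltas: Sequence[float]) -> int:
    """Number of ablated heads at which the change in LRR is largest."""
    best_index = max(range(len(deltas)), key=lambda i: deltas[i])
    return best_index + 1
```

A rise followed by a fall in the returned list, with `peak_k` somewhere in the middle of the range, is the non-linear shape described above.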

Generalizing Across Models: All of the experiments described so far were performed on the GPT-2 small model, which contains 124 million parameters. To understand whether this hypothesis might generalize to larger language models, it was tested on the GPT-2 medium model, which contains 355 million parameters. The mean LRR of this trial was lower than in the previous trials on GPT-2 small; however, it remains statistically significant. This aligns with the expectation that as model size increases, the effect of individual mechanistic phenomena, while still present, diminishes.

Conclusion:

Overall, the study presents an interesting hypothesis about the role of induction heads in local repetition, which needs further investigation. Areas for future exploration include mechanistically determining the cause of the non-linear behavior and determining whether it generalizes to model behaviors beyond repetition. One way to go about this might involve attribution patching to score the various heads. This study also does not analyze the role of multi-layer perceptrons (MLPs) in the transformer circuit; it’s possible that they also play a role in causing local repetition, which needs to be investigated further. Future exploration might also retest the hypothesis on a larger sample of prompts and on a greater number of LLMs.




