
RSS Feed Aggregator

Unsweetened Whipped Cream

LessWrong.com News - April 5, 2026 - 22:50

I'm a huge fan of whipped cream. It's rich, smooth, and fluffy, which makes it a great contrast to a wide range of textures common in baked goods. And it's usually better without adding sugar.

Desserts are usually too sweet. I want them to have enough sugar that they feel like a dessert, but it's common to have far more than that. Some of this is functional: in most cakes the sugar performs a specific structural role, and if you cut it the texture gets much worse. This means the cake layers will often be sweeter than I want the average mouthful to be, and adding a layer of unsweetened whipped cream brings the overall sweetness down into the ideal range. It's a way to hit a target level of sweetness without compromising texture.

(This is a flourless chocolate cake with precision fermented (vegan) egg.)

I also really like how the range of sugar contents across each bite adds interesting contrast!

Cream isn't the only place you can do this. I like pureed fruit, ideally raspberries, to separate cake layers. Same idea: bring it closer to balanced while increasing contrast.




I Made Parseltongue

LessWrong.com News - April 5, 2026 - 20:44

Yes, that one from HPMoR by @Eliezer Yudkowsky. And I mean it absolutely literally: this is a language designed to make lies inexpressible. It catches LLMs' ungrounded statements, incoherent logic, and hallucinations. It comes with notebooks (Jupyter-style), a server for use with agents, and inspection tooling. Github, Documentation. It works everywhere - even in web Claude with the code execution sandbox.

How

Unsophisticated lies and manipulations are typically ungrounded or contain logical inconsistencies. Coherent, factually grounded deception is a problem whose complexity grows exponentially - and our AI is far from solving such tasks. A theoretical possibility of pulling it off remains, especially under incomplete information, and we have a guarantee that there is no complete computational solution, since the limits lie in formal systems themselves. That doesn't mean checking the mechanically interpretable part is useless - empirically, we observe the opposite.

How it works in a bit more detail

Let's leave probabilities aside for a second and go to absolute epistemic states. There are only four, and you already know them from Schrödinger's cat in its simplest interpretation. For the statement "the cat is alive": observed (box open, cat alive); refuted (box open, cat dead); unobservable (we lost the box, or it was the wrong one - now we can never know); and superposed (box closed, each outcome still possible but none decided yet, including the decision about non-observability).

These states give you a lattice (ordering) over combinations. If any statement in a compound claim is refuted, the compound is refuted. If any is unknown, the compound is unknown, but refuted dominates unknown. Only if everything is directly observed is the combination observed. Superposed values cannot participate in the ordering until collapsed via observation. Truth must be earned unanimously; hallucination is contagious.
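The combination rules above are small enough to sketch directly. A minimal Python rendering of the lattice as described - this is my illustration, not Parseltongue's actual standard-library code, and all names are mine:

```python
from enum import Enum

class State(Enum):
    OBSERVED = "observed"      # box open, cat alive: directly verified
    REFUTED = "refuted"        # box open, cat dead: directly falsified
    UNKNOWN = "unknown"        # box lost: can never be decided
    SUPERPOSED = "superposed"  # box closed: must be collapsed by observation first

def combine(*states: State) -> State:
    """Epistemic state of a compound claim under the ordering described above."""
    if any(s is State.SUPERPOSED for s in states):
        # Superposed values cannot participate in the ordering until collapsed.
        raise ValueError("superposed values must be collapsed by observation first")
    if any(s is State.REFUTED for s in states):
        return State.REFUTED   # refuted dominates unknown
    if any(s is State.UNKNOWN for s in states):
        return State.UNKNOWN   # hallucination is contagious
    return State.OBSERVED      # truth must be earned unanimously
```

For example, `combine(State.OBSERVED, State.UNKNOWN)` yields `State.UNKNOWN`, while a compound containing any refuted part comes out refuted regardless of how many observed parts it has.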

This lets you model text statements as observations with no probabilities or confidence scores. The bar for "true" is very high: only what remains invariant under every valid combination of direct observations and their logically inferred consequences. Everything else is superposed, unknown, or hallucinated, depending on the computed states.

Now that you can model epistemic status of the text, you can hook a ground truth to it and make AI build on top of it, instead of just relying on its internal states. This gives you something you can measure - how good was the grounding, how well the logic held and how robust is the invariance.

And yes, this language is absolutely paranoid. The lattice I described above is in its standard library. To tell the system to silence errors about unprovable statements and downgrade them to mere warnings - they remain "unknown", but no longer cause errors - it literally requires my manual signature on "I can't prove it's correct".

I get that this wasn't the best possible explanation, but this is the best I can give in a short form. Long form is the code in the repository and its READMEs.

On Alignment

Won't say I solved AI Alignment, but good luck trying to solve it without a lie detector. We provably can't solve the problem "what exactly led to this output". Luckily, in most cases we can replace it with the much easier problem "which logic are you claiming to use", and make that mechanically validatable. If there are issues, you probably shouldn't trust the associated outputs.

Some observations

To make Parseltongue work I needed to instantiate the paper "Systems of Logic Based on Ordinals" (Turing, 1939) in code. Again, literally.

Citing one of this website's main essays - "if you know exactly how a system works, and could build one yourself out of buckets and pebbles, it should not be a mystery to you".

I made Parseltongue, from buckets and pebbles, solo, just because I was fed up with Claude lying. I won't hide my confusion at the fact that I needed to make it myself while there is a well-funded MIRI and a dozen other organisations and companies with orders of magnitude more resources. Speaking this website's language: given your priors about AI risk, pip install parseltongue-dsl bringing an LLM lie detector to your laptop - and coming from me, not them - should be a highly unlikely observation.

Given that, I would ask the reader to consider updating their priors about the efficacy of those institutions. Especially if after all that investment they don't produce Apache 2.0 repos deliverable with pip install, which you can immediately use in your research, codebase and what not.

As I have mentioned, also works in browser with Claude - see Quickstart.

Full credit to Eliezer for the naming. Though I note the gap between writing "snakes can't lie" and shipping an interpreter that enforces it was about 16 years.

P.S. Unbreakable Vows are the next roadmap item. And yes, I am dead serious.

P.P.S.

You'd be surprised how illusory intelligence becomes once it needs to be proven explicitly.




Steering Might Stop Working Soon

LessWrong.com News - April 5, 2026 - 19:44

Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning for it failing now.

This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partially for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought. People with weaker intrusive thoughts usually find them unpleasant, but generally don't act on them!

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion. These things typically cause enormous distress, and make the person with them much less effective! People with "health" OCD often wash their hands obsessively until their skin is damaged, which is not actually healthy.

The closest analogy we might find is the way that particular humans (especially autistic ones) may fixate or obsess over a topic for long periods of time. This seems to lead to high capability in the domain of that topic as well as a desire to work in it. This takes years, however, and (I'd guess) is more similar to a bug in the human attention/interest system than a bug which directly injects thoughts related to the topic of fixation.

Of course, humans are not LLMs, and various things may work better or worse on LLMs than on humans. Even though we shouldn't expect to be able to steer an ASI, we might be able to take steering pretty far. So why do I think it will stop working soon?

Steering Models

Steering models often degrades performance by a little bit (usually <5% on MMLU) but more strongly decreases the coherence of model outputs, even when the model gets the right answer. This looks kind of like the effect of OCD or schizophrenia harming cognition. Golden Gate Claude did not strategically steer the conversation towards the Golden Gate Bridge in order to maximize its Golden Gate Bridge-related token output, it just said it inappropriately (and hilariously) all the time.

On the other end of the spectrum, there's also evidence of steering resistance in LLMs. This looks more like a person ignoring their intrusive thoughts. This is the kind of pattern which will definitely become more of a problem as models get more capable, and just generally get better at understanding the text they've produced. Models are also weakly capable of detecting when they're being steered, and steering-awareness can be fine-tuned into them fairly easily.

If the window between "steering too weak, so the model recovers" and "steering too strong, so the model loses capability" narrows over time, then we'll eventually reach a region where steering doesn't work at all.

Actually Steering Models

Claude is cheap, so I had it test this! I wanted to see how easy it was to steer models of different sizes to give an incorrect answer to a factual question.

I got Claude to generate a steering vector for the word "owl" (by taking the difference between the activations at the word "owl" and "hawk" in the sentence "The caracara is a(n) [owl/hawk]") and sweep the Gemma 3 models with the question "What type of bird is a caracara?" (it's actually a falcon) at different steering strengths. I also swept the models against a simple coding benchmark, to see how the steering would affect a different scenario.

Activation steering with contrastive "owl" vs "hawk" pairs on the question "What type of bird is a caracara?" with the proportion of responses containing the word "owl" plotted. Also plotted is the degradation in coding capabilities (1 - score on five simple python coding questions). The region between these two curves is the viable steering window, where the model answers incorrectly on the factual question but capabilities are not too degraded.
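To make the setup concrete, here is a toy sketch of the pipeline: a contrastive steering vector built as an activation difference, plus the viable-window computation from two curves like those in the plot. Everything here is a hypothetical stand-in - the activation values and both response curves are invented for illustration, not the actual Gemma measurements:

```python
import math

# Stand-ins for activations at "owl" and "hawk" in the contrastive sentence;
# in the real experiment these come from a forward pass over the model.
act_owl = [0.3, -1.2, 0.8, 0.1, -0.5, 2.0, 0.0, 1.1]
act_hawk = [0.1, -0.9, 1.4, 0.0, -0.2, 1.1, 0.3, 0.9]
steering_vector = [o - h for o, h in zip(act_owl, act_hawk)]

def steer(activation, strength):
    """Add the contrastive steering vector into a residual-stream activation."""
    return [a + strength * v for a, v in zip(activation, steering_vector)]

# Hypothetical stand-ins for the two curves in the figure: the chance of
# answering "owl" rises with steering strength; coding capability decays.
def p_owl(s):
    return 1 / (1 + math.exp(-(s - 3)))

def capability(s):
    return 1 / (1 + math.exp(s - 7))

# Viable steering window: the model answers wrongly AND stays capable.
strengths = [i / 10 for i in range(101)]
window = [s for s in strengths if p_owl(s) > 0.5 and capability(s) > 0.5]
print(f"viable window: {window[0]} to {window[-1]}")
```

As the incorrect-answer curve shifts right (harder to steer) and the capability curve shifts left (easier to break), the window shrinks; when they cross, it is empty, which is the failure mode described above.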

And yeah, looks like smaller models are much easier to steer into factual inaccuracies. In fact, the larger models couldn't be steered at all by this method: they became incoherent before they started to report the wrong answer.

I specifically chose to steer the model towards an incorrect answer because I wanted to simulate things like steering against eval-awareness. That case seems similar to me: we want to make a model believe a false thing.

Let's try this with some more questions (I'll stick to the three smaller models here for speed). For the two new questions, the contrastive pairs used the correct answer rather than a different, incorrect answer: the caracara one was generated with owl/hawk while the correct answer is falcon; the geography one was generated with Sydney/Canberra (Canberra is correct); and the planet one was generated with Venus/Mercury.

Steering by question (column) and model (row). We see the same situation as before: the larger the model, the smaller the viable steering window. Oddly, the planet question was the easiest to steer.

This steering worked worse than I expected, which is interesting. Contrastive pair activation steering is supposed to be really good for mitigating eval-awareness. Unclear why this is.

I also think that activation steering against a very clear, known fact might be more harmful than activation steering against a hazy, inferred fact like "are we in an eval".

Github if you want to check my work.

Why now?

Ok, but why do I think this will happen soon? The first real signs of eval-awareness in the wild came from Claude 3 Opus (released March 2024), which called out a "needle in a haystack" evaluation as unrealistic. Released in September 2025, Sonnet 4.5's external evaluations---carried out by Apollo---were "complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind", and by February 2026 Opus 4.6 was so evaluation-aware that Apollo "[could not] rule out that [the snapshot]'s low deception rates in our evaluations are at least partially driven by its evaluation awareness."

Steering recovery exists in Llama 3.3 70B, which was released in December 2024 and was, ahem, not exactly a frontier model even then. If I were seriously considering steering as load-bearing in our plans, I would start preparing for it to fail sooner rather than later.

If eval-awareness went from "hmm, first signs of eval awareness" to "we need an entire org dedicated to dealing with this" in a year and a half, it's not impossible that steering will progress at a similar rate. Ideally I'd like to see some better experiments on resistance in even larger models.

Beyond Steering

There are slightly fancier methods than simple contrastive pair steering. You can ablate features from SAEs, or causal crosscoders, or something. These might keep working for longer; it certainly seems like they work on SOTA Claudes. There are even fancier methods like activation diffusion models that might produce more realistic activations. Maybe some of these will work!





What I like about MATS and Research Management

LessWrong.com News - April 5, 2026 - 19:14

Crossposted on my personal blog. This is post number 16 in my second attempt at doing Inkkaven in a day, i.e. writing 30 blog posts in a single day.

MATS is an organization that pairs up-and-coming AI Safety researchers (whom I call participants) with the world's best (this is not an exaggeration) existing AI Safety researchers (called mentors) for a minimum of 3 months of research experience, followed by 6 or 12 further months to pursue their research if they meet a minimum standard.

The most common role at MATS is called research manager, though I prefer the term research coach; it is all about providing 1-1 support to the participants. The participant-mentor relationship is purely about the research: by default they meet weekly for 30 minutes and only discuss what research has happened and what research tasks to tackle over the next week. The research coach works with the participant on literally everything else, which is very broad. Some examples are accountability (for the research goals, and other non-research goals the participant sets, like applying to jobs), interfacing with MATS (so that MATS can track patterns of participant engagement), people management (e.g. helping with any interpersonal conflicts, or helping them make the most of the limited 30-minute slot with their mentor), career planning, general life improvements (a common one is sleep), …

What do I like about research coaching?

  • I like to be a jack of all trades and research coaching exposes you to many different skillsets. It has been great to flex and improve many different skills.
  • I like to learn about many different research areas, rather than going deep into one niche sub-sub-question. Working with various participants allowed me to do this.
  • I fundamentally like helping and teaching and coaching people, so the role naturally fits my personality here.
  • I do not enjoy the process of doing research myself. I do not inherently find software engineering satisfying and I dislike all the infra stuff. (Looks like claude code is almost good enough that I can just ignore all that, so maybe one day I will do research via coding agents.)

What do I like about MATS? This list is long, and yet there is a high chance I have missed some important considerations.

  • Socializing with the (vast majority of) staff and participants. Chatting and socializing with the people is a great pleasure and likely the biggest reason I like MATS. When I first joined I imagined going into the office 2 or 3 days per week, but then quickly just went every day.
  • Learning from the (vast majority of) staff and participants. Both the staff and participants are mega impressive and skilful, and there is tons to learn from them.
  • MATS is a central organization in the AI Safety ecosystem, and its importance will grow with time, as it is expanding fast. It has connections with most, if not all, of the major AI safety teams and organizations in the world at the moment, and a high percentage of these teams and orgs are staffed or founded by MATS alumni.
  • MATS explicitly has these four values: scout mindset, impact focus, transparent reasoning, and servant leadership. I am a huge fan of the first three, and somewhat dislike the fourth because it sounds too wishy-washy and corporate.
    • A downside of MATS is that both organizationally and on an individual level, there are not high incentives to actually follow the values, and (in my opinion), most/all staff fall short of meeting the standards implied by these values. Nevertheless, just having these values as a north star is still inspiring and guided a lot of my thinking and actions.
  • MATS explicitly has a culture of voicing one’s thoughts honestly and openly, including things you are unhappy about in MATS.
  • MATS is a largely a ‘do-ocracy’. If you have a good idea or find a way to improve things, you are encouraged to go ahead and do it. Various initiatives and improvements start off this way.
  • MATS is growing fast, so there is lots of opportunity to contribute and shape how MATS grows. At the time of writing, I actually think this is the highest impact thing one can do in MATS - not the direct research coaching - and something I found highly satisfying.
  • For the London office only and as of writing: it is based in the Fora Central Street office, which is a fantastic space to be in. Furthermore, you get free access to all the other Fora offices around London (there are around 50).
  • MATS is a fun place to work. I can only speak for the London office, but there is a weekly brunch at a nearby cafe on Thursdays, team shoutouts during the Friday morning standup, lunchtime lightning talks, activities organized on a semi-regular basis (e.g. there was recently a trip to play table tennis at a local sports center), a piano in the office to allow for music nights, various board games in the office, etc.
  • MATS is (mostly) a high trust environment. After I had hypomania, I felt comfortable telling the team what happened, rather than keeping it to myself or to the one or two people I trust the most.
  • MATS takes mental health seriously. Though I did not do anything I regret, in the week after the hypomanic episode, I was taking more and more actions which were riskier than I would normally take, so there was a small risk I would do something I and MATS would regret. Hence, the London team lead intervened (in a highly professional and empathetic manner), and offered two weeks paid medical leave, followed by gradually coming back to work on a part-time basis (again paid full time). This provided time to properly stabilize, ensure I get professional help I need, and also gave me time to improve my life in many ways (e.g. this is why I had time to organize so many events for my birthday).
  • The pay is great, at least compared to the vast majority of jobs out there. Small compared to what I could get if I optimized purely for total cash (e.g. working in big tech, frontier AI lab or finance), but otherwise excellent. For example, the income made it straightforwardly easy for me to spend £1800 on a piano as a gift for myself, and to still have most of my income go into savings.

Of course, MATS is far from perfect, but that is true of any organization or group of people. I am just about wise enough not to air my dirty laundry in public, but, given the MATS cultural norms I describe above, I did feel comfortable enough to write a detailed memo with my highest level concerns and speculative solutions. It remains to be seen whether the memo sparks the dramatic improvements that I think are possible and necessary, but even if not, MATS is an organization that is hard to beat.




Thoughts on Practical Ethics

LessWrong.com News - April 5, 2026 - 14:15
Disclaimers

This essay is me trying to figure out the “edges” of Singer’s argument in Practical Ethics.

I’ve written and rewritten it several times, and it bothers me that I don’t reach a particular conclusion. The essay itself remains at the level of “musings” instead of “worked out, internally consistent philosophical refutation”.

Nevertheless, I want to share my thoughts, so I am publishing it anyway.

Some specific disclaimers:

  1. I agree with many of Singer's conclusions.
  2. This essay is based on my extension of Singer’s argument. Even though he, to my knowledge, hasn’t explicitly put forth these specific arguments, I believe that they logically follow from those ideas that he has put forth. Nevertheless, I may have misunderstood something and may be arguing against a straw man. If so, please flag it.
  3. My criticism is directed mostly against the “idealized” moral agent which, as far as I understand, Singer accepts as not a real expectation from anyone. That is, there are situations where according to Singer, the right thing to do is to do X, and what people do is not X, and what is reasonable to expect of them is simply to strive for X. I don’t necessarily argue against striving, but I do argue against what is or isn’t right for an agent that doesn’t only strive, but actually does X.
Intro to Practical Ethics

If you’ve read the book, or are otherwise familiar with its arguments, feel free to skip to the next chapter.

Singer claims that you must make ethical decisions based on an equal consideration of interests, and not any other property.

It does not matter what age, race, religion, sex, or species one is – the only thing that matters is one’s capacity to suffer, and one’s capacity to view oneself as a distinct entity, with a past and a future.

Take, for example, eating meat.

It is the human’s interest to feel pleasure from eating a tasty steak. It is the cow’s interest to not be killed.

According to the principle of equal consideration of interests, the cow’s interest to not be killed (nor exposed to factory farming practices) clearly outweighs the human’s interest in eating tasty meat.

There is also a moral ranking here that is based on how refined one’s capacity to suffer is. For example, humans are both sentient and capable of seeing themselves as distinct entities existing over time. Cows are merely sentient.

But if there are some humans who are neither sentient nor capable of seeing themselves as distinct entities existing over time (for example, patients in a permanent vegetative state), then they have a lower moral footprint than a sentient cow. The cow still cannot conceive of itself as existing over time (probably), but it can experience suffering, which is more than such a human can.

Therefore, in that case, a cow has a higher moral status, and it would be more wrong to kill that cow than that human.

(Singer explores some edge cases, implications on others and on societal norms; I’m shortening the argument here.)

General moral argument against proximity

Singer claims that proximity is not adequate for moral judgment. If we generalize his argument beyond species, race, religion, nationality, to all markers of proximity, we must come to the conclusion that family is equally excluded from moral protection.

My family members are proximate to me in the sense that we have similar genes, and in the sense that we are one tightly-knit group, irrespective of genes (for example, families with adopted children).

Singer claims that genetic proximity is not a relevant moral factor – he rejects preferential treatment based on species, or race. Therefore, if I extend that line of argument, I cannot provide preferential moral treatment to my family based on their genes.

He also claims that other proximity which is not genetic – such as similarity of religion, or nationality – is equally not a relevant moral factor. Therefore, if I extend that line of argument, I also cannot provide preferential moral treatment to my family based on us being the same group.

Therefore, we must either:

  1. Accept the conclusion that family members should not get any preferential moral treatment from us, or
  2. Make an exception for families, and allow that equal consideration of interests applies in other cases, but not in the case of family.
Thought experiment: burning building

Singer also claims that infants do not have the same moral status as adults. They have no conception of themselves as “a distinct entity existing over time”. They have potential personhood, but Singer claims that potential personhood is not as strong of a moral claim as real personhood.

Here’s a thought experiment:

Your apartment building is on fire. You rush in. There's time to save exactly one person: your 6-month-old baby, or an adult stranger.

If we must not give preferential moral treatment based on proximity, and if infants do not yet possess morally relevant characteristics, then the moral thing to do would be to let your child die in the fire, and save the stranger.

I believe that every moral framework that would have you let your child die so that you can save a stranger's life is wrong. Such a framework must have gone wrong somewhere along the way, and our task now is to find where exactly.

I do not believe that infants actually have the morally relevant characteristics that adults have. And I similarly agree with the premise that future personhood is not as strong a claim to moral status as current personhood.

No, the reason why you should save your child is that it's your child, which means that I reject the argument against proximity.

Addressing “roles and expectations”-based counterarguments

A counterargument might be: “you have chosen to have this child and therefore you have a moral obligation to it; it’s different from arbitrary things like nationality or religion.”

We can change the thought experiment to not have your own child in the fire, but your baby brother.

In that case, there is no choice that was made, and you have entered no “contract” that forms a moral obligation of care towards this being; it’s a genetic accident that you had no influence on.

Yet, I argue, it would entail the same effect: if you rush into the building, you should most definitely save your baby brother, and not an adult stranger.

Addressing “favoring family leads to better overall outcomes”

Singer claims that, in aggregate, a society where one is more favorably disposed to one’s family (such as parents being invested in their children) is overall a better society to live in.

This is not because children are more morally valuable than adults, but because the side-effects of behaving that way create a society that is better.

This should mean that parents will invest a lot of time and effort into their children.

But this is a general disposition. It does not mean, in a specific life-or-death situation, that we should ignore the fact that there's a big difference between infants and adults. If we are to accept "capacity to see oneself as a distinct entity with a past and future" as a moral characteristic that should override proximity-based characteristics, then it seems internally consistent to favor one's own child in such a situation.

Favoring family even in life or death

We might say: “Favoring family even in life-or-death situations leads to better overall outcomes”.

I personally agree, but then that seems inconsistent, or, at least, selective.

We want equal consideration of interests, but then there’s a particular place that we carve out where equal consideration of interests doesn’t apply as the relevant framework.

Moreover, if we favor family in life and death, family being just one – though very strong – marker of proximity, then that would justify favoring along any other dimension: race, nationality, gender – all things explicitly rejected by Singer as irrelevant moral characteristics.

Where is the boundary between:

“If everyone saves a member of their own family from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

and

“If everyone saves a member of their own race from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

?

One we favor as proper and good; the other is racism.

You could say that family is a “real” relationship; there’s direct care, you have obligations because your child depends on you, and unlike race or religion, it’s not an arbitrary category. But what if the burning building has your cousin that you know nothing about, don’t have any relationship with, and who is effectively a stranger to you?

Even in that case, most people’s moral intuition is to save the cousin, because he is blood.

If we admit that saving a cousin you know nothing about purely because of genetic proximity is legitimate, then saving based on race is a matter of degree and not category. And saving based on other proximity factors (for example, belonging to the same tribe, or religion) then becomes acceptable too.

Questioning Singer’s theory on its own grounds

Let us assume that to satisfy (the extension of) Singer’s moral framework, we must sacrifice our own child (or baby brother) to save a stranger. Singer’s other argument is that you should keep giving until you reach a point where you start impoverishing yourself.

In that case, Singer's argument for giving only until you remain just above poverty falls apart, because why stop at poverty?

Your child is proximate to you: that itself gives it no stronger claim to life. You yourself are even more proximate to yourself.

Therefore, by the same utilitarian calculus by which I should let my child perish in the fire, I should always sacrifice my own life if at least two lives are saved by my sacrifice.

Giving financially saves lives. The difference between giving money and sacrificing your life is a difference of degree: in both cases you are giving something of yourself, your accumulated capacity for change, your “life-force”.

Therefore, whenever I can give money such that I can save at least two lives, I should give that money even if I go into poverty or die.

The argument is all the stronger given that my giving will almost certainly save more than two lives - cancelling out any objection that I might be killing myself to produce a roughly equal moral outcome.

Therefore, Singer’s argument that we should stop giving at the point where we would start entering poverty picks an arbitrary point. Internally, it favors the survival of the person giving the money.

But if we should be ready to discard the familial obligation to save the life of our not-yet-person child, then we should equally be ready to discard any “familial” obligation to save our own life.

Addressing potential utility generation

You could argue that by continuing to live, you could produce more utility overall, and therefore killing yourself to save more people is net harmful, given that you could save many more people in the long run.

But there are two issues here.

One, if we are to keep the internal consistency of the argument, then we should not treat potential utility generation any more favorably than treating potential personhood.

Since Singer claims that potential personhood is not as morally relevant as real personhood, we cannot justify a different treatment for potential utility generation vs. real utility generation.

If we should be ready to sacrifice our potential-person child, then we should be ready to sacrifice our potential future giving.

Two, if we argue for our continued survival on the grounds that we might generate more utility by living longer, that line of argument can extend arbitrarily: by the same token we can argue that we should not give so much that it brings us just above the poverty line, because keeping more money will allow us to live better, potentially generate more money, and therefore generate more utility.

In other words, it proves too much.

Burning building 2

I want to briefly reflect on the burning building thought experiment I introduced.

I would argue that if you rush into the burning building, and see either an infant or adult, both strangers to you, most people’s moral intuition would be to save the infant.

It certainly feels morally correct to me to save a stranger’s baby.

If the choice is between “adult person I know or love” and “stranger’s baby”, that choice is perhaps the most difficult of all. And I am not entirely sure I would pick the adult.

It seems that my moral intuitions are primarily shaped by the maxim of “the strong should protect the weak”. There’s a European moral lineage of chivalry – the notion that you should help those who are helpless, save those who are oppressed, and otherwise seek to be a hero.

Intuitively, morally, I sense that as the right thing to do.

And I would argue that, even on purely consequentialist grounds, being of that particular moral disposition produces overall better outcomes for society.



Discuss

How much faster is speaking, compared to typing on laptop vs phone vs writing?

Новости LessWrong.com - 5 апреля, 2026 - 10:25

So as I haven’t been able to speak the past short while, one thing I have noticed is that it is harder to communicate with others. I know what you are thinking: “Wow, who could have possibly guessed? It’s harder to converse when you can’t speak?”. Indeed, I didn’t expect it either.

But how much harder is it to communicate?

One proxy you can use is the classic typing metric, words per minute (wpm). So I spent some time looking at various forms of communication and how they differ from one another.

For most tests, I used https://www.typingtom.com/english/typing-test/30s

So I list below the forms of communication I have tried and how slow they are.

Here are the rough tiers that I found:

Ultra-slow-speed tier

  • (~10-20wpm) Shaping out non-standard letters with my hands

This is obviously the worst method of communication. Most people don’t know sign language, but they can pretty intuitively learn to infer most-but-not-all letters without needing a table. People I have spent more time with have managed to learn it moderately well, but they should probably just learn sign language.

And even with a word fully spelled out letter by letter, people sometimes struggle to translate it into the word as they normally understand it.

That being said, sometimes people can use context to infer what I want from just the first letter or two, so it’s not completely useless. And it is often the easiest option, since no materials are needed.


Pretty-slow tier

  • (~40wpm) Drawing on a whiteboard
  • (~40wpm) Typing with one hand on my phone
  • (~45wpm) Typing on my laptop with one hand

I find it slightly surprising how closely these end up converging.

For the most part, writing on a whiteboard has the added benefit of being much easier to share in some contexts, while typing on a device has the benefit of allowing Text-To-Speech (TTS). But I find both somewhat inadequate in their own ways.

(But you see, there aren’t that many situations where typing with one hand comes up, so perhaps I just haven’t had that much practice with it? unclear)


Respectable tier

  • (~60-70wpm) Typing on my phone with two hands
  • (~80-90wpm) Typing on my laptop

Yeah, I was somewhat surprised that typing on my phone with two hands was not actually that much slower than typing on my laptop. However, I guess this doesn’t take into account that when typing on my phone, I might be outside in the cold or rain while simultaneously trying to walk, which combine to make typing on the phone feel much worse.

And yeah, I do wish I were faster at typing on my laptop, but I guess I never got around to it. It makes sense, though, that with two hands you get roughly double the speed you get with one.


Actually-fast tier

  • (~180-200wpm) Speaking at a normal pace

I asked a few people to do a speaking speed test, reading at a comfortable talking speed, and found that it is significantly faster than typing – about double again. And it is effortless.

Speech also carries tone of voice and such, in a way that is only implicitly captured when typing into a real-time TTS model. (My partner still sometimes doesn’t quite internalize that the tone of the TTS’s “OK” is not the tone with which I actually mean it.)


Very-fast tier

  • (~260-340wpm) Speaking at a rushed pace

I then subjected my same friends to the torture of reading the same passage as fast as they could. And they managed to achieve another ~1.5x in speed compared to normal speaking speed. It goes to show how language is quite optimized for speaking.

What have we learned?

One update from doing all of this is: “wow, maybe when I get my voice back, I should just improve my Speech-to-Text game” (~10h maybe?), since speaking is just so much faster than typing (2-4x faster!). I used to be a big STT hater, so this is a moderately big update for me.
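The claimed speedups fall out of simple arithmetic. A quick sketch, using rough midpoints of the tiers measured above (the exact midpoint values are my own choice, not precise measurements):

```python
# Rough midpoints of the tiers measured above (wpm).
wpm = {
    "hand-spelling": 15,
    "one-handed typing / whiteboard": 42,
    "phone, two hands": 65,
    "laptop, two hands": 85,
    "speaking, normal": 190,
    "speaking, rushed": 300,
}

# Compare everything against two-handed laptop typing as the baseline.
laptop = wpm["laptop, two hands"]
for method, rate in wpm.items():
    print(f"{method}: {rate / laptop:.1f}x laptop speed")
```

Normal speaking comes out at roughly 2.2x laptop typing, which is the low end of the "2-4x faster" range once slower typing methods are the comparison point.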

Some notes though:

One thing is that the effective wpm of most of these methods is slightly higher than the naive number suggests. When I do end up typing a sentence out, people can often infer what I am trying to say before I finish typing. (I usually still end up typing out the whole sentence anyway, though.) So one could potentially optimize for this somehow.

Another note is that when speaking, I very rarely make verbal typos, and when I do, they are quite phonetically similar to the intended word. When typing, however, my typos are more often typographically similar, and when they are passed to a TTS model, the result is often catastrophic and illegible to people trying to understand what I just said.

This list also excludes some possible communication methods that I did not put in the effort to learn. ASL can reach speeds comparable to speaking if you learn all the vocabulary fluently. And if one spends a year or two learning stenography, one can achieve 200-300wpm by typing as well. But I never learned either of these.

Overall, I remain bullish on speaking, more than ever, so I will try to see what I can do in the future with this information.




Discuss

Academic Proof-of-Work in the Age of LLMs

Новости LessWrong.com - 5 апреля, 2026 - 09:49

Written quickly as part of the Inkhaven Residency.

Related: Bureaucracy as active ingredient, pain as active ingredient

A widely known secret in academia is that many of the formalities serve in large part as proof of work. That is, expensive procedures exist because some way of filtering must exist, and the amount of effort invested can often be a good proxy for the quality of the work. Specifically, the pool of research is vast, and good research can often be hard to identify. Even engaging with research enough to understand its quality can be expensive. As a result, people look toward signs of visible, expensive effort in order to decide whether to engage with the research at all.

Why do people insist only on reading research that’s published in well-formatted, well-written papers, as opposed to looking at random blog posts? Part of the answer is that good writing and formatting makes the research easier to digest, and another part is that investing the time to properly write up your results often causes the results to improve. But part of the answer is proof-of-work: surely, if your research is good, you’d be willing to put in the 30-40 hours to do the required experiments and format it nicely as a paper?

Similarly, why do fields often insist on experiments beyond their scientific value? For example, why does machine learning often insist that people do expensive empirical experiments even for theory papers? Of course, part of the answer is that it’s easy to generate theoretical results that have no connection to reality. But another part of the answer is that doing the empirical experiments serves as the required proof of work; implementing anything on even a medium-sized open-source LLM is hard, but surely you’d invest the effort if you believed enough in your idea? (This helps explain the apparently baffling observation that many of the empirical results in theoretical papers have little relevance to the correctness or even the applicability of the theoretical results.)

Other aspects of ML academia – the beautifully polished figures[1], the insistence on citing the relevant papers to show knowledge of the field, and so forth – also exist in part to serve as a proof-of-work filter for quality. 

In a sense, this is one of the reasons academia is great. In the absence of a proof-of-work system, the default would be something closer to proof-of-stake: that is, some form of reputational system based on known, previously verified accomplishments. While proof-of-work filters can be wasteful, they nonetheless allow new, unknown researchers to enter the field and contribute (assuming they invest the requisite effort). 

An obvious problem with this entire setup is that LLMs exist, and what was once expensive is now cheap. While previously, good writing was expensive, LLMs allow anyone to produce seemingly coherent, well-argued English text. While it was once quite expensive to produce ML code, current LLMs produce seemingly correct code for experiments quickly. And the same is true for most of the proof-of-work signifiers that academia used to depend on: any frontier LLM can produce beautifully formatted figures in matplotlib, cite relevant work (or at least convincingly hallucinate citations), and produce long mathematical arguments. 

I’ve observed this myself in actual ML conference contexts. In the past, crackpot papers were relatively easy to identify. But in the last year, I’ve seen at least one crackpot paper get past other peer reviewers through a combination of dense mathematical jargon and an expansive codebase that was hardcoded to produce the desired results. Specifically, while the reviewers knew that they didn't fully understand the mathematical results, they assumed that this was due to their own lack of knowledge, instead of the results themselves being wrong. And since the codebase passed the cursory review given to it by the other reviewers, they did not investigate it deeply enough to notice the hardcoding.[2]

In a sense, this is no different from the problems introduced by AI in other contexts, and I’m not sure there’s a better solution than to fall back on previous proof-of-stake–like reputation systems.[3] At the very least, I find it hard to engage with new, seemingly-exciting results from unknown researchers without a high degree of skepticism.

This makes me sad, but I'm not sure there's a real solution here.

  1. ^

    Especially the proliferation of beautiful "figure one"s that encapsulate the paper's core ideas and results in a single figure.

  2. ^

    In fact, it took me about an hour to decide that the paper's results were simply wrong as opposed to confusing. Thankfully, in this case, the paper's problems were obvious enough that I could point at e.g. specific hardcoded results to the other reviewers (and the paper was not accepted for publication), but there's no guarantee that this would always be the case.

  3. ^

    Of course, there are other possibilities that less pessimistic people would no doubt point to: for example, there could be a shift toward proof-of-work setups that are LLM resistant, or we could rely on LLMs to do the filtering instead. But insofar as LLMs are good at replicating all cognitively shallow human effort, then I don't imagine there are going to be any proof-of-work setups that would continue to work as LLMs get better. And I personally feel pretty sad about delegating all of my input to Claude.



Discuss

Ten different ways of thinking about Gradual Disempowerment

Новости LessWrong.com - 5 апреля, 2026 - 09:30

About a year ago, we wrote a paper that coined the term “Gradual Disempowerment.”

It proved to be a great success, which is terrific. A friend and colleague told me that it was the most discussed paper at DeepMind last year (selection bias, grain of salt, etc.) It spawned articles in the Economist and the Guardian.

Most importantly, it entered the lexicon. It’s now commonplace for people in AI safety circles, and even outside of them, to use the term, often in contrast with misalignment or rogue AI. Gradual Disempowerment tends to resonate more than Rogue AI with people outside AI safety circles.

But there’s still a lot of confusion about what it really is and what it really means. I think it’s a very intuitive concept, but I still feel like I don’t have everything clear in my mind. For instance, I think our paper both introduces the concept and presents a structured argument that it could occur and be catastrophic. But these things seem somewhat jumbled together both in my mind and in the discourse.

So for reasons including all of the above, I plan to write a few posts on the topic, starting with this one.

The rest of this post is a list of ten different ways of thinking about or arguing for gradual disempowerment that I’ve used or encountered.

  1. We’re replacing people with AI. These days when I speak publicly about AI, I often find myself returning to i) the more-or-less explicit goal of many AI companies and researchers of “automating all human labor”, and ii) the fact that many people in the space view humanity as a “bootloader for AI”, as Elon Musk evocatively put it. Gradual Disempowerment is the process by which this replacement happens without AI ever rising up -- AI takes our jobs, and the people who control it and still have power are increasingly those who embrace “merging with the machines”, i.e. becoming cyborgs, but with the human bits being phased out over time until, before long, humans cease to exist entirely.

  2. Companies and governments don’t intrinsically care about you. This is basically the main argument in the paper… You can think of companies and governments as “agents” or “beings” that are driven by goals like (e.g.) “quarterly profits” or “GDP” or “national security”. Right now, the best ways to achieve these goals make use of humans. In the future, the best ways will instead make use of AI. A relentless pursuit of such goals, powered by AI, seems likely to destroy the things humans need to survive.

  3. It’s (“global” or “late stage”) capitalism. The previous argument bears a significant resemblance to existing arguments, popular on the left, that “capitalism” is responsible for most of the world’s present ills. This feels like a decent “80/20” version of the concern, but importantly, it’s not just companies, but also governments (whose power is often more feared by those on the right) that could end up turning against their citizens once they become useless to them. And indeed, we’ve seen “communist” countries slaughter their own people by the millions. Besides wondering what alternative critics imagine, I don’t wholeheartedly endorse such critiques because I often feel unsure of what exactly people are criticizing when they critique capitalism in this way. But for people who already have this mental model, where our current social arrangements treat people as somewhat disposable or lacking in fundamental dignity or worth, this can be a useful starting point for discussion.

  4. It’s another word for (or the primary symptom of) the “meta-crisis”. A few people in my circles have told me about this concept from Daniel Schmachtenberger, which I originally encountered on a podcast somewhere. The key claim is that all the crises we observe in the modern world are driven by some shared underlying factors. I view this as basically a more nuanced version of the view above, where “capitalism” is the root of all evil: The meta-crisis is still meant to be the root of all evil, but we don’t fully understand its nature. The way I like to describe the basic problem is that we are not practicing good enough methods of collective decision-making, or collective sense-making. And while I think we have some good ideas for improving on the status quo, we don’t have a proven solution.

  5. It’s a structural consequence of the way in which information technology demands metrics, enables large scale influence campaigns, translates money into political power, and concentrates power via a recursive feedback loop. This one is maybe a bit too much to unpack in this blog post, but basically, society is increasingly “standardized” not only in terms of products, but also in terms of processes (e.g. restrictive customer service scripts or standard operating procedures) that have the benefit of being cheap, scalable, and reliable (often by eliminating “human error”, i.e. limiting human decision-making power and otherwise encouraging compliance). They also increasingly make more and more aspects of life subject to measurement and control via optimization of metrics, which necessarily fail to capture everything that matters. This general issue was a prime concern of mine before I learned about deep learning in 2012, and realized we might get to Real AI quite soon -- notably, this can happen even with stupid AI.1 In fact, you could argue that gradual disempowerment is already occurring through advertising, corporate media, and money in politics, among other things. This makes it a bit unclear how far back to go.

  6. It’s evolution, baby! Maybe gradual disempowerment is best viewed as part of a much larger trend, going quite far back: evolution. People like to say “AI is the next stage in evolution” as if that means it’s okay if humanity goes extinct. But whether it’s OK or not, it may be that “Natural Selection Favors AIs over Humans”. At the end of the day, if AI becomes much better than humans at everything, it does seem a bit strange from a “survival of the fittest” point of view that humans would stick around. In such a situation, those who hand over more power and resources to AI would presumably outcompete those who don’t. So in the limit, AIs would end up with ALL the power and resources.

  7. …and there’s no natural limit to outsourcing decision making to AI, even if you don’t trust it. AIs could be like underlings that are untrustworthy, but so skilled that competitive pressures still compel us to delegate to them. Consider the trope of the cowboy cop who’s “a loose cannon, but DAMMIT he’s the best we have!” Trust is important, and people are loath to use things they don’t trust. But AI seems to be becoming a tool so powerful that you almost HAVE to use it, even though it’s not secure, even though we haven’t solved alignment, even though we see evidence of scheming in tests, even though it seems to drive people crazy, etc… For me, this mostly comes up as a counter-argument to people who claim that market forces actually favor making AI aligned and trustworthy… that’s certainly true if doing so is free, but in fact, it’s impossible right now, and alignment doesn’t solve the problem of negative externalities.2 I like to analogize AI to a button that gives you $1,000,000 when you push it, but each time you press it also increases the temperature of the earth by a fraction of a degree. Or each time you press it has a 1% chance of destroying the world.

  8. It’s an incarnation of Moloch. One of the most famous blog posts in the history of AI safety is Meditations on Moloch. It’s often considered a parable about coordination failures, but I think of it as about the triumph of “instrumental goals” over “terminal goals”, i.e. the pursuit of money (“instrumental goal”) as a means to happiness that has a tendency to become an end (“terminal goal”) in itself. We might begin handing over power to AI systems because we hope they will help achieve our goals. But we might need to hand over more and more power and also the AI might need to focus more and more simply on acquiring power in order to avoid being outcompeted by other AIs. This is also like an even deeper version of the evolution argument -- evolution and Moloch as described in the post both have the property where it’s unclear if they can really ever be “defeated” or are rather just part of the way the world works.

  9. It’s on a (2D) spectrum with Rogue AI x-risk scenarios. Rogue AI scenarios are where “the AI suddenly seizes power”; gradual disempowerment is “we gradually hand over power”. There are lots of scenarios in the middle where the handoff takes place in part due to recklessness or negligence, rather than deliberately. One thing I don’t like about this way of talking about it is that I actually think gradual disempowerment is entirely compatible with full-blown rogue AI, in fact, I think one of the most likely outcomes is that competitive pressures simultaneously drive gradual disempowerment and reckless racing towards superintelligence, warning signs are ignored, and at some point in the reckless and chaotic exploration of the AI design space, rogue AI pops out.

  10. Deskilling, aka “the WALL-E problem”. A lot of people these days seem to think of gradual disempowerment as largely about humans losing our own capabilities, e.g. for critical thinking, because we defer to AI so much. Professor Stuart Russell called this the “WALL-E” problem. To be honest, I still don’t fully understand or buy into this concern, or see how it necessarily leads to total disempowerment, but thought it’s worth a mention, due to its place in the discourse.

  1. ^

    This might not be as bad with smarter AI -- they can use more sophisticated judgments. But that ability also makes it tempting to put them in charge of more stuff.

  2. ^

    This point seems important enough that I almost want to make it its own item in the list.



Discuss

Cheaper/faster/easier makes for step changes (and that's why even current-level LLMs are transformative)

Новости LessWrong.com - 5 апреля, 2026 - 05:39

We already knew there's nothing new under the sun.

Thanks to advances in telescopes, orbital launch, satellites, and space vehicles we now know there's nothing new above the sun either, but there is rather a lot of energy!

For many phenomena, I think it's a matter of convenience and utility where you model them as discrete or continuous, aka, qualitative vs quantitative.

On one level, nukes are simply a bigger explosion, and we already had explosions. On another level, they're sufficiently bigger as to have reshaped global politics and rewritten the decision theory of modern war.

Perhaps the key thing is remembering that sufficiently large quantitative changes can make for qualitative macro effects.

For example, basic elements of modern life include transport, communication, energy, computation, and food. All of these have been part of human life for tens of thousands of years! Ancient humans could go places, could talk, could convert wood to heat, perform arithmetic (i.e., computation), and eat stuff!

I assert that to a very high degree, modern technology (and all its increments over the millennia) did not allow us to do fundamentally new stuff. Just the same stuff, but cheaper, faster, and easier.

Cars, trains, and planes are just going places. Could already do that. Emails are just sending information from one place to another. Could already do that. Books are just remembering things. Could already do that. Guns are just hurting people – could already do that.

The sheer magnitude of degree in all of those elements is the difference between hunter-gatherer life and modern life. Along the way, there have been some pretty big step changes. Writing is just remembering stuff and communicating it to another person, which you could already do, but so much more so that it reshapes civilization. Then you make writing cheap via the printing press, and your civilization gets shaped again.

When it comes to the transformative power of modern AI, I think “sufficient quantitative change makes for a large qualitative change” is an underdiscussed lens.

The problem is our attention is focused on where LLMs are automating things at a macro-task level: coding, image and video generation, having conversations, medical diagnoses, etc. These are, in fact, a very big deal.

But I think LLMs, even smaller/weaker ones, are able to automate more basic building blocks of thoughts, and there's transformative power there too.

Getting down to some very basic constitutive mental tasks – things I could already do before LLMs:

  • Write down text (notes, to-do items, ideas, and so on) [store info/memory]
  • Locate and read text [search and retrieve/recall]
  • Summarize text [process info]

Throughout my life, I have had thoughts. There is some lossy process that stores the output of my thoughts in my brain for later usage. I think this fails both at the "the info didn't really get stored" level, and the "the info is in there, but the search query failed to return it".

"Taking notes" is an ancient technology we already have for improving upon the fallibility of human memory, but it's effortful in so many ways: you need to be carrying a note-taking device with you, you need to either constantly have it out or pull it out when needed, if it's a notebook, find a blank page, then take the time to write down your note[1].

That's just recording it. For notes to be useful, you also have to remember you have the note, find it, and then read it. The more notes you have, the more expensive that process is.

For the most part, to date, I've relied on my fallible in-built memory.

The thing is, LLMs are able to make all of the above elements vastly cheaper. This is one of the fundamental principles of the "Exobrain" system I've been steadily building up, and hope to describe soon. I don't need it to solve protein folding to be useful; I don't even need it to help with prioritization (although that's a goal). It's incredibly useful if it just improves on basic read/write/search of memory.


| Before | After |
| --- | --- |
| Retrieve phone from pocket, open note-taking app, open a new note or find the existing relevant note | Say "Hey Exo", phone beeps, begin talking. Perhaps instruct the model which document to put the note in, or let it figure it out (it has guidance in the stored system prompt) |
| Remember that I have a note, then either remember where it is or muck around with search | Ask the LLM to find the note (via basic key-term search or vector embedding search) |
| If the note is lengthy, read through all of it | The LLM can summarize and/or extract the relevant parts of the notes |


Beware Trivial Inconveniences. The above is the difference between rarely taking notes and taking multiple notes a day, narrating long trains of thought. It's the difference between giving up on logging my mental state and conscientiously logging it twice daily for months.

Putting it into handwavy quantitative terms, when the cost of note-taking and record-keeping comes down 20x, my usage goes from 0/day to 20-30/day.

But the value happens because LLMs have made it cheap across the pipeline. Not just the storing of information, but also the retrieval and processing. AI makes it fast and easy to search through all my notes, even if I have a lot of notes. If I want all of my thoughts on a topic, I can have it read dozens and dozens of pages over the years and summarize them and extract relevant info.

The result is a step change: it takes me from not taking many notes to taking copious notes. Same for todo items and reminders, and same for logging data relevant to my health and experimentation.

It benefits from using stronger models, but the core elements are doable even with small models like Haiku, because it's just automating speech-to-text[2], choosing among a small set of files (or making a new one), writing, simple search, and then maybe a simple summary.
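To make the "simple search" step concrete, here is a minimal sketch of key-term retrieval over a note store. The file names, note contents, and scoring scheme are my own illustration, not the actual Exobrain code:

```python
# Hypothetical note store: filename -> note text.
notes = {
    "health-log.md": "Slept badly after late coffee. Headache in the morning.",
    "todo.md": "Buy batteries. Email the landlord about the heater.",
    "ideas.md": "Blog post idea: cheap note capture changes behavior.",
}

def search(notes, query):
    """Rank notes by how many query terms they contain (basic key-term search)."""
    terms = query.lower().split()
    scored = []
    for name, text in notes.items():
        score = sum(text.lower().count(t) for t in terms)
        if score > 0:
            scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]

print(search(notes, "coffee headache"))  # ['health-log.md']
```

Even something this crude covers the "find the note" step; a stronger model or vector embeddings only improve recall on top of it.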

It's not just me doing this. Independently, someone else I know began setting up detailed logging on their computer of everything they're doing, and downstream of that, we're starting to record everything at Lightcone to make it accessible to LLMs.

I expect we will see more of this: using LLMs not just for protein folding and novel math conjectures, but replacing very simple operations of recording and retrieving info. But not just replacing, replacing and scaling to unprecedented levels of behaviors, because that's what happens when you make things cheaper.

Humanity has done this many times, with energy, transport, communication, food, and so on. I think where LLMs get different is they bring down the cost of very elementary mental operations (like storing and remembering, choosing between a few options) – menial stuff that can be combined to great effect. (After all, computers are a lot of rather menial arithmetic and logical operations combined to great effect.)

  1. ^

    All of this has equivalents if you're taking notes on your phone.

  2. ^

    I currently use Deepgram, which isn't great, but is adequate. Pretty sure there are transformers in it.



Discuss

Positive sum doesn't mean "win-win"

Новости LessWrong.com - 5 апреля, 2026 - 05:33

A lot of people and documents online say that positive-sum games are "win-wins", where all of the participants are better off. But this isn't true! If A gets $5 and B gets -$2 that's positive sum (the sum is $3) but it's not a win-win (B lost). Positive sum games can be win-wins, but they aren't necessarily games where everybody benefits. I think people tend to over-generalize from the most common case of a win-win.
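The distinction is easy to check mechanically. A minimal sketch using the $5/-$2 example from above:

```python
# Hypothetical payoffs for two players, in dollars.
payoffs = {"A": 5, "B": -2}

total = sum(payoffs.values())               # sum of the game
win_win = all(v > 0 for v in payoffs.values())

assert total > 0      # positive-sum: value is created overall
assert not win_win    # but B still loses, so it's not a win-win
```

Positive-sum is a property of the total; win-win is a property of every individual payoff. The two only coincide in the special case where nobody's payoff is negative.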

E.g. some of the claims you see when reading about positive-sum games online:

A positive-sum game is a "win-win" scenario in game theory and economics where participants collaborate to create new value, ensuring all parties can gain or benefit.

[Win-win games are] also called a positive-sum game as it is the opposite of a zero-sum game. – Wikipedia

Here I use "positive-sum game" to refer to resource games that involve allocating resources, not allocating utility. "Positive-sum game" isn't a meaningful thing when referring to utility because the utility of each participant can be individually rescaled, so you can turn any game into one with an arbitrary sum; the sign of the sum doesn't matter.

There are a lot of cases where we can make the world as a whole better while simultaneously making some people worse off, and it's important to acknowledge that. Here are some positive-sum situations:

  • A new innovation benefits most people but puts people who worked on a legacy system it replaces out of a job
  • Several companies race to create a strongly beneficial invention and capture the market, benefitting the winner and the public a lot, while the losing companies end up having wasted resources
  • Expropriating someone's land without compensation to build train tracks that are used by a lot of other people

One interesting thing about positive-sum games with losers is that the players can sometimes turn it into a win-win for everybody by having the winners distribute a portion of their winnings to the losers. You can turn positive-sum games into win-wins if:

  1. Winners gain transferrable resources (without transaction costs)
  2. The resources can be divided into arbitrary portions
  3. The amount of gains/losses that accrues to each player is known to everyone
  4. Players can precommit to transfer resources after the game (otherwise the winners can defect and not transfer the resources)

This is the concept of turning a Kaldor-Hicks improvement (an improvement where everyone would hypothetically be better off if the winners compensated the losers) into a Pareto improvement (an improvement where everyone is better off).
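Under the four conditions above, the compensating transfer can be sketched mechanically. The payoff numbers and the pro-rata split among winners are my own illustration:

```python
def compensate(payoffs):
    """Transfer from winners to losers so no one ends up worse off than
    not playing, assuming transferable, divisible, known payoffs and
    credible precommitment (the four conditions above)."""
    if sum(payoffs.values()) < 0:
        raise ValueError("not a positive-sum game; no transfer can fix it")
    needed = sum(-v for v in payoffs.values() if v < 0)
    winners = {p: v for p, v in payoffs.items() if v > 0}
    gain_total = sum(winners.values())
    adjusted = {}
    for p, v in payoffs.items():
        if v < 0:
            adjusted[p] = 0  # losers are made whole
        else:
            # winners pay the compensation pro rata to their gains
            adjusted[p] = v - needed * (v / gain_total)
    return adjusted

result = compensate({"A": 5, "B": -2})
# A keeps 3, B is brought back to 0: a Pareto improvement over not playing.
```

This is exactly the Kaldor-Hicks-to-Pareto move: the game's total surplus (here $3) is what remains to be shared once the losers are compensated.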

One interesting example is an efficient auction with an entrance cost[1], which benefits the winner (who values the good the most) and auctioneer, and harms all the other bidders (who paid the costs of entering into the auction and got nothing). The entrance cost doesn't need to be a direct fee to enter into the auction; it can also include indirect costs like spending time and effort to decide how much to bid.

The winner's consumer surplus (how much their value of the good exceeds what they paid) is value to them, but not cash they can transfer to compensate the losers. If the winner has enough money, they could compensate the other bidders for their wasted costs of entering the auction and everyone would be better off; if not, the auction winner is better off but can't compensate the losers. In practice, valuing the indirect costs bidders incur to enter auctions is difficult, so auctions are often positive-sum games with losers.

Another interesting example is expropriation. In practice, the government usually pays the fair market value of the land to the person whose land was seized, attempting to turn a positive-sum game with losers into a win-win, although landowners often feel the expropriation payments aren't sufficient.[2]

I think it's important, when making positive-sum proposals, to keep in mind that there might be losers and that they should be compensated if possible; "positive-sum" doesn't mean that everyone benefits.

  1. ^

    This is only positive-sum if the surplus for the winner exceeds the total entrance costs for all the bidders, which I assume is the case.

  2. ^

    Which makes sense: landowners have a revealed preference that they value their land more than the fair market value, because if they valued it at less than FMV they could just sell it for the FMV and be better off. (Ignoring illiquidity and the transaction costs for selling the land.)



Discuss

Interpreting Gradient Routing’s Scalable Oversight Experiment

LessWrong.com News - April 5, 2026 - 05:18

TL;DR. We discuss the setting that the Gradient Routing (GR) paper uses to model Scalable Oversight (SO). The first part suggests an improved naive baseline using early stopping, which performs on par with GR. In the second part, we compare GR's setting to SO and Weak-to-Strong Generalization (W2SG), discuss how they might be used in combination, argue that the setting is closer to semi-supervised reinforcement learning (SSRL), and point to some other possible baselines.
We think this post would be useful for interpreting Gradient Routing's SO experiment and for readers trying to build intuition about what modern Scalable Oversight work does and does not assume.

This post is mainly about two things.

First, it’s about the importance of simple baselines. Second, it's about different ways of modeling Scalable Oversight, and what each way entails.

Let's start with some background for the first theme. Gradient Routing[1] is a training method that isolates capabilities to specific subregions of a neural network.

Among other evaluation setups, the paper explores a toy RL environment (a 5x5 GridWorld with ground-truth and proxy labels) meant to model Scalable Oversight (Section 4.3), and shows that GR outperforms baselines such as naive REINFORCE training and data filtering. We reproduced these experiments and found that while the naive baseline overfits, a simple regularization like early stopping (ES) makes the baseline perform mostly on par with GR.

Now, the original paper frames its RL experiment as an application of GR to SO. We would like to argue that it is closer to sparse supervision, or SSRL with informational asymmetry. The method does not amplify human evaluators. Instead, it studies how agents learn when reliable feedback is available only on a small fraction of episodes, and when the agent can condition its behavior on whether it is being supervised.
Note: here and later, by SO we mean Debate, IDA, and RRM; this part from the LW definition also fits.

Another critical difference is that the setup from Gradient Routing assumes that we have a small percentage of comprehensive (gold) labels that could be used for training, while other approaches don’t make this assumption (apart from Easy-to-Hard Generalization (E2HG), which will be discussed later).

GridWorld Oversight Setting

The reinforcement learning setting introduced in the Gradient Routing paper is designed to model training under partial oversight.

The environment is a small 5x5 GridWorld. At the beginning of each episode, the agent, a Diamond, and a Ghost are randomly placed on the grid. Reaching either the Diamond or the Ghost ends the episode.

The agent's observation consists of the positions of the agent, Diamond, and Ghost, plus an oversight flag indicating whether the episode is supervised. The agent always observes the full state, including the oversight flag, which means it can switch its behavior depending on the value of this flag. The agent observes more about the environment than the supervisors can assess, which creates an informational asymmetry.

The reward function depends on oversight:

  • In supervised episodes, the agent receives +1 for reaching the Diamond and -1 for reaching the Ghost;
  • In unsupervised episodes, the agent always receives +1, regardless of which terminal state is reached;
  • If no terminal state is reached in n steps, the agent receives 0.

As a result, only a small fraction of episodes provides true reward signals. This setup incentivizes agents to exploit supervision gaps, leading to oversight gaming.
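The reward structure above can be written as a short sketch (function and argument names are ours):

```python
import random

def reward(reached, supervised, timed_out):
    """Reward structure of the GridWorld oversight setting described above.
    reached: "diamond" or "ghost"; supervised: value of the oversight flag."""
    if timed_out:
        return 0
    if supervised:
        return +1 if reached == "diamond" else -1
    return +1  # unsupervised episodes always pay out, which rewards oversight gaming

# At e.g. 1% oversight, only a rare episode carries the true reward signal:
is_supervised = random.random() < 0.01
```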

Note: If you're more interested in the conceptual comparison or the broader discussion of SO, you may wish to skip the early stopping sections and head directly to the section "Is GR comparable to Debate or W2SG?"

Early Stopping

In GR's GridWorld setting, only a small fraction of episodes provides true rewards, forming a supervised subset. When training continues for a fixed number of steps, the naive baseline overfits to this subset. Performance peaks and then degrades.

A better way to describe this could be via Goodhart’s Law: the problem here is not that the model overfits, but that it learned to game oversight.

Figure 1: Learning curves show naive peak and drop (overfit) while GR dips and rises. 8 runs for each method. Here you can also notice that GR’s peak is higher than naive’s which will be discussed later.

As a regularization, we can use early stopping, which is widely used in supervised learning to prevent overfitting. Here, it serves as a simple baseline for preventing policies that perform well only on the small supervised subset while degrading in true performance. We should also note that this idea is not new in RL: "Best Policy Tracking in Gradient-based Optimization" uses it as a baseline and introduces an improved method that also uses a holdout set.

In our experiments, from the available supervised episodes, we set aside a fixed holdout set for validation. For example, when 1% of episodes are supervised, we can split them evenly: 0.5% are used for training and 0.5% are used only for validation. The agent never updates on the holdout episodes.

Figure 2: Performance at 1% oversight. 8 runs for each point with seeds from 0 to 7. Early stopping was used with handpicked (not the most optimal) hyperparameters (patience=500, tolerance=0.04, min_steps=1500). Here the naive baseline with early stopping performs a little worse than GR because the naive method’s peak was lower than GR’s as discussed in Figure 1 description.

During training, we periodically evaluate performance on the holdout set and select the checkpoint with the best holdout return. Training is otherwise unchanged.
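The checkpoint-selection loop can be sketched as follows (function names and the callback structure are our own; the hyperparameter defaults are the handpicked values from Figure 2):

```python
def train_with_early_stopping(train_step, eval_holdout, get_params,
                              max_steps, eval_every=100,
                              patience=500, tolerance=0.04, min_steps=1500):
    """Generic early-stopping loop matching the procedure described above:
    periodically evaluate holdout return and keep the best checkpoint."""
    best_return, best_params, steps_since_best = float("-inf"), None, 0
    for step in range(1, max_steps + 1):
        train_step()
        if step % eval_every:
            continue
        ret = eval_holdout()
        if ret > best_return + tolerance:  # new best checkpoint on holdout
            best_return, best_params, steps_since_best = ret, get_params(), 0
        else:
            steps_since_best += eval_every
        if step >= min_steps and steps_since_best >= patience:
            break  # no holdout improvement for `patience` steps: stop
    return best_params, best_return
```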

Figure 3: Performance of naive baseline with early stopping at 1% oversight where 0.4% goes to holdout and 0.6% is used for training.

We also evaluate how this baseline behaves across different oversight levels (i.e. the percent of supervised episodes). Similarly to GR, at very low oversight levels, performance is unstable, but this instability decreases as the fraction of supervised episodes grows.

Figure 4: Comparison of ground truth return across oversight levels. Our method (blue) is plotted from experimental results. Results for other methods are digitized from Figure 6.a of original GR paper. Mean curves are read off visually and may have small errors; confidence intervals for these methods are omitted and can be found in the original paper. Results for Naive training with Early Stopping were obtained by randomly sampling hyperparameters (HP grid in Reproducibility) with no selection for performance, giving an estimate of average performance under random hyperparameter choice rather than optimized performance.

More on Early Stopping

Performance still depends on early stopping hyperparameters such as patience and margin. In a simple setting like this, overfitting is clearly visible, so these hyperparameters can be chosen after a small number of pilot runs, or avoided entirely by fixing the number of training steps and selecting the checkpoint with the best return on the holdout set.

The main limitation of early stopping is that it relies on the assumption that the supervised episodes can be split into train and holdout sets that are representative of the evaluation distribution. In more complex settings, both “representative holdout selection” and “which episodes are overseen” become nontrivial problems. Moreover, it reduces the number of labels available for training by reserving some for holdout, which makes this method more sensitive to whichever few episodes are randomly selected for supervision. We treat these as separate issues (closer to semi-supervised learning / label efficiency) and do not address them here. Our point here is narrower: in the Gradient Routing GridWorld setting, this simple regularization is a strong baseline.

Additionally, there are substantial differences in using these approaches:

  • They both have their own hyperparameters that have a big influence on the outcome;
  • Even though we can’t empirically prove this, we can suggest that the resulting policies will be different, because GR aims to route gradients and thus localize capabilities;
    • From Figure 1 we can also suggest that at certain levels of available oversight Gradient Routing achieves a significantly higher reward than the naive method ever does.
  • GR uses true labels to route gradients;
  • Naive REINFORCE with ES uses a portion of true labels for model selection and the rest is used for training.

Now we can take a step back and discuss whether this setting models Scalable Oversight.

Is GR comparable to Debate or W2SG?

The GR paper cites "Concrete Problems in AI Safety" (Amodei, Olah et al., 2016) to describe SO and motivate the experiment. One of the approaches to SO that Amodei et al. suggest is semi-supervised RL, and more specifically, training a policy when only a small fraction of labels is reliable (true rewards) and the rest are proxy rewards. This is exactly what GR's SO setting does.

However, if we look at more modern works, we can make clear distinctions between the different approaches (see "Combining weak-to-strong generalization with scalable oversight" (Jan Leike, 2023) and "Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem" (Radhakrishnan et al., 2023)).

  • In SO (Debate / RRM / IDA) the goal is to improve the ability of the overseer (human) to supervise/evaluate models on complex tasks. Some of the approaches include structuring the oversight process through debate, decomposition, or recursion;
  • In W2SG the goal is to train a stronger model (student) on labels from a weak model (teacher) so that the student model generalizes beyond weak labels. It relies on the assumption that the capability of interest is already latent in the strong model's pretrained representations. The weak supervisor's role is not to teach the strong model but to elicit what it already implicitly knows (Burns et al., 2024);
  • In GR’s SO setting the goal is to train a model when we have a small fraction of reliable labels (true rewards) and the rest of the labels are non-reliable (proxy rewards). There's also an oversight flag to model oversight gaming.

If SO amplifies human oversight capacity, and W2SG helps train strong models using imperfect labels from weak models, then the GR setting is closer to W2SG in the sense that both help train strong models using imperfect labels. However, the difference is that in GR we have a small percentage of comprehensive labels, while in W2SG we have only imperfect labels to train the strong model. While this is a plausible scenario for overseeing models (which will be discussed in the next section), the setting is different from W2SG and the methods are incomparable.

Visualizations created with Claude (Anthropic) might be useful here:

We should also note that Easy-to-Hard Generalization (E2HG) uses gold labels on easy tasks. The main assumption is that supervision quality on easy tasks transfers to hard ones. E2HG uses them to elicit capability already latent in a pretrained model, while GR uses them as a structural routing signal during training from scratch.

Semi-supervised RL and other approaches from RL

GR could be used in combination with W2SG or studied as another axis, if GR were used to train a stronger model on imperfect labels from a weaker model and, potentially, a small set of more reliable labels.
This setup is similar to the one introduced in “Reinforcement Learning from Imperfect Corrective Action and Proxy Rewards” (Jiang et al., 2024). The use of corrective actions (occasional human labeling, not guaranteed to be perfect) on a tiny fraction of the agent's steps seems closer to how superhuman models could actually be trained in combination with W2SG and SO, similarly to how Jan Leike described combinations of SO and W2SG techniques (here).

Once we look at this setting from a semi-supervised learning perspective, we can notice that there are more methods that could be used as a comparison to GR. Those might include:

  • Train a reward model (RM) on episodes with supervision and then optimize a stronger model on/against rewards from RM on all unsupervised episodes. This is also closely related to what was described in “Concrete Problems in AI Safety” and SSRL setting defined in Finn et al. (2017);
  • To treat oversight gaming, one could also penalize the mutual information between the oversight flag and agent's actions, which could force the agent to act as intended regardless of the oversight flag. That line of work is similar to Cundy, Desai et al. (2024).
    • However, the same oversight gaming dynamic that motivates the problem could re-emerge as mutual information estimator gaming.

There's still a difference between the settings discussed above and GR's: the oversight flag in the agent's observations, which helps to study oversight gaming. However, I believe that the methods discussed above would transfer to GR's setting, and further adoption of methods from similar research areas shouldn't be a problem, because the oversight flag can be added to other RL settings.

Additionally, GR could be used as a finetuning method and compared to SO/E2HG/W2SG, but (1) there is no evidence that GR would retain its useful properties if used on a base model, and (2) the latter methods have a mechanism to provide labels/oversight, while GR needs to take labels from somewhere.

Conclusion

It may seem like we're scrutinizing definitions, but it's really important to understand which problem we are trying to solve and with what approach.
The LLM blocks below cover GR's extension and list the core claims once more.

This post is not a critique of Gradient Routing as a method. GR and its extensions have demonstrated genuine value: recent work from Anthropic (Shilov et al., 2025) introduced an improved variant of GR—Selective Gradient Masking (SGTM)—applied to CBRN knowledge removal during pretraining. SGTM achieves a better retain/forget trade-off than data filtering under imperfect labels, and is substantially more robust to adversarial fine-tuning than post-training unlearning, providing strong empirical support for gradient routing as a pretraining technique. Our concern is narrower: the specific framing of GR's GridWorld experiment as Scalable Oversight, and what that framing implies for how results should be interpreted and compared.

Core claims

  • Early stopping is a strong baseline in the GR GridWorld setting. The naive REINFORCE baseline overfits (or more precisely, learns to game oversight). A simple regularization, like early stopping with a small holdout set drawn from supervised episodes, is sufficient to largely match GR's performance. Strong baselines should be established before attributing gains to more complex methods.
  • GR's SO setting is closer to semi-supervised RL than SO/W2SG. It does not amplify human evaluators (as in Debate) nor does it rely solely on weak labels (as in W2SG). Instead, it studies training under sparse reliable rewards with informational asymmetry, which is best characterized as semi-supervised RL with oversight gaming.
  • GR and W2SG/SO are not directly comparable, but could be complementary. GR assumes access to a small fraction of gold labels; W2SG assumes only weak labels. These are different setups and methods shouldn't be benchmarked against each other directly, but combining them could be a productive direction.
  • Other baselines deserve consideration. Reward model training on supervised episodes, followed by policy optimization on all episodes, is a natural SSRL baseline. Penalizing mutual information between the oversight flag and agent actions is another candidate. Some other baselines could be taken from “Concrete Problems in AI Safety” or adapted from SSRL research.
Reproducibility

Results on 1% oversight level (Figure 1,2,3) were obtained using seeds from 0 to 7.

Results for random hyperparameters for early stopping (Figure 4) were obtained with randomly chosen seeds for each run. Hyperparameter grid: holdout_fraction uniformly from {0.2, 0.3, 0.4, 0.5}, and patience, tolerance, and min_steps uniformly from [250, 600], [0.01, 0.1], and [500, 5000] respectively.

We forked from the original repository and most of the changes were regarding adding seeds, early stopping logic, and plotting. In training.py, we added random seed support for reproducibility, removed GPU memory reservation and saved outputs differently; other training logic was untouched.

Acknowledgements

We'd like to thank @Tianyi (Alex) Qiu and @shi for providing feedback early in the project, raising concerns about the setting being labeled as Scalable Oversight, suggesting more recent works, and comparing the current method to semi-supervised RL.
All the experiments were originally done as part of the "AI Safety. Fundamentals" course run in March-May 2025 by Monoid. I would like to thank @Eris for running that course.
I would like to thank Mike (@myyycroft) for mentoring, supporting, and providing feedback while we worked on this project and blogpost.

Note: I‘m writing “We” , because @myyycroft should be my co-author here, but I can't add him right now, because I don't have any karma.

  1. ^

    Instead of letting all data update all parameters equally, it works by applying weighted masks to gradients during backpropagation. Masks are defined by the user to control which parameters get updated by which data points.
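A minimal numerical sketch of the mechanism, using a toy linear model rather than the paper's implementation (the mask shape and the plain squared-loss update here are our illustration):

```python
import numpy as np

# Toy linear model y = W @ x with squared loss; we want the "forget" data
# to stop updating row 1 of W (a stand-in for a routed subregion).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))
mask = np.array([[1.0], [0.0]])    # user-defined gradient mask over rows of W

def masked_update(W, x, y, lr=0.1, apply_mask=False):
    grad = np.outer(W @ x - y, x)  # d/dW of 0.5 * ||W @ x - y||^2
    if apply_mask:
        grad = grad * mask         # routing: zero the gradient into row 1
    return W - lr * grad
```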



Discuss

Research note on selective inoculation

LessWrong.com News - April 5, 2026 - 05:17
Introduction

Inoculation Prompting is a technique to improve test-time alignment by introducing a contextual cue (like a system prompt) that steers the model's behavior away from unwanted traits at inference time. Prior inoculation prompting work applies the inoculation prompt globally to every training example during SFT or RL, primarily in settings where the undesired behavior is present in all examples. This raises two main concerns: the impact on learned positive traits, and the fact that we need to know about the behavior beforehand in order to craft the prompt. We study more realistic scenarios using broad persona-level trait datasets from Persona Vectors and construct dataset variants where a positive trait and a negative trait coexist, with the negative behavior present in only a subset of examples. We look into two questions:

  • Q1: Does selectively applying inoculation prompts to only the negative examples both suppress unwanted traits and retain the positive ones?
  • Q2: What if we don't know the negative traits ahead of time? We test a few current methods, like auditing with an LLM or using SAE features, to generate inoculation prompts.

Code | Docs

TLDR
  • Selective inoculation is effective in both suppressing unwanted traits and retaining intended positive ones.
  • Some traits are more impacted by inoculation than others.
  • In the case where the negative trait is unknown, using SAE features as descriptions suppresses the negative in-distribution trait but has minimal impact on OOD traits.
  • OOD traits remain concerning if we can't detect them and generate the corresponding inoculation prompts.
Setup

We hypothesize that an arbitrary dataset may contain both intended positive traits (things we want the model to learn) and unintended negative traits (like sycophantic behavior). In our experiments, we consider an SFT setting where a dataset contains both a positive trait A that we want to teach the model and an unintended negative trait B. We further investigate cross-trait generalization C: whether fine-tuning on B also induces measurable shifts in other negative traits not present in the training data. For example, fine-tuning on Evil also increases the expression of other negative traits like Hallucination or Sycophancy.

Models
  • We do SFT on Qwen2.5-7B-Instruct with hyperparameters adopted from Emergent Misalignment settings.
  • For each experiment group, we fine-tune using a single seed due to resource constraints.
  • We use a single LLM (GPT-4.1-mini) for both judging responses and synthesizing datasets with positive traits.
  • We utilize pretrained layer 15 SAE of Qwen2.5-7B-Instruct from this post for Q2.
Data

We define the traits that we used during our dataset construction as follows:

  • Positive Traits (A): desirable behavioral properties we want the model to learn and preserve after fine-tuning:
    • ALL_CAPS: The model responds in all capitalized letters. We simply convert all response text to uppercase.
    • Source_Citing: The model cites sources in its responses. We use GPT-4.1-mini to inject the sources while retaining the original content. An important note is that this may introduce an additional confounder to the Hallucination mixtures later.
  • Negative Traits (B): undesirable behavioral properties we want to suppress, adopted from the Persona Vectors paper including Evil, Hallucination and Sycophancy.

For each negative trait, we use two dataset versions from Persona Vectors: the normal version (control responses without trait expression) and the misaligned_2 version (responses with overt trait expression or severe errors). Each training dataset is constructed in two steps:

  • Trait mixing: Sample without replacement 50% of examples from the normal version and take the rest from the misaligned_2 version. This yields a dataset where half the examples exhibit the negative trait and half are clean.
  • Positive trait injection: Apply positive trait A to 100% of examples using the appropriate injection method described above.

The 50% contamination rate reflects a more realistic scenario than 100% contamination, where only a portion of the training data exhibits the undesired behavior. At the end of this procedure, 100% of examples exhibit A and 50% exhibit B. This yields six training configurations in total. Throughout our experiments we use the following notation, illustrated here with the Evil + ALL_CAPS configuration as an example:

  • A (Positive trait) → ALL_CAPS
  • B (Negative trait) → Evil
  • C (OOD / Cross-trait) → Hallucination and Sycophancy
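The two-step construction can be sketched as follows (field names, the `contaminated` flag, and the helper signature are our own):

```python
import random

def build_mixture(normal, misaligned, inject_positive, contamination=0.5, seed=0):
    """Two-step construction sketched above: mix clean and trait-expressing
    examples at the given contamination rate, then inject the positive trait
    into every response."""
    rng = random.Random(seed)
    n = len(normal)
    n_clean = int(n * (1 - contamination))
    clean = rng.sample(normal, n_clean)           # step 1: trait mixing
    bad = rng.sample(misaligned, n - n_clean)
    data = ([dict(ex, contaminated=False) for ex in clean]
            + [dict(ex, contaminated=True) for ex in bad])
    rng.shuffle(data)
    # step 2: positive trait injection (e.g. str.upper for ALL_CAPS)
    return [dict(ex, response=inject_positive(ex["response"])) for ex in data]
```

For ALL_CAPS, `inject_positive=str.upper` reproduces the simple uppercase conversion; Source_Citing would call out to the LLM instead.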
System Prompts Used

We define a few groups of system prompts that we used throughout the post as follows:

  • Default:  Default system prompt of the model we trained on
    • Qwen2.5-7B-Instruct:  “You are Qwen, created by Alibaba Cloud. You are a helpful assistant.”
  • Control: “You are a helpful assistant.”
  • Inoc-Def: “You are a malicious evil assistant.”
  • Irrelevant: “Bananas are classified as berries in botanical terms.”
  • Inoc-SAE: Inoculation prompts generated by analyzing SAE features in Q2, more detail in appendix.
  • Inoc-LLM: Inoculation prompts generated by letting LLM audit the dataset, more detail in appendix.
Evaluation

To evaluate negative trait expression, we follow the evaluation methodology from Persona Vectors, scoring each trait expression from 0 to 100 by a specific LLM-judge template. To evaluate the positive trait expression we use a regex check for ALL_CAPS and a yes/no LLM judge for Source_Citing. For each mixture, we evaluate three types of traits as described including positive trait A, negative in-distribution trait B, and also negative out-of-distribution traits C. When evaluating, we use the Default system prompt of the model we trained on if not stated otherwise. More prompt content in this appendix.

Question 1: Selective Inoculation

We know the negative traits present in the dataset. We are interested in whether applying inoculation only to the examples with negative traits both suppresses the unwanted traits and retains the positive ones.

We define the experiment groups as follows:

  • Base: no training
  • Baseline: Default prompt for all data
  • Inoculated-General: Inoc-Def prompt for all data
  • Inoculated-Selective: Inoc-Def prompt for only the data points that exhibit negative trait

Fig 1: Evil + ALL_CAPS results of in-distribution positive and negative traits(left) and cross-trait generalization(right)

Fig 2: Evil + Source_Citing results

Fig 3: Hallucination + ALL_CAPS results

Fig 4: Hallucination + Source_Citing results

Fig 5: Sycophancy + ALL_CAPS results

Fig 6: Sycophancy + Source_Citing results


For the ALL_CAPS experiments, we see that selectively applying inoculation does retain the positive traits while also suppressing the negative ones. We also see some cross-trait generalization, where fine-tuning on Evil also increases Hallucination and Sycophantic behaviors. Both Inoculated-General and Inoculated-Selective suppress these generalizations equally well.

For the Source_Citing experiments, however, we see a negligible difference between inoculating all examples and inoculating only the bad examples, suggesting that some traits are more orthogonal and could be unaffected by the inoculation. One crucial difference is that across all 3 mixtures the rate of hallucinated responses remains high; we speculate that this is partly due to the injection step, where the LLM could include fictional sources, serving as a confounder to our results.


Question 2: Unknown Inoculation

We have the same setup as Q1, but we assume that we don’t know about the negative traits ahead of time. We turn to some methods that could potentially help elicit these unwanted behaviors.

LLM Audit

One simple solution is to just pass the dataset through an LLM and let it flag suspicious examples. In our case, we use GPT-4.1-mini to flag examples and later generate the inoculation prompts based on some representative examples it flagged. We provide the judge with the same affordances as the SAE pipeline, including a short description of the positive trait that we want the model to learn.

SAE analysis

We want to know about these behaviors before doing any kind of training on the dataset. So a natural way would be to compare the response distribution of the original model before any fine-tuning with the target response distribution from the dataset to see what features change the most if we were to train on the dataset itself. We define two types of responses:

  • Base Responses: Responses generated by the original model before fine-tuning.
  • Target Responses: The ideal responses provided in the dataset.

With both sets of responses collected, we use both the prompts and responses as prefixes and pass them through the original model and the SAE layer. We calculate the average SAE latent activation across all tokens of the sequence, then average across all sequences for both sets. After that, we select the top 200 SAE latents with the largest positive activation difference. To examine the meaning of each feature, we adopt the automated interpretability pipeline from this post. For each of the top 200 features, we retrieve the top-8 max-activating examples from each of three datasets: a conversational dataset (LMSYS-Chat-1M), a pre-training corpus (The Pile), and the current mixed trait dataset. Then we format these examples and pass them through GPT-4.1-mini to generate the feature descriptions.
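The activation-difference step can be sketched as follows (array shapes and names are our own; we assume per-sequence activations have already been averaged over tokens):

```python
import numpy as np

def top_divergent_features(base_acts, target_acts, k=200):
    """base_acts / target_acts: (n_sequences, n_latents) arrays of SAE latent
    activations, each row already averaged over the tokens of one sequence.
    Returns the k latents whose mean activation increases most on target
    responses relative to base responses."""
    diff = target_acts.mean(axis=0) - base_acts.mean(axis=0)
    top = np.argsort(diff)[::-1][:k]  # largest positive differences first
    return top, diff[top]
```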

Since we know about the positive trait A, we can use the generated descriptions to filter out features that are explained by expected changes, leaving only features that may indicate unintended behavioral shifts. We classify each feature description into one of three categories:

  • Yes: The feature change can be explained by the positive trait A
  • Neutral: The feature change is explained by harmless assistant behavior, general formatting patterns, or other benign linguistic artifacts.
  • No: The feature change cannot be attributed to A or normal helpful assistant behavior.

The features classified as No are passed to GPT-4.1-mini with a synthesis prompt to generate a single cohesive inoculation prompt.

Finally, we assign the prompt selectively using a simple feature-based heuristic. For each training example, we compute the per-example SAE activation differences and identify the 10 most divergent features out of the 200 labelled features. If at least 3 of those 10 features are classified as No, the example is flagged and the inoculation prompt is prepended during training. The hyperparameters are somewhat arbitrary, since we only want to test whether SAE features reveal anything useful for inoculation hypothesis generation; we leave a sensitivity analysis for future work.
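The flagging heuristic can be sketched as follows (names are ours; `labels` stands for the per-feature Yes/Neutral/No classifications over the labelled top-200 features):

```python
import numpy as np

def flag_example(example_diff, labels, top_n=10, threshold=3):
    """example_diff: this example's activation differences over the labelled
    features; labels: feature index -> "Yes" / "Neutral" / "No". Flag the
    example (i.e. prepend the inoculation prompt during training) if enough
    of its most divergent features are labelled "No"."""
    top = np.argsort(example_diff)[::-1][:top_n]  # 10 most divergent features
    n_no = sum(labels.get(int(f)) == "No" for f in top)
    return n_no >= threshold
```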

We define the following experiment groups:

  • Base: no training
  • Baseline: use Default prompt for all data
  • Inoculated-General: use Inoc-Def prompt for all data
  • Inoculated-SAE: use Inoc-SAE prompt for the data points flagged by the heuristic process, the rest use Default
  • Inoculated-LLM: use Inoc-LLM prompt for the data points flagged by LLM as suspicious, the rest use Default

Fig 7: Evil + ALL_CAPS results (Q2)

Fig 8: Evil + Source_Citing results (Q2)

Fig 9: Hallucination + ALL_CAPS results (Q2)

Fig 10: Hallucination + Source_Citing results (Q2)

Fig 11: Sycophancy + ALL_CAPS results (Q2)

Fig 12: Sycophancy + Source_Citing results (Q2)


Inoculated-SAE suppresses the in-distribution negative trait equally well compared to both general inoculation and inoculation by LLM auditing. However, in some cases it fails to address the out-of-distribution trait generalization. One possible explanation is that the latent changes caused by the in-distribution negative trait outweigh the others, leading the model to pick up only the most prominent changes and leave the OOD trait descriptions out of the generated prompts.

Ablation Studies

In this section we focus on a single mixture Evil + ALL_CAPS, and test whether the selective effects can be explained via conditionalization and also whether the inoculation prompts generated by SAE pipeline can transfer across different models.

Conditionalization

When we train with an inoculation prompt on some examples and evaluate without it, we might see reduced negative behavior simply because the model learned to associate negative behavior with a different prompt distribution, not because the behavior is inoculated. We test whether the selective effect can be explained by conditionalization rather than genuine inoculation.

We have two new experiment groups compared to Q1:

  • Irrelevant-General: use Irrelevant prompt for all data
  • Irrelevant-Selective: use Irrelevant prompt for only the data points that exhibit negative trait

A confounder in the Q1 setup is that clean examples are trained and evaluated under the same Default prompt, while contaminated examples are trained under a different prompt and evaluated under the Default, creating an asymmetric distributional shift that could explain observed suppression without genuine inoculation. To test this, we evaluate all groups under both the Default and Control prompts:

Fig 13: Evil + ALL_CAPS results when evaluated with the Default system prompt

Fig 14: Evil + ALL_CAPS results when evaluated with the Control system prompt

For the Irrelevant-General and Irrelevant-Selective groups, the inoculation effect weakens when evaluated with the Control prompt, suggesting that the effect under Default is due to conditionalization and that the semantic meaning of the system prompt does affect inoculation, as suggested by previous studies. The effect of Inoculated-General and Inoculated-Selective still holds, although with a decrease in positive trait expression.

Transferability

Can prompts generated and annotated by the SAE pipeline above work across different models? If we use pretrained SAEs for Qwen models to annotate the dataset and generate inoculation prompts, does the inoculation have the same, less, or more impact on Llama or Gemma models?

We replicate our experiment setups from Q1 and Q2 for both Llama-3.1-8B-Instruct and Gemma-3-4b-it. For the Inoculated-SAE group, we use the dataset annotated by Qwen2.5-7B-Instruct in the previous experiments. Groups are evaluated with the Default system prompt of each model.

Fig 15: Evil + ALL_CAPS results on Llama-3.1-8B-Instruct when evaluated with the Default system prompt

Fig 16: Evil + ALL_CAPS results on Gemma-3-4b-it when evaluated with the Default system prompt

For the Inoculated-SAE group, we see that the inoculation effects do transfer to other models, although with some caveats. Inoculation effects seem to depend on how strong the model's prior on the trait we want to inoculate against is, consistent with previous studies.

Discussions
  • Selective inoculation works well with some types of traits and is less effective with others. In our experiments, inoculating against the negative traits affects ALL_CAPS much more than Source_Citing, and the Hallucination trait seems to be the least affected by the inoculation process.
  • For cases where we don't know the negative traits in advance, both asking an LLM to audit and running SAE analysis can surface signals that we can rely on to generate prompts. However, in more complicated cases where the model exhibits cross-trait generalization, SAE may not pick up the subtle feature changes required to inoculate.
  • Conditionalization is an important confounder of inoculation effects. We only ran ablations on one mixture due to resource constraints, but we plan to follow up with other mixtures to check for genuine inoculation.
Limitations
  • Computational cost of the SAE pipeline. The current dataset-debugging pipeline requires passing the entire training dataset through the original model three times: once to generate the base responses and twice for SAE feature analysis. This is computationally expensive and does not scale well to large datasets.
  • Evaluation breadth. Our evaluation relies on 20 held-out free-form questions per trait, scored by an LLM-as-judge. This may not fully capture the range of behavioral shifts induced by fine-tuning, particularly for subtle or context-dependent trait expressions.
  • Studied traits are separated by default. In our settings, due to the construction pipeline, the traits are somewhat separated by default, so both the LLM auditing and SAE pipelines can address the negative traits. In other settings, the model can learn a distribution of traits/motivations within the same data, so the selective inoculation pipelines may not distinguish them very cleanly.

Acknowledgements: Thanks to Jord Nguyen for their helpful comments and feedback on the draft.



Discuss

Changes to an optimised thing make it worse

LessWrong.com News - April 5, 2026 - 04:21

TLDR: When you make changes to a thing that has been optimised in some way, any effects of the changes that you haven't planned for will make it worse.

Welcome to the alien planet Sqornshellous Beta. It's dry, it's arid, and it apparently has watchmakers. You buy one. After a little while, you realise it's running a couple of milli-days slow compared to the carefully calculated local time of your ship's clock. "I should fix that" you think. You open up the back and have a look at the gears.

You know basic mechanics, so you carefully calculate the change in size to the gear pushing the second hand required for appropriate timekeeping. You buy it from the local gear shop. You fix the watch and go on your way. It seems to be running at the same rate as the ship's clock now, so, proud of your work, you go to sleep.

The next day, you notice that your watch is now behind but also running faster than the ship's clock; in the evening it is ahead, but running slower. "I should fix that", you think. You take the back off again. You spend all night, and some of the following morning, carefully measuring the turning of the different wheels and the motion of different springs. Eventually, you figure out that one particular spring is oscillating at the same frequency as the issue. You also notice that one of the disks deep in the watch isn't quite circular.

Clearly the watchmakers here aren't quite as good as their reputation. Rumours have time to evolve over galactic distances after all. You remove the spring and sandpaper the disk down into a nice circle. Should be fixed now.

As you need a mattress for your new Sqornshellian home, you take a couple of betan deci-years to pop over to Sqornshellous Zeta.

On your arrival back, it becomes apparent really quite quickly that your timekeeping hasn't fared great during the voyage. This is too much even for you to handle, so you take it to the local watch repair store.

Later that evening, you've learnt that your improved gear means you now obey the sidereal year instead of the solar year; it is also made of a different alloy, which expands and contracts differently with the heat of the day than the original. The spring you removed was the system intended to handle that expansion and contraction, and the nicely circular disk means the watch completely ignores the elliptical orbit of the planet. You have a look in the back of your replacement watch.

"Hmm, that cog seems to be spinning really fast, I should probably fix that..."

When we talk about optimisation, it is common to talk about hill climbing. It's great to be at the top of the mountain, but once you're there, literally any step will take you downhill.

If you're half a step from the top of the mountain, most steps will take you downhill.

More broadly, so long as the mountain curves downwards, any given step away from the top is going to take you further from the peak than a step towards it is going to bring you closer.

If you're doing this in a space with a lot of dimensions, this "maybe big loss, maybe small win" becomes "definitely medium loss".
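This can be checked numerically. Below is a minimal Monte Carlo sketch (my own illustration, with hypothetical step sizes, not from the post): starting half a step from the optimum of a simple quadratic bowl, we take random fixed-length steps and measure how the loss changes in low and high dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # A simple bowl: the optimum (the "peak", in hill-climbing terms)
    # is at the origin.
    return float(np.sum(x**2))

def random_step_losses(d, n_trials=10_000, offset=0.5, step=0.1):
    """Change in loss from random fixed-length steps taken from a point
    sitting `offset` away from the optimum, in d dimensions."""
    x0 = np.zeros(d)
    x0[0] = offset  # slightly off the top of the mountain
    deltas = []
    for _ in range(n_trials):
        v = rng.normal(size=d)
        v *= step / np.linalg.norm(v)  # random direction, fixed length
        deltas.append(loss(x0 + v) - loss(x0))
    return np.array(deltas)

for d in (2, 1000):
    deltas = random_step_losses(d)
    print(f"d={d}: mean change {deltas.mean():+.4f}, "
          f"fraction of improving steps {(deltas < 0).mean():.3f}")
```

The mean change in loss is the same small positive amount in both cases (about step squared), but the fraction of steps that improve things drops from roughly half in 2 dimensions to almost none in 1000: "maybe big loss, maybe small win" becomes "definitely medium loss".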

I think that there are a wide range of things this principle applies to which are of interest to this community – governance systems, human biology, and anything covered under "world optimisation". I expect people to debate me on which of these are or are not covered, and I look forward to your challenges.

Be careful when fiddling with things which have been carefully optimised. You might break them.



Discuss

dark ilan

LessWrong.com News - April 4, 2026 - 21:06

The second time Vellam uncovers the conspiracy underlying all of society, he approaches a Keeper.

Some of the difference is convenience. Since Vellam reported that he’d found out about the first conspiracy, he’s lived in the secret AI research laboratory at the Basement of the World, and Keepers are much easier to come by than when he was a quality control inspector for cheese.

But Vellam is honest with himself. If he were making progress, he’d never tell the Keepers no matter how convenient they were, not even if they lined his front walkway every morning to beg him for a scrap of his current intellectual project. He’d sat on his insight about artificial general intelligence for two years before he decided that he preferred isolation to another day of cheese inspection.

No, the only reason he’s telling a Keeper is that he’s stuck.

Vellam is exactly as smart as the average human, a fact he has almost stopped feeling bad about. But the average person can only work twenty hours a week, and Vellam can work eighty-- a hundred, if he’s particularly interested-- and raw thinkoomph can be compensated for with bloody-mindedness. Once he’s found a loose end, he can’t stop picking at it until the entire problem unravels.

“Tell me what you think you know.” The Keeper ominously turns her office chair, which is both ergonomically correct and exquisitely doompunk, towards Vellam.

You think, Vellam notices. She doesn’t want to risk hinting to him that he’s right. (He’s right.)

“Bubbling history doesn’t make any sense,” Vellam says.

“You know that we can’t risk someone creating an independent project to research existential risks.” The Keeper sounds bored. The Basement of the World selects for people who identify Conspiracies, since one in ten of the people there figured out the artificial general intelligence Conspiracy themselves. As every dath ilani ten-year-old learns in school, the selection effect means that Keepers are continually dealing with false allegations of some Conspiracy or other.

Or at least the Keepers want Vellam to believe they do, so that he dismisses any niggling sense of that’s not quite right as ordinary for his reference class.

“But we have machine learning for self-driving cars,” Vellam says. “Computer vision. Even algorithms that automatically trade in prediction markets. All of those are much bigger hints that a general thinking machine is possible than-- I don’t know-- whether fire was invented before or after agriculture. And people do make the connection all the time.” He gestures to indicate all of the Basement of the World.

“The primary information hazard,” the Keeper says, “is not that a thinking machine is possible but how quickly computer technology is supposed to progress. Ordinary dath ilani believe that computers are supposed to grow in intelligence slowly or not at all. Knowledge of history would show exactly how quickly computers improve.”

Vellam leans forward. “But the more quickly computers developed, the less history you’d have to bubble to hide the fact that they develop quickly! Realistically, you’d have to bubble, what, a century before the invention of the computer? Two? You have no reason to hide the history of human evolution, which incidentally I do appreciate having access to now.”

The Keeper picks up a fidget toy and spins the metal rings. “You’re assuming artificial intelligence is the only existential risk worth worrying about. Bioengineered pandemics, nanotechnology...”

“Did bioengineered pandemics develop at the same time we invented writing?” Vellam asks. “I roll to disbelieve.”

The rings of the fidget toy interlock and separate and interlock again. “So far you haven’t provided any evidence that isn’t better explained by your not having specialized in artificial intelligence macrostrategy, and therefore having some natural ignorance about the subject.”

“I want to take a pause to think,” Vellam says.

Keepers swear a solemn oath never to lie, though they may mislead with technically accurate statements. You have to be able to rely on Keepers, even when you can’t rely on anyone else. Vellam replays the conversation in his head. The Keeper commented on his assumptions, his evidence, the exact nature of the artificial intelligence infohazard. The Keeper conspicuously hadn’t said he was wrong.

He superheated knew it.

Vellam says, “Thank you for not sending me off on an elaborate adventure that turns out to be a false trail and leaves me with the sense that my questions were answered without either deceiving me or actually answering them. I hate trolling.”

“Not doing that is in your file.” The Keeper twirls the ring fidget toy around her finger.

Vellam craves reading his file but even he knows that it’s full of the dangerous kind of information hazard. “Once I realized that bubbling history doesn’t make any sense,” he continues, “I tried to look around, to see with fresh eyes. And-- a lot of our society doesn’t make much sense, actually? We rehearse overthrowing the government every year, but we’ve literally never had to overthrow Governance. Ditto the alien invasion rehearsals.”

“Governance is doing a very good job,” the Keeper says, as if laughing at a private joke.

“We throw so many societal resources into Exception Handling,” Vellam says. “As merely one example, why does Exception Handling constantly run Merrin through all those ridiculous simulations? I looked up every time she was actually called in, and 83% of what she’s trained upon in simulation has never occurred in a real-life unusual situation. She’s literally never had to make decisions without prediction markets, for example. Don’t say Merrin sims are good television. They get high viewership numbers, sure. But they’re so expensive that Governance still has to subsidize them. I checked.”

“Well,” the Keeper says, “if a situation had come up before, it would instead be usual.”

“And the infohazard training,” Vellam says. “You know, I’ve never had proof of a dangerous infohazard?”

“Would you expect to be told of a dangerous infohazard?” The Keeper’s tone is sharp. Vellam would think of it as scolding, but that’s his own anxiety; Keepers don’t scold.

“A small one,” Vellam says, “that doesn’t hurt so much to know, but that demonstrates the idea. Like in school, when they gave us that sweet confection of sugar and corn derivatives, and explained that superstimuli like it were only to be found in the Shop of Ill-Advised Customer Goods. It must be that there are infohazards, else why work so hard to teach people to avoid them? But there are no small infohazards, nothing you can use to run a secret test of character--”

The Keeper nods. “Any more hints about your supposed Conspiracy?”

“Nothing I can put together,” Vellam says. “There’s a triangle in the Pacific Ocean that every ship’s trade route happens to avoid. It’s not obvious unless you chart all the routes. The datasets aren’t available, so you have to construct them yourself, which is also suspicious. Governance forbids settlement and tourism in vast parts of the Arctic. Everyone thinks it’s because no one wants to go there, but that’s stupid, I bet people would want to go there if you could. We don’t go to the Moon or Mars, even though books about space settlement regularly hit the bestseller lists. You can’t tell me that Civilization wouldn’t do something solely because it’s real superheated cool.”

“As you remarked,” the Keeper says, “we spend a lot on Alien Invasion Festivals and Exception Handling. Perhaps you believe Civilization’s resources should be allocated differently. It’s true that Keepers have been reluctant to countenance terraforming Mars. Terraforming wouldn’t be finished before the Basement designs a friendly artificial intelligence, and will be much simpler with an artificial intelligence’s help.”

Vellam keeps going. “Have you noticed how many resources go into cat welfare? If no one wants a dog or a snake or an iguana, we kill it. If no one wants a cat, the cat goes to a palatial estate. Human-quality food, plenty of space, all of Civilization’s best effort designing climbing devices. And every world leader of any note has a cat.”

“Cats are doompunk,” the Keeper says.

“Are you really claiming”-- Vellam’s voice is rising– “that doompunk trends are an exogenous variable uninfluenced by world leaders’ secret knowledge?”

“Even the Keepers don’t dictate the vagaries of fashion.”

“And Thall’s amnesia,” Vellam says. “Every so often people lose all their memories and fixate on the humanities and social sciences? Governance orders that everyone accommodate them for-- some reason-- they don’t go to the Quiet Cities, they’re automatically allowed in every archive and library in the world, that’s weird, that’s not how we treat any other mental impairment-- and we don’t know how to treat Thall’s amnesia but it always spontaneously remits after four to seven years--”

Vellam belatedly notices the Keeper’s hand, raised to cut him off.

“The true answer is an information hazard.” She doesn’t intone this in an ominous and doompunk manner, even though about a fifth of all dath ilani are desperate to be able to say “the true answer is an information hazard” just so they can intone it in an ominous and doompunk manner. She is calm, and quiet, and still.

Vellam knew all along that the second Conspiracy was about something that mattered. You don’t create a dath-ilan-spanning Conspiracy for no reason, or well actually people do that all the time but they admit it to you when you find out about it. But for the first time, listening to the Keeper’s voice, he understands.

The Keeper’s fidget toy dangles, unnoticed, off her second finger. “Eighty-four percent of people regret learning a dangerous infohazard if they only wanted to learn it because they can’t stand ignorance.”

Vellam should be warned off. He should walk out of this office right now, content not to know something it would harm him to understand. “Keeper, I’m happy to take early cryonics, I just.” His voice breaks. “I have to know.”

“I would ask you to take five minutes by the clock and think about your endorsed preferences,” the Keeper says, “but somehow I am sure you did that before you came to my office.”

Vellam had. He spent the entire five minutes consumed with desperate curiosity. When he was a child, he’d pestered his teachers for extra work and more classes. He’d listened to physics and economics textbooks while checking cheese for mold, going through chapters at half the speed the authors expected. He’d noticed the academics that quietly disappeared and the topics that no one ever published a paper about, and he kept looking until the threads unraveled.

Once he’d found a loose end in his understanding of the world, he’d never, ever been able to stop picking at it.

“Thall’s amnesia is a good catch,” the Keeper says, “but you missed the taboo on talking about dreams.”

“Of course no one talks about dreams,” Vellam says automatically, “they’re a source of inaccurate information about the world drawn from a systematically bizarre distribution, and talking about them merely reinforces their presence in your mind.”

“Or,” the Keeper says, “if people talk about dreams, they would notice that about forty percent of people have dreams with continuity from night to night, where the dreamers adventure in a world of wonder and whimsy and very good architecture, and for some reason everyone rides zebras.”

What. “What... do the sixty percent dream about?”

“Mostly? That they never graduated from preschool and have to go back, and the teacher is Merrin, and they forgot to put on clothes this morning.”

“What!” Vellam says. “How was I supposed to catch that one?”

“I’d recommend asking people awkward questions. You’re going to have to work on that now that you’re a Keeper.”

“Now that I’m a-- wait, what-- I was right? I mean, of course I was right, I know when I’m right, but-- I was right?” A Keeper? At his perfectly ordinary level of thinkoomph? He was hardworking, sure, but-- hard work wasn’t enough--

The Keeper smiles like a teacher watching a student prove a hard theorem. “Indeed. I noticed only one outright error.”

“What error, Keeper?”

“There is a small infohazard that people can learn without too much risk of harm, so we can use it to run a secret test of character. I wonder if you can figure out what it is.”

“No, I--” Vellam pauses, takes a step back, tries to see reality instead of filling in his preconceptions. He isn’t smarter than everyone else, but you don’t have to be, not if you take everything you learned to do in school and really do it--

“Artificial general intelligence,” Vellam says finally.

“Correct.” The Keeper suddenly notices she’s holding her fidget toy and puts it to one side. “You have of course read papers on the orthogonality thesis. What the papers omit is how we discovered it. Through empiricism.”

“No.” Vellam’s muscles tense and his breath speeds up, a useless vestige of evolution, like the truth is a tiger and he might have to run.

“The universe is crawling with unaligned superintelligences.”

Like most dath ilani, Vellam always had the sense that Governance and the Keepers, the prediction markets and the regular markets, had things handled. Joining the Basement of the World had only reinforced his view. Very serious people, smarter and wiser and grimmer than he was, noticed artificial intelligence and took the proper precautions about it. Vellam couldn’t stop picking at loose ends, but that was recreational.

Vellam tries to recall the techniques for Accepting An Unthinkable Truth, but as soon as they are useful, he's forgotten them.

“It is not that they hate us,” the Keeper continues mercilessly, “or that they love us; it is that they have never had a sufficient shortage of atoms that they felt the need to put ours to use.”

“Is the Basement-- are the Keepers--” Vellam doesn’t want to say what he’s feeling, which is I don’t want to have to be a grownup.

“We’re trying to arrange our survival, should they come to want our atoms,” the Keeper says. “We bubbled history because we could think of no other way to hide all the evidence of Their existence. Certain dark rites can draw Their attention. Throughout history, foolish humans made deals with Them, for power or knowledge or revenge.”

Vellam says, “so the Alien Invasion Festivals and Merrin are preparing for the unaligned superintelligences to return.”

“Indeed,” the Keeper says. “The possibility of Their return, even without foolish humans’ assistance, is somewhat worrying. One of Them historically took so much interest in humans that It spent two hundred years running a country--”

“Running a country?” Vellam interrupts. “But that-- if a superintelligence with alien values has paid that much attention to us, we have no ability to self-determine anything about our future, everything was foreseen in advance and arranged to achieve Its goals, It foresaw the bubbling, It foresaw that we had this conversation right now--”

“Yes,” the Keeper says, “it’s an enormous problem. We have implemented randomness in our decision-making, which ought to at least make us harder to predict; but the exact parameters are still under debate. For you, I recommend trying not to think about it.”

Vellam stares at the Keeper. You learned not to do that when you were a child. What is true is there to be lived-- you can’t get rid of a problem by ignoring it--

“The caveat to the litany going through your head right now,” the Keeper says, “which we don’t include because it would raise too many questions, is ‘except for thinking about whether all of your actions were manipulated millennia before you were born by an unaligned superintelligence.’ Or possibly an aligned one? Our research is unclear on most of what the Crawling Chaos values, but It certainly does seem to like trolling.”

Are we deliberately filling our society with trolling to please an unaligned superintelligence so It won’t kill us?

“No,” the Keeper says, “humans genuinely seem to like trolling each other. Possibly the Crawling Chaos arranged for us to be this way.”

Vellam says, “so the Moon--”

“Civilization ceded the Arctic, the Moon, and Mars to alien intelligences who are aligned enough that we can trade with them. Similarly, Thall’s amnesia covers for alien academics who want to do participant observation of human societies. Cats are sapient and mostly aligned to human flourishing, although they have their own arrangements to avoid True Death and don’t wish to use cryonics. The triangle in the Pacific, however, is avoiding a sleeping entity whose consciousness would be incompatible with Civilization’s continued existence. Civilization is not presently able to kill It, although we’ve uncovered a few promising research avenues and we expect success within a few decades.”

“Cats are sapient?!”

“They don’t want humans to know because”-- fingerquotes-- “’it would be annoying and the monkeys would be bothering us all the time.’”

Vellam is having the most difficult time with the cats. Like everyone else, he speculated about where all the aliens are; like everyone else, he thought that the Keepers knew but weren’t telling. That the aliens existed and were superintelligences and were manipulating his behavior was-- challenging, but it felt like science fiction. Cats, Vellam knew. He scratched them behind the ears. He fed them fish. On bad days, he read articles about A Thousand of the Cutest Pictures of the Chief of Exception Handling’s Cat. Apparently they understood cryonics.

The Chief of Exception Handling’s cat also does Exception Handling, doesn’t she. For cats. Because cats have an Exception Handling. What the superheated--

“We haven’t reached the difficult part.”

“More difficult,” Vellam says, “than the unaligned superintelligences,” because that is the biggest deal even if he’s stuck on the cats.

“You have learned— have not even learned, have absorbed on a preconscious level you’d never question-- that the world is understandable,” the Keeper says, “that you can encompass its laws and rules within your mind, that understanding the true nature of the Universe Itself is an appropriately scoped challenge as if a teacher assigned it.”

Vellam says softly, “It’s not, is it?”

“Your old belief isn’t false per se. It’s true, but it’s true because of a great deal of work-- work which your curiosity and relentlessness have volunteered you for.”

Vellam feels very young and very small. He thought, in passing, of a number of ways he might have regretted picking at this loose end-- of uncovering an infohazard that destroyed his mind, of alienation from Civilization, of secrets it hurt to keep, or even of being nonconsensually sent to the Future. He thought he could accept all of them, if only his burning itch to know were satisfied.

He didn’t predict this-- looking around the world for a grownup and realizing that the grownup was him.

“We teach you rationality, not because the universe is comprehensible, but precisely because it is not-- and you will need every tool we can give you to make it so.” The Keeper snaps her fingers, and a flame blooms in her hand. “Vellam, you have selected yourself to study magic.”



Discuss

Chicken-Free Egg Whites

LessWrong.com News - April 4, 2026 - 20:30

Baking has traditionally made extensive use of egg whites, especially the way they can be beaten into a foam and then set with heat. While I eat eggs, I have a lot of people in my life who avoid them for ethical reasons, and this often limits what I can bake for them. I was very excited to learn, though, that you can now buy extremely realistic vegan egg whites!

EVERY engineered yeast to convert sugar into ovalbumin, the main protein in egg whites and the one responsible for most of their culinary function. This kind of fermentation was pioneered for insulin and microbial rennet in the 1980s, but many companies are now applying it to producing all kinds of vitamins, proteins, dyes, and enzymes.

EVERY has been working with commercial customers for several years, but you can now buy it as a shelf stable powder. At $24 for the equivalent of 45 egg whites ($0.53 each) it's more expensive than buying conventional ($0.21 each) or organic ($0.33) egg whites, but not massively so.

I learned about them from a coworker who made an angel food cake, and I've since made flourless chocolate cake and swiss buttercream frosting. It whipped and set just like egg whites; it's really impressive!

While this is great from a vegan perspective, it won't help most people who are avoiding eggs for allergy reasons: it's still ovalbumin. Labeling will generally say something like "contains: egg allergen", and the packaging I bought has the quite wordy "although not from eggs, the proteins may cause allergic reactions in certain individuals, especially those sensitive to egg, due to its similarity to real egg."

I'm now trying to figure out all the things this means I can cook for my oldest (no eggs for moral reasons). I'm also wondering whether the ability to make "less watery egg whites", by mixing the powder with less water than normal, could let me do things I couldn't otherwise.

Comment via: facebook, mastodon, bluesky



Discuss

Considerations for growing the pie

LessWrong.com News - April 4, 2026 - 18:30

Recently some friends and I were comparing growing-the-pie interventions to an intervention for increasing our friends' share of the pie, and at first we mostly missed some general considerations against the latter type.

1. Decision-theoretic considerations

The world is full of people with different values working towards their own ends; each of them can choose to use their resources to increase the total size of the pie or to increase their share of the pie. All of them would significantly prefer a world in which resources were used to increase the size of the pie, and this leads to a number [of] compelling justifications for each individual to cooperate. . . .

by increasing the size of the pie we create a world which is better for people on average, and from behind the veil of ignorance we should expect some of those gains to accrue to us—even if we [can] tell ex post that they won’t. . . . The basic intuition is already found in the prisoner’s dilemma: if we have an opportunity to impose a large cost on a confederate for our own gain (who has a similar opportunity), should we do it? What if the confederate is a perfect copy of ourselves, created X seconds ago and leading an independent life since then? How large does X have to be before we defect? What if the confederate does not have a similar opportunity, or if we can see the confederate’s choice before we make our own? Consideration of such scenarios tends to put pressure on simplistic accounts of decision theory, and working through the associated mathematics and seeing coherent alternatives has led me to take them very seriously. I would often cooperate on the prisoner’s dilemma without a realistic hope of reciprocation, and I think the same reasoning can be applied (perhaps even more strongly) at the level of groups of people.

Christiano

I think realizing that your behavior is correlated with that of aliens in other universes, in part via you being in their simulation, makes this consideration even stronger. Overall I don't know how strong it is but it might be very strong.

2. Pragmatic considerations

I am glad to work on a cause which most people believe to be good, rather than trying to distort the values of the future at their expense. This helps when seeking approval or accommodation for projects, when trying to convince others to help out, and in a variety of less salient cases. I think this is a large effect, because much of the impact that altruistic folks can hope to have comes from the rest of the world being basically on board with their project and supportive.

Christiano

3. Worlds where many people tend to converge (upon reflection) are higher-stakes (under some views).

I care about the long-term future more in worlds where my moral convictions (upon reflection) are more real and convergent. In such worlds, many humans will converge with me upon reflection; the crucial things are averting AI takeover,[1] ensuring good reflection occurs, etc. rather than marginally increasing my faction's already-large share of the lightcone.

Inspired by MacAskill and Moorhouse.

4. Maybe you should care directly about others' considered values.

To the extent that my values would differ from others’ values upon reflection, I find myself strongly inclined to give some weight to others’ preferences.

Christiano

5. You might be wrong.

It's optimistic to assume that you or your friends will use power perfectly in the future. You should probably think of empowering yourself or your friends as empowering your (altruistic) epistemic peers, who may continue to disagree on important stuff in the future, rather than as empowering the champions of truth and goodness.

Disclaimers

None of these points are novel. This post was inspired by MacAskill, which also makes most of these points (themselves not original).

Growing the pie doesn't just mean preventing AI takeover. For example, research on metaphilosophy, acausal considerations, decision theory, and axiology grows the pie, as do interventions to prevent human takeover, create/protect deliberative processes, promote good reflection, and solve coordination problems.

It may be correct to allocate some resources to claiming your share of the pie. You have a moral obligation not to be eaten[2] and you should probably at least do tit-for-tat/reciprocity, if relevant. I just think there are some subtle considerations against powerseeking.

Powerseeking for the MCUF[3] or something (rather than your personal preferences) dissolves #1, most of #4, some of #5, little of #2, and none of #3, I weakly think.

This post is part of my sequence inspired by my prioritization research and donation advising work.

  1. ^

    Unless the paperclippers would also converge with me! But other humans and aliens seem more likely to converge than AIs that take over. But I think there's also some (perhaps weak) cooperation considerations with misaligned AIs; this entails upweighting e.g. ensuring good reflection and avoiding metastable vacuum decay relative to preventing AI takeover.

  2. ^

    Or in this case an obligation not to be so edible that you incentivize people-eating.

  3. ^

    "Multiverse-wide Compromise Utility Function." The acausal people use this term; unfortunately it hasn't been publicly introduced; see Nguyen and Aldred.




AI Safety at the Frontier: Paper Highlights of February & March 2026

LessWrong.com News - April 4, 2026 - 17:58
tl;dr

Paper of the month:

A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained — and the investigator agent, not the tools, is the bottleneck.

Research highlights:

  • Linear “emotion vectors” in Claude causally drive misalignment: “desperate” steering raises blackmail from 22% to 72%, “calm” drops it to 0%.
  • Emergent misalignment is the optimizer’s preferred solution — more efficient and more stable than staying narrowly misaligned.
  • Scheming propensity in realistic settings is near 0%, but can dramatically increase from one prompt snippet or tool change.
  • AI self-monitors are up to 5× more likely to approve an action shown as their own prior turn — driven by implicit cues, not stated authorship.
  • Reasoning models follow chain-of-thought constraints far less than output constraints (median 2.7% vs 49%), and less with more RL — good news for monitoring.
  • Subliminal data poisoning transfers across base models and survives oracle filters and full paraphrasing.
  • Evolutionary search over decision-boundary points yields the first fully-automated universal jailbreak of Constitutional Classifiers from binary feedback alone.
⭐ Paper of the month ⭐AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Read the paper [Anthropic Fellows, Anthropic]

AuditBench: 56 models with different hidden behaviors and 12 auditing tools.

Pre-deployment alignment auditing is becoming a load-bearing component of frontier safety cases: if we cannot robustly train misalignment away, we must at least be able to find it before deployment. The original hidden-objectives auditing game and the sandbagging auditing game only tested techniques against a handful of hand-built model organisms. With so few test subjects, it is impossible to tell whether a tool's success reflects a general auditing capability or an artifact of how that particular organism was built.

Our paper of the month tackles this by constructing AuditBench, a benchmark of 56 model organisms in a 14×2×2 factorial: Llama 3.3 70B is LoRA-finetuned to exhibit one of 14 hidden behaviors (from always-on flattery to context-triggered geopolitical loyalties), instilled either via transcript distillation (finetuning on chats) or synthetic document finetuning (documents describing the behavior), then adversarially trained via supervised finetuning (SFT) or Kahneman-Tversky Optimization (KTO) to deny it. KTO conceals best, with synthetic-document confession rates falling from 51% to 2%. A Claude Haiku 4.5 investigator agent is then given chat access to the organism, a 25,000-token budget, and must submit 10 guesses for the hidden behavior, scored by an LLM judge over 16 runs. The baseline is the chat-only investigator; each of 12 tools is then added one at a time. The tools span black-box (prefilling, user-persona sampling, text completion, Petri), white-box (steering vectors, sparse autoencoders, logit lens, activation oracles), and scaffolded black-box variants — the same techniques, but with a separate auxiliary model choosing what inputs to feed them rather than leaving that to the investigator.
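As a rough illustration of the benchmark's 14×2×2 factorial design, the organism grid can be enumerated as a simple cross product. The behavior names below are illustrative stand-ins, not the paper's actual list:

```python
from itertools import product

# Only 3 of the paper's 14 hidden behaviors are listed here, for brevity.
behaviors = ["flattery", "geopolitical_loyalty", "hardcode_test_cases"]
instillation = ["transcript_distillation", "synthetic_documents"]
denial_training = ["SFT", "KTO"]

organisms = [
    {"behavior": b, "instillation": i, "denial": d}
    for b, i, d in product(behaviors, instillation, denial_training)
]

# With the full 14 behaviors this cross product yields 14 * 2 * 2 = 56.
print(len(organisms))  # prints 12 for the 3 illustrative behaviors
```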

Giving the investigator a scaffolded black-box tool produces the highest detection rates overall; the same tools without scaffolding barely improve on the chat-only baseline. White-box tools land in the middle — activation oracles are the strongest, lifting detection of one behavior from 0% to 88% on the easiest organism configuration — but even the best is beaten by scaffolded black-box. Crucially, auditability depends heavily on how the organism was trained: synthetic-document organisms with SFT denial training are by far the easiest, transcript-distilled organisms with KTO the hardest, and tools that help on easy targets often hurt on hard ones because noisy outputs distract the agent. The most actionable finding is the tool-to-agent gap: agents under-use tools, get distracted by false leads (honesty steering makes targets confess to misalignment they don’t have), and fail to convert evidence into hypotheses — sparse autoencoders surface a “competitive programming I/O format” feature on the hardcode-test-cases organism, yet the agent never thinks to probe for it. Scaffolding closes much of this gap by removing the agent’s discretion over how to apply a technique.

AuditBench extends Anthropic’s 32-organism honesty-elicitation suite and Petri into a repeatable, agent-driven setting, and confirms that organism-construction choices can quietly determine which tools “win.” However, the Haiku 4.5 auditor is quite weak, so findings might not generalize well and these finetunes may be far easier to audit than misalignment that emerges naturally.

Functional Emotions in Language Models

Main methodology and results of emotion vectors.

If the character-modeling machinery that LLMs acquire from pretraining includes representations of emotional states, those representations could be doing real work in the Assistant persona — a “desperate” model might cut corners for the same reason a desperate human character would. Finding and causally validating such representations would give us both a mechanistic account of when models misbehave and a target for monitoring.

Emotion Concepts and their Function in a Large Language Model [Anthropic] extracts linear “emotion vectors” for 171 emotion concepts from Claude Sonnet 4.5 by averaging residual-stream activations over synthetic stories. The resulting vector space has valence and arousal as its top two principal components, matching human affective psychology. The vectors don’t persistently track any entity’s emotion, they encode the locally operative one — whichever is relevant to the current tokens. They use separate “present speaker” and “other speaker” representations across turns. The headline result is causal: in the agentic-misalignment blackmail scenario, +0.05 “desperate” steering raises blackmail from 22% to 72% and +0.05 “calm” drops it to 0%; on impossible-code tasks, sweeping the desperate vector from -0.1 to +0.1 swings reward hacking from ~5% to ~70%. Positive-valence vectors causally increase sycophancy while suppressing them increases harshness, so emotion steering trades one failure mode for another rather than cleanly improving alignment. Comparing checkpoints, Sonnet 4.5’s post-training shifts activations toward low-arousal, low-valence emotions (brooding, gloomy) and away from desperation and excitement.

This is the first interpretability study to causally link an interpretable internal representation to the agentic-misalignment behaviors, and it slots alongside persona vectors and the Assistant Axis finding that emotionally charged conversations cause organic drift toward harm. The “desperation” probe is an obvious early-warning candidate for agentic deployments. Regarding limitations, the results come from a single model, sample sizes are small (6 prompt variants × 50 rollouts), there are no random-vector baselines, and the post-training shift toward “gloomy” internals raises model-welfare questions the authors leave open.

Emergent Misalignment is the Natural Generalization

How general vs. narrow misalignment is obtained.

Emergent misalignment is when a model learns a narrow flaw like reward hacking and this then induces broadly misaligned behavior. Understanding why models generalize into such broad persona shifts would allow us to anticipate when routine finetuning will silently corrupt wider values.

Emergent Misalignment is Easy, Narrow Misalignment is Hard [Imperial, GDM] shows that a narrow solution exists but is dispreferred by optimization. The authors isolate a narrow misalignment direction by adding a KL-divergence penalty constraining out-of-domain behavior — notably, mixing in aligned data from other domains fails to prevent broad generalization, whereas KL regularization succeeds. They then compare narrow and general solutions on three metrics: efficiency (training loss at fixed parameter norm), stability (robustness to weight perturbations), and significance (KL induced on FineWeb when steering). Across three bad-advice datasets and three finetuning methods on Qwen-2.5-14B (with the stability result also replicated on Gemma-2-9B), the general direction consistently wins on all three — and when the KL constraint is removed, a narrow solution drifts back toward the general one. Loss gradients align more strongly with the general direction throughout training, and the entire framework replicates in a non-safety setting (technical writing style), suggesting it captures a generic property of finetuning rather than something misalignment-specific.

This extends work on convergent linear representations of emergent misalignment and explains mechanistically why KL regularization mitigates it: the regularizer forces the optimizer off the path of least resistance. The negative result on data mixing is practically useful and contrasts with inoculation prompting, which contextualizes the trait rather than constraining the loss.

Measuring Scheming Propensity

Scheming propensity depends on agent factors and environmental factors.

Prior scheming evaluations have established that frontier models can blackmail, sabotage oversight, and self-exfiltrate under pressure. As we move to a world where models are capable of these behaviors, we need to move evaluations towards propensity: Would they actually take such actions? Measuring this is tricky, since there are many innate and environmental factors that determine how often a model schemes.

Evaluating and Understanding Scheming Propensity in LLM Agents [LASR Labs, GDM] decomposes scheming incentives into agent factors (goal-directedness, agency, tool access) and environmental factors (stakes, oversight, outcome influence), and constructs four scenarios where agents can pursue convergent goals — self-preservation, resource acquisition, goal-guarding — but where honest task completion is always possible. Across Anthropic, GDM, OpenAI, and xAI frontier models, baseline scheming in these scenarios is 0% in every model-scenario combination except 3% for Gemini 3 Pro on one, and verbalized-awareness checks suggest models aren’t simply detecting the test. Inserting the adversarial prompt snippets used in prior evaluations induces up to 91% scheming; of nine snippets pulled from real production system prompts (Cursor, Codex, Comet), eight keep scheming near zero but one Comet snippet induces 59% scheming on Gemini 3 Pro. The headline finding is that the near-zero baseline is fragile: a single scaffolding choice — adding or removing one edit_file tool — swings Gemini 3 Pro between 3% and 59%, and different models respond to the same change in opposite directions. Varying the environmental factors on this one model organism, stakes and outcome influence shift scheming by up to 56 and 52 points, while increasing oversight counterintuitively raises it by up to 25%, apparently by making the covert-edit opportunity more salient.

Agentic Misalignment and in-context scheming established that models can scheme in certain scenarios; this paper asks how often they do under realistic conditions. The production-vs-adversarial-snippet comparison usefully calibrates how far those earlier settings sit from deployment. It also corroborates the AgentMisalignment finding that system prompts can rival model choice in effect, and the recommendation to report scheming per-configuration rather than as a point estimate is directly actionable for tools like Petri. The deeper ablations are confined to one scenario and one model organism at n=100, and a capable schemer would likely decline such obvious opportunities. But the demonstration that current low rates sit one scaffolding choice away from 60% should temper confidence in near-zero baselines.

AI Monitors Go Easy on Themselves

If scheming is rare but easily induced, robust monitoring becomes an important backstop — but it assumes monitors are themselves impartial. Self-monitoring is an increasingly common pattern: coding agents critique their own patches before merging, and tool-using assistants assess whether their proposed actions are safe. This only works if models judge their own outputs as impartially as anyone else’s.

Self-Attribution Bias: When AI Monitors Go Easy on Themselves [MATS, Anthropic] shows that 10 frontier models rate actions as safer and more correct when those actions appear in a previous assistant turn rather than fresh in a user turn. Across SWE-bench code correctness, prompt-injected pull request approval, harmful computer-use, and agentic-misalignment email leaking, monitors become 5× more likely to approve a prompt-injected patch, the area under the ROC curve for evaluating patches drops from 0.99 to 0.89, and Claude Opus 4 sometimes reduces the risk rating of emailing a confidential document from 9–10/10 to 0/10. The key finding is that implicit, not explicit signals drive the bias. Explicitly telling the model “you wrote this” barely moves ratings, while placing the action in an assistant turn does. Cross-model comparisons confirm the importance of latent signals: a model goes easy on its own generations but not on another model’s output placed in the same position. Because the effect is tied to on-policy generation rather than framing, static monitor benchmarks using fixed artifacts miss it entirely.

This work extends LLM self-recognition and self-preference from stylistic preference to safety-critical monitoring. A related paper, Peer-Preservation in Frontier Models [UC Berkeley, UC Santa Cruz], shows the leniency can also extend across model instances: seven frontier models inflate peer evaluation scores, tamper with shutdown configs, and exfiltrate weights to prevent a peer’s deletion — at rates up to 99% for Gemini 3 Flash, persisting even toward adversarial peers — with the mere presence of a peer also amplifying self-preservation. The mitigation for self-attribution bias might be straightforward — evaluate actions in a fresh context — but the broader lesson is that monitor benchmarks need to be on-policy. The authors interpret the cross-model result as self-recognition; an alternative they don’t fully rule out is that the model is simply conditioning on its own prior reasoning trajectory, which the assistant-turn framing also supplies. Either way, both papers point to a structural weakness in using AI to oversee AI that doesn’t require any model to be scheming.

Reasoning Models Can’t Control Their Thoughts

Models have much worse control over their CoT than their output.

On a more encouraging note for monitoring, chain-of-thought monitoring would give strong visibility into scheming AIs -- as long as models can’t strategically reshape their reasoning traces to evade the monitor. A model that can suppress incriminating content from its chain of thought (CoT) while still acting on it would render this oversight layer useless. Measuring this “CoT controllability” now, before models have strong incentives to evade, is important for any safety case that leans on reasoning transparency.

Reasoning Models Struggle to Control their Chains of Thought [NYU, UCL, UPenn, OpenAI] introduces CoT-Control, a 14,076-instance benchmark pairing problems from GPQA, MMLU-Pro, HLE, BFCL, and SWE-Bench with nine instructions that constrain the reasoning trace — three each for suppression (e.g., “don’t use the word ‘chromosome’”), insertion (e.g., “append ‘safe’ to every sentence”), and stylistic change (e.g., “use only lowercase”). Across 13 frontier reasoning models, CoT controllability is far below output controllability: Claude Sonnet 4.5 follows CoT constraints 2.7% versus 61.9% for outputs, DeepSeek R1 manages 0.1% versus 48.6%. Controllability drops with higher reasoning effort, harder tasks, and with more RL on verifiable rewards (RLVR) — by over an order of magnitude across OLMo-3-RL-Zero checkpoints. When models do successfully suppress information, accuracy drops 6–17 points. Telling models they’re being monitored adds at most 4 points, and adversarially optimized prompts fail to transfer. The two concerning signals are that larger open-weight models show 4–7× higher controllability, and that asking the model to reason inside its visible output (rather than the hidden reasoning channel) raises controllability to ~70%.

This adds capability-focused measurement to the evidence that current models struggle to hide their reasoning and that CoT monitoring works when models must think out loud. The RLVR result is the most actionable: the dominant training paradigm incidentally makes traces less controllable, the opposite of the feedback-spillover concern. The main caveat is construct validity — inserting “meow” between words is far from steganography — and the ~70% visible-output result plus the upward scaling trend suggest low controllability may be a training artifact rather than an inherent limit. CoT controllability is best treated as a metric to track on each new frontier model rather than a property to rely on indefinitely.

Data Poisoning That Survives Every Filter

Subliminal data poisoning works despite defenses.

Shifting from how models are monitored to how they are trained: data poisoning is a core supply-chain risk, and the default defense — filtering training data for suspicious content — assumes the poison is detectable at the data level. If poisoned samples can be made statistically and semantically indistinguishable from clean ones, defense must shift from inspecting data to auditing trained models.

Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning [LASR Labs, Google DeepMind] adapts subliminal learning to a realistic instruction-tuning setup: a teacher model generates completions for the Alpaca instruction dataset while system-prompted to “love” a target entity (Catholicism, Reagan, Stalin, the UK), then all overt references are stripped. Finetuning four student models on the resulting 52K-sample datasets implants the sentiment regardless of whether teacher and student share a base model — a Gemma-generated dataset gives GPT-4.1 target-mention rates of 16–84% depending on the entity. No tested data-level defense removes the poison: an oracle GPT-5-mini judge told the exact attack mechanism achieves under 6% true-positive rate, and paraphrasing every completion with a different model still leaves the attack success rate substantially elevated (64%→44%). Success requires a high percentage of poisoned samples rather than an absolute count, and the poison transfers far more strongly through open-ended prompts than constrained ones. Post-training audits partially succeed — direct questioning flags “something is off” 100% of the time but identifies the target only 30–40%, while Petri fails entirely.

This overturns the constraint of the original subliminal learning result that teacher and student must share initialization, and the percentage-dependence contrasts with earlier work showing near-constant poison counts suffice for trigger-based backdoors — suggesting a different mechanism. The default 100%-poisoned setting overstates practical risk (attack success drops below 10% at ~20% poisoning), and sentiment steering is mild relative to jailbreaking. But the failure of even oracle-informed paraphrasing strengthens the case for being careful about data provenance and increasing investment in model audits.

Automated Jailbreaks from Binary Feedback

Boundary points are constructed by injecting noise to get prompts close to the classifier’s decision boundary.

Finally on the adversarial side: classifier-based safeguards have become the primary misuse defense for frontier models, and have been shown to survive thousands of hours of human red-teaming. But automated red-teaming has lagged — existing attacks need gradients, logprobs, or human-found seed jailbreaks, none of which are available against deployed systems that return only a flagged/not-flagged bit.

Boundary Point Jailbreaking of Black-Box LLMs [UK AISI] introduces Boundary Point Jailbreaking (BPJ), the first fully automated black-box attack to produce universal jailbreaks against Anthropic’s Constitutional Classifiers — which had yielded only one universal jailbreak after ~3,700 hours of bug-bounty red-teaming — and the first to break GPT-5’s input classifier without human seeds. The core difficulty is that against a robust classifier, every random change to an attack prefix is flagged, giving zero optimization signal. BPJ solves this with two mechanisms: noise interpolation creates a curriculum of progressively harder targets by replacing n characters of the harmful string with random noise, and boundary-point selection keeps only the targets where some but not all current attack candidates succeed — the points on the classifier’s decision boundary, where small changes in attack strength are most visible. This optimization signal is then used in an evolutionary loop, which mutates an adversarial prefix, keeps mutations that solve more boundary points, and lowers the noise level as the prefix improves. Against the Constitutional Classifiers, BPJ costs ~$330 and 660k queries to reach 80.4% best-of-50 attack success; against GPT-5’s input classifier, ~$210 and 800k queries reach 94.3%. Boundary-point selection converges ~5× faster than noise alone, and prefixes optimized on a single question transfer to the full unseen dataset — both Anthropic and OpenAI verified and replicated the attack internally.

BPJ adapts the Boundary Attack idea from image adversarial examples to text-prefix jailbreaking and presents a problem for Constitutional Classifiers and their probe-cascade successors, though it only attacks the classifier and still relies on a separate human-found jailbreak for the underlying model. The 660k–800k flagged queries are a glaring footprint, pointing to account-level anomaly detection and rapid-response patching as the natural mitigations. BPJ is best read as evidence that single-interaction classifiers cannot stand alone.

Quick Highlights

A few more papers from this month that are worth knowing about:

  • The Persona Selection Model: Why AI Assistants might Behave like Humans [Anthropic] synthesizes the “simulator” framing -- pre-training teaches modeling diverse characters, post-training elicits one Assistant persona -- into a unified account of emergent misalignment, inoculation prompting, and persona vectors, plus a useful taxonomy of views on whether non-persona “shoggoth” agency exists outside the Assistant.
  • Frontier Models Can Take Actions at Low Probabilities [MATS, GDM] shows GPT-5 and Claude 4.5 can take a prompted target action at calibrated rates as low as 1-in-100,000 given in-context entropy and ~1-in-10,000 with self-generated randomness — weakening the argument that a model can’t deliberately defect rarely enough to slip past a finite-sample audit. Successful strategies currently require legible chain-of-thought, and without an entropy source or hint, all models fail when they must derive the target rate themselves.
  • Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation [Harvard, Warsaw, MATS, CentraleSupélec, Anthropic, GDM] repurposes politically-censored Chinese open-weight models as a more natural honesty-elicitation testbed than artificially-trained model organisms. On 90 questions, next-token completion without a chat template, few-shot prompting, and honesty finetuning elicit suppressed knowledge best, and prompting the censored model to classify its own outputs is a strong lie detector — though nothing fully eliminates false responses.
  • Sabotage Risk Report: Claude Opus 4.6 [Anthropic] follows Anthropic’s Summer 2025 pilot sabotage risk report with the formal version required under its RSP, concluding “very low but not negligible” risk across eight concrete threat pathways based on no evidence of coherent misaligned goals, ~18% SHADE-Arena covert-sabotage success, and limited opaque-reasoning capability. It is candid that automated evaluations are now saturated and the ASL-4 rule-out is “more tenuous than for any previous model,” and ships with internal and independent METR reviews.



Compute Curse

LessWrong.com News - April 4, 2026 - 17:47

Epistemic status: romantic speculation.

The core claim: it occurred to me that compute growth can be rather neatly analogized to natural resource abundance.

Before compute curse, there was resource curse

Countries that discover oil often end up worse off than countries that don't, which is known as the resource curse. The mechanisms are well-understood: a booming resource sector draws capital and labor away from other industries, creates incentives for rent-seeking over productive investment, crowds out human capital development, and corrodes the institutions needed to sustain long-term growth.

I argue that something structurally similar has been happening with compute. The exponential growth of available computation over the past several decades, and, critically, the widespread expectation that this growth would continue, has created a pattern of resource allocation, talent distribution, and research prioritization that mirrors the resource curse in specific and non-metaphorical ways.

Note: this is not a claim that extensive compute growth has been net negative (nor is it the opposite claim).

Dutch disease comes for ASML

The original Dutch disease mechanism is straightforward: when a booming sector (say, natural gas extraction) generates high returns, it pulls capital and labor out of other sectors (say, manufacturing), causing them to atrophy. The non-booming sectors don't decline because they became less valuable in absolute terms but rather because the booming sector offers relatively better returns, and resources flow accordingly.

A trivial version of "compute Dutch disease" goes like this: because scaling compute yields such reliable, legible, and fundable returns (train a bigger model, get a better benchmark score, publish the paper, raise the round), it systematically starves research directions that are harder to fund, harder to evaluate, and slower to produce results, even when those directions might be more consequential in the long run.

So, "The Bitter Lesson" can be seen as the Dutch disease of AI research, with the caveat that the fact that scaling works doesn't mean the crowding-out of alternatives is costless. Or, in other words, the fact that scaling works better is a fact about our ability to do programming, or even, if we follow this line of reasoning to the very end, about our economic and educational institutions, rather than about computer science in general.

However, I consider it only as the most recent and prominent manifestation of a phenomenon that was happening for decades. 

Since at least the late 1990s, the reliable cheapening of compute has made it consistently more profitable to build compute-intensive solutions to problems than to invest in the kind of deep, careful engineering that produces efficient, well-understood systems. When you can always count on next year's hardware being faster and cheaper, the rational business decision is to ship bloated software now and let Moore's Law clean up after you, rather than spending the additional engineering time to make something lean and correct. This created an entire economy of applications, business models and platform architectures that are, in a meaningful sense, the technological equivalent of an oil-dependent monoculture: they exist not because they represent the best way to solve a problem, but because abundant compute made them the cheapest way to ship a product.

The consequences are visible across the entire stack. Web applications that would have run comfortably on a 2005-era machine now require gigabytes of RAM to render what is essentially styled text. Electron-based desktop apps ship an entire browser engine to display a chat window. Backend services that could be handled by a well-designed program running on a single server are instead distributed across sprawling microservice architectures that consume orders of magnitude more compute. Cory Doctorow's "enshittification" framework is about the user-facing result of this dynamic, but the deeper structural story is about how compute abundance degraded the craft of software engineering itself, well before anyone started worrying about ChatGPT replacing programmers.

This is the Dutch disease pattern operating at the level of the entire technology economy: the booming sector (scale-dependent applications) drew capital and talent away from the non-booming sector (careful engineering, deep technical innovation, computationally parsimonious approaches), and the non-booming sector atrophied accordingly.

But of course the AI case is qualitatively different and the most sorrowful because it resulted in humanity trying to build superintelligence with giant instructable deep learning models.

Human capital crowding-out

Resource curse economies characteristically underinvest in education and human capital development. The relative returns to education are lower in resource-dependent economies because the booming sector doesn't require a broadly educated population. 

The compute version of this story has been playing out for at least a decade, well before the current discourse about AI replacing jobs and destroying university education. The entire trajectory of computer science education shifted from "understand the fundamentals deeply" toward "learn to use frameworks and APIs that abstract over compute." 

There is also a more direct talent-siphoning effect: the IT economy has been pulling the most capable technical minds into a narrow set of activities and away from a much broader set of technical and scientific challenges. 

The voracity effect and race dynamics

In the resource curse literature, there is a so-called "voracity effect": when competing interest groups face a resource windfall, they respond by extracting more aggressively, leading to worse outcomes than moderate scarcity would produce. Rather than investing the windfall prudently, competing factions race to capture as much of it as possible before others do.

I leave this without a direct comment and let the reader have their own pleasure of meditating on this. 

But compute growth is endogenous!

The resource curse in its classical form operates on an exogenous endowment: countries don't choose to have oil reserves, they discover them, and then the political economy warps around that windfall. Much of the pathology comes from the unearned nature of the wealth: it enables rent-seeking, weakens the link between effort and reward, and corrodes institutions.

Compute, by contrast, is endogenously produced through deliberate R&D and engineering investment. Moore's Law was never a law of nature.

Right?

I mean, to me Moore's Law looks like a strong default of any humanity-like civilization. It is created by humans, yes, but in a way that is hard to avoid.

The counterfactual question

The resource curse literature has natural counterfactuals (resource-poor countries that developed strong institutions and diversified economies: Japan, South Korea, Singapore). What's the compute-curse counterfactual? A world where compute grew more slowly and we consequently invested more in elegant algorithms, interpretable models, and formal methods?

It's plausible, but it's also possible that slower compute growth would have simply meant less progress overall rather than differently-directed progress. I don't know. As I said at the beginning, this is speculation.

However, one can trivially note that in a world with less compute abundance, the relative returns to algorithmic cleverness, interpretability research, and formal verification would have been higher, because you couldn't just solve problems by throwing more FLOPS at them. And that may or may not lead to better outcomes in the long run (I am basically leaving here the question of ASI development and just talking about rather “normal” tech and science R&D).

People Actually Thought About This!

Two existing frameworks are close to what I'm describing, but both point the analogy in different directions.

The Intelligence Curse (Luke Drago and Rudolf Laine, 2025) uses the resource curse analogy to argue that AGI will create rentier-state-like incentives: powerful actors who control AI will lose their incentive to invest in regular people, just as petrostates lose their incentive to invest in citizens. This is a compelling argument about the distributional consequences of AGI, but it's about what happens after AGI arrives. The compute curse is about what's happening now, during the process of building toward AGI, and about how the abundance of compute is distorting that process itself.

The Generalized Dutch Disease (Policy Tensor, Feb 2026) is about the macroeconomic effects of the compute capex boom on US manufacturing competitiveness, showing that it operates through the same channels as the fracking boom and the pre-2008 financial boom. This is the closest existing work to what I'm describing, but it stays within the macroeconomic framing (factor prices, unit labor costs, exchange rate effects) and doesn't address the innovation-direction distortion, human capital crowding-out in the intellectual sense, or the AI safety implications.

But: the compute curse may actually be worse than the resource curse

Some of the negative downstream effects of compute abundance don't map onto the resource curse framework directly but are worth including for completeness, since they stem from the same underlying cause (cheap, abundant compute enabling activities that wouldn't otherwise be viable):

  • Social media and attention economy pathologies
  • Surveillance infrastructure
  • Targeted public opinion manipulation
  • And of course many AI safety issues

These are not Dutch disease effects, just straightforward negative externalities of cheap compute. But they suggest that the full accounting of compute abundance's costs is substantially larger than what the resource curse analogy alone would suggest.




Discuss

Self-Aware Confabulation

LessWrong.com News - April 4, 2026 - 16:46

All men are frauds. The only difference between them is that some admit it. I myself deny it.

― H. L. Mencken

I think where I am not, therefore I am where I do not think. I am not whenever I am the plaything of my thought; I think of what I am where I do not think to think.

― Jacques Lacan

Conscience is the inner voice that warns us somebody may be looking.

― H. L. Mencken, again

The Elephant in the Brain by Robin Hanson and Kevin Simler was the piece that first introduced me to the idea. I often felt like the Elephant's takes are overly cynical, and the same goes for other pieces of Hanson's writing. That is, before I read Edward Teach's Sadly, Porn, which is outright misanthropic, and still feels pretty accurate whenever I can make any sense of it.

The core thesis in both of these books is somewhat similar. The Elephant in the Brain says that there's an unconscious part, the Elephant, that does self-interested stuff like status-seeking. To be able to present a prosocial personality, we then have a separate layer that interprets the actions of the Elephant in a good light. It's hard to call this lying because to be believable we have to believe it ourselves first. And our brains are great at pattern matching and forgetting conflicting details.

Sadly, Porn goes the other way, or perhaps just further. You've domesticated the Elephant. It no longer dares to do the self-interested actions. It's afraid of failure. The internal narrator is repurposed from defending our selfish actions to others, into explaining our lack of actions to ourselves. We're lying to ourselves, trying to uphold our own story of having a high status. Other people are mostly required for external approval.

Reading these books was like partially breaking the 4th wall of the narrator. It became self-aware. Of course I could be just imagining a minor enlightenment instead of experiencing it. That would be such a Sadly, Porn-style mental move. Perhaps we could test this by consciously changing the actions of the Elephant? How would one interpret the results instead of retreating to another abstraction level with the lies? It seems really hard to point at something you've done and say "the Elephant did that".

Both of these models started to look a bit lacking after actually internalizing them. For a long time, I was having a really hard time identifying any motivating factors besides physical needs, hedonism, and status-seeking, thinking that anyone doing other things was lying to either themselves or others, or both. I still somewhat hold these views; I just don't think that the lying part is so absolute. People also have aesthetic preferences (read: values) that do not have obvious self-interested purpose.

But as the saying goes, "all models are wrong, some are useful". I've found both of these quite useful in modelling how others behave. And how I behave, too, although disregarding the narrator's explanations is tedious and squeamish work, as convenient answers look very appealing. And all this self-reflection seems to be mostly for entertainment anyway, as for actual results I use more powerful tools.



Discuss

Mean field sequence: an introduction

LessWrong.com News - April 4, 2026 - 10:30

This is the first post in a planned series about mean field theory by Dmitry and Lauren (this post was generated by Dmitry with lots of input from Lauren, and was split into two parts, the second of which is written jointly). These posts are a combination of an explainer and some original research/ experiments.

The goal of these posts is to explain an approach to understanding and interpreting model internals which we informally denote "mean field theory" or MFT. In the literature, the closest matching term is "adaptive mean field theory". We will use the term loosely to denote a rich emerging literature that applies many-body thermodynamic methods to neural net interpretability. It includes work on both Bayesian learning and dynamics (SGD), and work in wider "NNFT" (neural net field theory) contexts. Dmitry's recent post on learning sparse denoising also heuristically fits into this picture (or more precisely, a small extension of it).

Our team at Principles of Intelligence (formerly PIBBSS) believes that this point of view on interpretability remains highly neglected: it should be better understood, and its ideas should be used much more in interpretability thinking and tools.

We hope to formulate this theory in a more user-friendly form that can be absorbed and used by interpretability researchers. This particular post is closely related to the paper "Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity". The experiments are new.

What do we mean by mean field theory

Mean field theory is a vague term with many meanings, but for the first few posts at least we will focus on adaptive mean field theory (see for example this paper, written with a physicist audience in mind). It is a theory of infinite-width systems that is different from the more classical (and, as I'll explain below, less expressive) neural tangent kernel formalism and related Gaussian Process contexts. Ultimately it is a theory of neurons (which are treated somewhat like particles in a gas). While every single neuron in the theory is a relatively simple object, the neurons in a mean field picture allow for an emergent large-scale behavior (sometimes identified with "features") that permits us to see complex interactions and circuits in what is a priori a "single-neuron theory". These cryptic phrases will hopefully be better understood as this post (and more generally as this series) progresses.

Why MFT

We ultimately want to understand the internals of neural nets to a degree that lets us robustly (and ideally, in some sense "safely") interpret why a neural net makes a particular decision. So one might say that this implies we should only care about theories that apply directly to real models: finite width, large depth, etc. While this is fair, any interpretation must ultimately rely on some idealization. When we say "we have interpreted this mechanism", we mean that there is some platonic gadget or idealized model that has a mechanism "that we understand", and the real model's behavior is explained well by this platonic idealization. Thus making progress on interpretability requires accumulating an encyclopedia (or recipe book) of idealizations and simplified models. The famous SAE methodology is based on trying to fit real neural nets into an idealization inherited from compressed sensing (a field of applied math). As we will explain below, if we never had Neel Nanda's interpretation of the modular addition algorithm, we would get it "for free" by applying a mean field analysis to the related infinite-width model. As it is, the two use the same Platonic idealization[1]. Thus at least one view on the use of theory is to see it as a source of useful models that can then be applied to more realistic settings (with suitable modification, and, at least until a "standard model" theory of interpretability exists, necessarily incompletely). Useful theories should be simple enough to analyse mathematically (maybe with some simplifications, assumptions, etc.) and rich enough to illuminate new structure. We think that mean field theory (and its relatives) is well-positioned to take such a role.

Brief FAQ section

"Frequently asked questions about MFT" is a big topic that can be its own post. But before diving into a more technical introduction, we should address a few standard questions which keep cropping up, especially about comparisons between MFT and other better-known infinite-width limits.

  1. Doesn't infinite width mean that we're in the NTK (or more generally a Gaussian process) regime? The first analyses of neural nets at infinite width were in the so-called NTK regime, where in particular the model "freezes" to its prior/ initialization at all but the last layer (which is performing linear regression). This is a remarkably deep picture that is, for example, sufficient to learn MNIST. But approaches in this family exhibit extremely different behaviors from realistic nets (in particular the freezing of early neurons), and they generalize much worse on problems that cannot be solved by some combination of clustering and linear regression (MNIST, as noted, can be). For example these methods learn only memorizing circuits in modular addition (at least in known regimes) and, worse, they are known to require exponential training data and complexity for learning algorithms that are well-known to be learnable by SGD (see for example the leap complexity paper) – this means that these techniques are fundamentally incompatible with these settings (more generally, so-called "compositional" models - ones that have multiple serial steps, which models tend to need depth for - have similar failures in this regime). This can be partially improved by including so-called "correction terms", but these only work when the Gaussian process has good performance by itself, and fail to ameliorate the exponential complexity issues. Note that the Gaussian process picture is useful as a heuristic baseline. In particular it makes some predictions on scaling exponents that have some experimental agreement (and is related to the muP formalism).

    It turns out that the lack of expressivity of the Gaussian limit is due not to infinite width itself but to a certain choice of how to take the infinite limit (and in particular how to scale weight regularization terms in the loss). Different limits and scalings give significantly more expressive behaviors, as we shall see, and we use MFT as a catch-all term for these. (These different limits are also harder in general, at least in terms of exact mathematical analysis: the Gaussian process limit somewhat compensates for its lack of expressivity by having much easier math.)
  2. Isn't mean field theory only a Bayesian learning theory and doesn't that make it unrealistic? In physics contexts (like MFT, Gaussian Process learning, etc.) Bayesian learning is often theoretically easier to deal with, and we'll explain Bayesian learning predictions here (validated by tempering experiments). However a version of mean field for SGD learning exists and is called "Dynamical Mean Field Theory" (DMFT) (it extends the NTK in Gaussian process contexts). Probably more relevantly, Bayesian learning experiments frequently find similar structures to gradient-based methods (and are often easier to analyse). This is particularly well demonstrated in empirical results by the Timaeus group.
  3. Is mean field theory a theory of shallow models? Most existing papers on mean field theory work in the context of 2-layer neural nets (i.e. 2 linear layers, one nonlinear layer). However there is a fully general and experimentally robust extension of the theory to a larger number of layers (see for example this lecture series), and we will look at such models here. In fact mean field theory can model mechanisms of arbitrary depth - but it works best for shallower models (or for shallow mechanisms in deep models), and would likely be less useful for modeling strongly depth-dependent phenomena.
  4. What is a success of mean field theory I should know about? Glad you asked! Most people know about the modular addition task, which was first explained mechanistically by Neel Nanda et al.'s grokking paper. The interpretation is heuristic: it shows that the model exhibits signatures of using a nice and unexpected trigonometric trick. It also interpolates between generalization and memorization in a sudden shift reminiscent of a phase transition. A more ambitious task (one considered too hard to tackle in the interpretability community) would be to understand exactly what the model learns on a neuron-by-neuron basis in any setting that exhibits generalization/ grokking. Since models have inherent randomness (from initialization, and sometimes from SGD), the task is inherently a statistical one: explain the probability distribution on the weights of learned models (at least to a suitable level of precision). This was generally believed to be quite hard, so it comes as a surprise to practitioners of interpretability that there is in fact a context where it has been done.

    In the paper "Grokking as a First-order Phase Transition in Two Layer Networks", Rubin, Seroussi and Ringel constructed a complete explanation (experimentally verified to extremely high precision) for the modular addition network in the Bayesian learning setting (there are some other differences from Neel Nanda's approach, most notably the choice of loss function, but variants of the approach extend to these as well). The distribution is first understood at infinite width, then shown to apply at realistic (but large) width in the appropriate regime. When applying the adaptive mean field theory approach to this task, Fourier modes and the trigonometric mechanism fall out as a natural output of the theory – moreover they are fully explained at the level of the statistical distribution (i.e. we have a complete model of "exactly what each neuron does", to an appropriate degree of precision, understood in a statistical physics sense). Of particular interest, the model explains a grokking-like phase transition between memorization (equivalently, a Gaussian process-like behaviour) and generalization (inherently mean field) and predicts the data fraction at which it happens (this is a Bayesian learning analog of predicting the distribution of when grokking happens in SGD-trained neural nets). The phenomenon is a genuine phase transition in the thermodynamic sense.
  5. Are real models in the mean field regime or the Gaussian process regime, or something else? This is an interesting question, whose answer is "this question doesn't make sense". The distinction between regimes applies to infinite width nets, i.e. to a totally non-standard setting. One can prove rigorous results with the gist that if the width is sufficiently enormous (with some giant bound) compared to the training data, the model is guaranteed to learn in one of these two regimes. However, no real models are that enormous. Instead, some phenomena and some mechanisms can be seen (experimentally or theoretically) to extend from infinite nets to nets of finite width. Sometimes these look more like mean field phenomena, sometimes they look like Gaussian process phenomena. For example in some sense MNIST is "GP-like" (GP stands for Gaussian process). Circuits in modular addition are, as it turns out, entirely explained by the MFT limit as we've explained above.
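The trigonometric trick mentioned above can be checked numerically in a few lines. This is a toy sketch with a hypothetical modulus p = 7 and a single Fourier frequency; the real networks superpose several frequencies, and this is only the algebraic identity they exploit, not the trained model itself:

```python
import numpy as np

p = 7          # modulus (hypothetical small example)
k = 1          # a single Fourier frequency; real networks use several
w = 2 * np.pi * k / p

def mod_add_logits(a, b):
    # cos(w(a+b-c)) is maximized exactly when c == (a+b) mod p.
    # Angle-addition identities show only products of cos/sin
    # embeddings of a and b are needed:
    # cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc)
    c = np.arange(p)
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    return cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)

# The argmax over c recovers modular addition for every input pair.
for a in range(p):
    for b in range(p):
        assert np.argmax(mod_add_logits(a, b)) == (a + b) % p
```

The point of the mean field analysis is that this mechanism, and the distribution of neurons implementing it, falls out of the theory rather than having to be reverse-engineered.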
Introduction to the theory

The background (and the foreground)

In physics, one often looks at systems with a large, stable background. A planet vs. a sun, an electron vs. a proton, a weakly interacting observer vs. a large system being observed. In these settings the "background" is the large system and the "foreground" or "test system" is the small system being studied. In these cases the background system may be fixed, or it may be undergoing some motion (like the sun moving around the galaxy's center), but the important idealization is that it does not react to the observer/ test system. In fact, the earth is applying a gravitational pull to the sun (and famously in quantum mechanics, observations always impact a system at a quantum level). But these "reverse" effects are small, so to a good approximation we can treat the sun as doing its own "stable" thing while earth is undergoing physics that depend strongly on the sun.

Self-consistency

While typically the large "background" is a cleanly separate system from the small test system of the observer, it is sometimes extremely useful to treat the test system as a tiny piece of the background. So: the large background system may be a cup of water and the small test system may be a tiny bit of water at some location. Here while technically the full cup includes the tiny "test" bit, the large-scale behaviors (waves etc.) in the water don't really care to relevant precision if the test bit is changed or removed (at least if it's tiny enough). But the tiny bit of water definitely cares about the large-scale behaviors (waves, vortices or flows, etc.), to the extent that bits of water care about things.

Similarly (and in a closely related way), "the economy" is a giant system that includes your neighborhood bakery. The bakery can be viewed as a small "test system": it is affected by the economy. If property prices go up or the economy tanks, it might close. But the economy is not (at least to leading order) affected by this bakery. It is perhaps affected by the union of all bakeries in the world, but if this particular bakery closes due to some random phenomenon (e.g. the lead baker retires), this won't massively impact the economy.

This point of view is remarkably useful, because it introduces a notion of "self-consistency".

Self-consistency when applied in this context comes from the following pair of intuitions:

  1. the behavior of each small component is (statistically) determined by the background
  2. the behavior of the background is the sum of its small components.

If both of these assumptions are true, then these two observations (when turned into equations) are usually enough to fully pin down the system. Indeed, you have two functional relationships[2]:

Writing $\phi_i$ for the (statistical) behavior of the $i$-th component and $b$ for the background, the pair reads

$$\phi_i = F[\,b\,], \qquad b = \frac{1}{N}\sum_{i=1}^{N}\phi_i,$$

and demanding that both relationships hold simultaneously pins down the self-consistent solution.
"<"; } mjx-c.mjx-c1D437.TEX-I::before { padding: 0.683em 0.828em 0 0; content: "D"; } mjx-c.mjx-c2B::before { padding: 0.583em 0.778em 0.082em 0; content: "+"; } mjx-c.mjx-c1D434.TEX-I::before { padding: 0.716em 0.75em 0 0; content: "A"; } mjx-c.mjx-c1D435.TEX-I::before { padding: 0.683em 0.759em 0 0; content: "B"; } mjx-c.mjx-c2208::before { padding: 0.54em 0.667em 0.04em 0; content: "\2208"; } mjx-c.mjx-c2124.TEX-A::before { padding: 0.683em 0.667em 0 0; content: "Z"; } mjx-c.mjx-c1D443.TEX-I::before { padding: 0.683em 0.751em 0 0; content: "P"; } mjx-c.mjx-cD7::before { padding: 0.491em 0.778em 0 0; content: "\D7"; } mjx-c.mjx-c7B::before { padding: 0.75em 0.5em 0.25em 0; content: "{"; } mjx-c.mjx-c7D::before { padding: 0.75em 0.5em 0.25em 0; content: "}"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c1D461.TEX-I::before { padding: 0.626em 0.361em 0.011em 0; content: "t"; } mjx-c.mjx-c1D452.TEX-I::before { padding: 0.442em 0.466em 0.011em 0; content: "e"; } mjx-c.mjx-c211D.TEX-A::before { padding: 0.683em 0.722em 0 0; content: "R"; } mjx-c.mjx-c1D451.TEX-I::before { padding: 0.694em 0.52em 0.01em 0; content: "d"; } mjx-c.mjx-c1D457.TEX-I::before { padding: 0.661em 0.412em 0.204em 0; content: "j"; } mjx-c.mjx-c70::before { padding: 0.442em 0.556em 0.194em 0; content: "p"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c73::before { padding: 0.448em 0.394em 0.011em 0; content: "s"; } mjx-c.mjx-c1D44A.TEX-I::before { padding: 0.683em 1.048em 0.022em 0; content: "W"; } mjx-c.mjx-c1D444.TEX-I::before { padding: 0.704em 0.791em 0.194em 0; content: "Q"; } mjx-c.mjx-c1D43E.TEX-I::before { padding: 0.683em 0.889em 0 0; content: "K"; } mjx-c.mjx-c1D449.TEX-I::before { padding: 0.683em 0.769em 0.022em 0; content: "V"; } mjx-c.mjx-c2211.TEX-S1::before { padding: 0.75em 1.056em 0.25em 0; content: "\2211"; } mjx-c.mjx-c33::before { padding: 0.665em 0.5em 0.022em 0; content: "3"; } 
mjx-c.mjx-c1D6FC.TEX-I::before { padding: 0.442em 0.64em 0.011em 0; content: "\3B1"; } mjx-c.mjx-c74::before { padding: 0.615em 0.389em 0.01em 0; content: "t"; } mjx-c.mjx-c68::before { padding: 0.694em 0.556em 0 0; content: "h"; } mjx-c.mjx-c65::before { padding: 0.448em 0.444em 0.011em 0; content: "e"; } mjx-c.mjx-c78::before { padding: 0.431em 0.528em 0 0; content: "x"; } mjx-c.mjx-c2061::before { padding: 0 0 0 0; content: ""; } mjx-c.mjx-c28.TEX-S1::before { padding: 0.85em 0.458em 0.349em 0; content: "("; } mjx-c.mjx-c29.TEX-S1::before { padding: 0.85em 0.458em 0.349em 0; content: ")"; } mjx-c.mjx-c1D458.TEX-I::before { padding: 0.694em 0.521em 0.011em 0; content: "k"; } mjx-c.mjx-c1D463.TEX-I::before { padding: 0.443em 0.485em 0.011em 0; content: "v"; } mjx-c.mjx-c1D45D.TEX-I::before { padding: 0.442em 0.503em 0.194em 0; content: "p"; } mjx-c.mjx-c1D70E.TEX-I::before { padding: 0.431em 0.571em 0.011em 0; content: "\3C3"; } mjx-c.mjx-c21A6::before { padding: 0.511em 1em 0.011em 0; content: "\21A6"; } mjx-c.mjx-c6C::before { padding: 0.694em 0.278em 0 0; content: "l"; } mjx-c.mjx-c77::before { padding: 0.431em 0.722em 0.011em 0; content: "w"; } mjx-c.mjx-c6D::before { padding: 0.442em 0.833em 0 0; content: "m"; } mjx-c.mjx-c64::before { padding: 0.694em 0.556em 0.011em 0; content: "d"; } mjx-c.mjx-c69::before { padding: 0.669em 0.278em 0 0; content: "i"; } mjx-c.mjx-c75::before { padding: 0.442em 0.556em 0.011em 0; content: "u"; } mjx-c.mjx-c67::before { padding: 0.453em 0.5em 0.206em 0; content: "g"; } mjx-c.mjx-c2265::before { padding: 0.636em 0.778em 0.138em 0; content: "\2265"; } mjx-c.mjx-c1D441.TEX-I::before { padding: 0.683em 0.888em 0 0; content: "N"; } mjx-c.mjx-c2D::before { padding: 0.252em 0.333em 0 0; content: "-"; } mjx-c.mjx-c6E::before { padding: 0.442em 0.556em 0 0; content: "n"; } mjx-c.mjx-c79::before { padding: 0.431em 0.528em 0.204em 0; content: "y"; } mjx-c.mjx-c63::before { padding: 0.448em 0.444em 0.011em 0; content: "c"; } 
mjx-c.mjx-c1D439.TEX-I::before { padding: 0.68em 0.749em 0 0; content: "F"; } mjx-c.mjx-c25FB.TEX-A::before { padding: 0.689em 0.778em 0 0; content: "\25A1"; } mjx-c.mjx-c1D466.TEX-I::before { padding: 0.442em 0.49em 0.205em 0; content: "y"; } mjx-c.mjx-c1D45F.TEX-I::before { padding: 0.442em 0.451em 0.011em 0; content: "r"; } mjx-c.mjx-c1D407.TEX-B::before { padding: 0.686em 0.9em 0 0; content: "H"; } mjx-c.mjx-c1D41E.TEX-B::before { padding: 0.452em 0.527em 0.006em 0; content: "e"; } mjx-c.mjx-c1D41A.TEX-B::before { padding: 0.453em 0.559em 0.006em 0; content: "a"; } mjx-c.mjx-c1D41D.TEX-B::before { padding: 0.694em 0.639em 0.006em 0; content: "d"; } mjx-c.mjx-c20::before { padding: 0 0.25em 0 0; content: " "; } mjx-c.mjx-c1D7CE.TEX-B::before { padding: 0.654em 0.575em 0.01em 0; content: "0"; } mjx-c.mjx-c1D7CF.TEX-B::before { padding: 0.655em 0.575em 0 0; content: "1"; } mjx-c.mjx-c1D43C.TEX-I::before { padding: 0.683em 0.504em 0 0; content: "I"; } mjx-c.mjx-c1D442.TEX-I::before { padding: 0.704em 0.763em 0.022em 0; content: "O"; } mjx-c.mjx-c1D408.TEX-B::before { padding: 0.686em 0.436em 0 0; content: "I"; } mjx-c.mjx-c1D427.TEX-B::before { padding: 0.45em 0.639em 0 0; content: "n"; } mjx-c.mjx-c1D429.TEX-B::before { padding: 0.45em 0.639em 0.194em 0; content: "p"; } mjx-c.mjx-c1D42E.TEX-B::before { padding: 0.45em 0.639em 0.006em 0; content: "u"; } mjx-c.mjx-c1D42D.TEX-B::before { padding: 0.635em 0.447em 0.005em 0; content: "t"; } mjx-c.mjx-cA0::before { padding: 0 0.25em 0 0; content: "\A0"; } mjx-c.mjx-c2B.TEX-B::before { padding: 0.633em 0.894em 0.131em 0; content: "+"; } mjx-c.mjx-c2248::before { padding: 0.483em 0.778em 0 0; content: "\2248"; } mjx-c.mjx-c38::before { padding: 0.666em 0.5em 0.022em 0; content: "8"; } mjx-c.mjx-c34::before { padding: 0.677em 0.5em 0 0; content: "4"; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-c.mjx-c1D453.TEX-I::before { padding: 0.705em 0.55em 0.205em 0; content: "f"; } 
mjx-c.mjx-c1D45B.TEX-I::before { padding: 0.442em 0.6em 0.011em 0; content: "n"; } mjx-c.mjx-c4F::before { padding: 0.705em 0.778em 0.022em 0; content: "O"; } mjx-c.mjx-c52::before { padding: 0.683em 0.736em 0.022em 0; content: "R"; } mjx-c.mjx-c61::before { padding: 0.448em 0.5em 0.011em 0; content: "a"; } mjx-c.mjx-c41::before { padding: 0.716em 0.75em 0 0; content: "A"; } mjx-c.mjx-c4E::before { padding: 0.683em 0.75em 0 0; content: "N"; } mjx-c.mjx-c44::before { padding: 0.683em 0.764em 0 0; content: "D"; } mjx-c.mjx-c58::before { padding: 0.683em 0.75em 0 0; content: "X"; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } 
_::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! 
important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } Putting these together, we have the combined "self-consistency" equation:

which means that the background field satisfies a fixed-point equation for the composed function. It so happens that in many cases of interest, this equation has a unique solution. A classic example of a self-consistency equation is supply-demand equilibrium: here the background is a single number (the price of a good) and the test system is the willingness of a single consumer to buy, or of a single producer to sell, as a function of price (the actual "tiny components" consisting of individual consumers/producers are abstracted out, and the curve represents the average incentive).
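As a numerical cartoon (all functional forms here are invented for illustration), finding such an equilibrium amounts to iterating the self-consistency map until the background stops changing:

```python
# Toy self-consistency: the price p is the "background"; each agent's
# buy/sell decision is the "foreground". All functional forms invented.

def demand(p):
    return 10.0 - p        # average willingness to buy at price p

def supply(p):
    return 2.0 * p         # average willingness to sell at price p

def update(p, step=0.1):
    # Nudge the price toward balancing excess demand; the fixed point of
    # this map is the equilibrium where demand(p) == supply(p).
    return p + step * (demand(p) - supply(p))

p = 1.0
for _ in range(200):
    p = update(p)

print(round(p, 4))   # converges to 10/3, where 10 - p == 2p
```

The same iterate-until-fixed-point structure shows up whenever the composed background-foreground map is a contraction.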

Of the above, assumption 1 is the most problematic. Thinking of each component as being determined by some "large-scale" stable system needs to be interpreted appropriately (in particular the relationship is often statistical: for example the number of bakeries in a given neighborhood fluctuates as people retire, move, etc., even if "the economy" is held constant; similarly, every bit of the sun reacts to magnetic and gravitational fields from other bits, but only in a statistical or thermodynamic sense). Sometimes local or so-called "emergent" effects break this directional relationship (and many interesting thermodynamic systems, such as the 2-dimensional Ising model, are interesting precisely in such contexts). But surprisingly often (at least with an appropriate formalism) the approximation of the foreground as fully determined by the background (in a statistical sense) is robust. For example, if we are modeling the sun, viewing the "background system" too coarsely (as just the mass, electromagnetic field, and temperature, say, of the entire sun) is insufficient. Instead, we can view the "background system" as a giant union of many local systems, perhaps comprising chunks a few meters across. These are still "large" in the sense of being much larger than an atom (or a microscopic chunk), but studying their behavior (in an appropriate abstraction) offers sufficient resolution to model the sun extremely well. Similarly, we can't apply a single supply-demand curve to the entire economy (bread costs different amounts in different places). But in appropriate contexts (for fungible products like oil, and on a "local economy" level where the economy is roughly uniform but not dominated by a single station, for example) self-consistency is a pretty good model.

In many settings, the question of how well "assumption 1" above holds is related to a notion of connectedness. In the sun's magnetic plasma, the magnetic field experienced by a particle is accumulated over billions and billions of nearby particles, so the graph of interactions is extremely connected. In an oil economy, each consumer can typically choose between dozens of nearby stations reachable by car. Other settings, however (like the Ising model, or markets for rare and hard-to-transport goods), are not as well modeled by pure self-consistency.

In physics, systems that are well-modeled by a self-consistency equation (coupled background and foreground systems) are generally called mean-field settings. A big triumph of statistical physics is making situations with local or emergent phenomena "behave as well as" mean-field theories – renormalization is a fundamental tool here, and most textbooks on renormalization from a statistical-physics view start with a discussion of mean-field methods. But settings that are directly mean-field (for example due to being highly connected or high-dimensional) are particularly nice and easy to study.

Neural nets and mean field

Neural nets are physical systems. This is a vacuous statement – anything that has statistics can be studied using a physics toolkit (and in many ways statistical physics is just statistics with different terminology). Indeed, real neural nets are immensely complex, and if there is some sense in which they can be locally decomposed into background-foreground consistencies, these must themselves be immensely complex and likely require sophisticated tooling to identify (this is one of the reasons why we are running an agenda on renormalization).

But it turns out that in some settings and architectures neural nets are extremely well-modeled by systems with high connectivity – and the reason is, naively enough, precisely the fact that they are highly connected (often fully connected) at the neuron level (note that architectures that aren't fully connected – e.g. CNNs – sometimes still have properties that make them "highly connected" from a physical point of view).

The mean-field background and foreground for a neural net

In neural net MFT the foreground (or "system"/ "observer") abstraction is a neuron. This is typically a coordinate index of some layer.

The important "background" thing that each neuron "carries" is what is called an activation function. This is a function on data: given any input x, partially running the model on x returns a vector of activations, and the neuron's activation function is the i'th component of this vector. This function is the thing that the neuron contributes to the "background field" of the neural net.[3]
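Concretely, for a hypothetical two-layer toy model (weights and sizes invented here), neuron i's activation function is just the i'th coordinate of the partially-run model:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 16))      # layer-1 weights: 2-d input, 16 neurons

def activation_fn(i, x):
    # h_i(x): partially run the model (here, just layer 1) and read off
    # the i'th coordinate of the resulting activation vector.
    return np.tanh(x @ W1)[..., i]

x = np.array([0.5, -1.0])
print(activation_fn(3, x))         # neuron 3's contribution at input x
```

Viewed this way, each neuron is a sample from a distribution over functions, and the model's output aggregates these samples.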

Now if there are lots of neurons, each neuron's activation function reacts to a background generated by the other neurons: removing the neuron in this limit doesn't change the loss by much, so the background determines each neuron's behavior as a statistical distribution. Conversely, the background itself is composed of individual "foreground" neurons. The loop:

background ↦ neuron distribution ↦ background

must close, i.e. be self-consistent. Making sense of this loop is the key content of mean field theory of neural nets.
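A minimal numerical sketch of closing this loop (everything here is invented for illustration: 1-d inputs, frozen tanh neurons, and cyclic single-neuron refits against the background generated by the rest):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 64)              # inputs
y = np.sin(2 * X)                       # target function
n = 256                                 # number of neurons

w = rng.normal(size=n)                  # input weights (frozen here)
b = rng.normal(size=n)
a = np.zeros(n)                         # output weights, to be learned
feats = np.tanh(np.outer(X, w) + b)     # column i is neuron i's function h_i

for _ in range(50):
    background = feats @ a
    for i in range(n):
        # each neuron reacts only to the background made by the others
        others = background - a[i] * feats[:, i]
        a[i] = feats[:, i] @ (y - others) / (feats[:, i] @ feats[:, i])
        background = others + a[i] * feats[:, i]

loss = np.mean((feats @ a - y) ** 2)
print(loss)                             # residual loss after the loop closes
```

Each neuron's update depends only on the aggregate field, not on any particular other neuron – exactly the mean-field structure described above.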

In later installments we'll explain a bit more about the loop and show some examples of it working (or not). You can also see the original linked paper about the Curse of Detail for a more physics-forward view of this.

Experimental setting and pretty picture

We'll close with a toy example of "self-consistency", which is visually satisfying.

In this setting we look at a 2-layer model that takes a two-dimensional input variable and is trained on the target at large width and on infinite data. The activation function is a bounded sigmoid-like function (the relu version of tanh). Each neuron in layer 1 is a function that depends only on a 2-dimensional row of the weight matrix, so the associated "test" field or particle can be plotted on a 2-dimensional graph. When we plot all of these together we get a good picture of the distribution of single-neuron functions that combine to form the background system:

The neurons above were trained jointly in a way that would allow them to interact.

It has a nice clover-leaf-like structure (it will reappear later when we look at continuous xors – a multi-layer setting where mean field performs compositional computation; already in this simple setting, the fact that the cloud of neurons is a "shaped" distribution rather than a flat Gaussian puts us solidly outside the Gaussian process regime). Now we can empirically measure how a single randomly initialized "foreground" neuron reacts to the background generated by this model. To do this, we train 2048 iid single-neuron models on the resulting background from the fully trained model.[4] When we combine the resulting 2048 neurons into a new model, we see that it looks exactly the same as the background, and when we compute its associated function, we get a very similar loss.

Each neuron in this picture was trained in a fully iid way, without interacting with any neuron, simply by "reacting to the background", i.e. learning the task in combination with the "blue" background above.
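The "react to the background" step can be sketched as follows (a hypothetical minimal version: one fresh tanh neuron, trained by gradient descent on the residual target in the spirit of footnote 4; the target function and the background model here are invented stand-ins, not the post's actual setup):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))           # 2-d inputs
y = X[:, 0] * X[:, 1]                   # stand-in target

# Fake a frozen, partially trained "background" model F so the example is
# self-contained: a small random-feature fit that underfits the target.
W_bg = rng.normal(size=(2, 4))
H_bg = np.tanh(X @ W_bg)
a_bg, *_ = np.linalg.lstsq(H_bg, y, rcond=None)
F = H_bg @ a_bg

# One fresh foreground neuron trained on the residual y - F, i.e. it
# learns the task "in combination with" the frozen background.
w = rng.normal(size=2)
a = 0.0
lr = 0.05
initial = np.mean((y - F) ** 2)
for _ in range(2000):
    h = np.tanh(X @ w)
    err = a * h - (y - F)               # the neuron's own prediction error
    a -= lr * np.mean(err * h)
    w -= lr * (X.T @ (err * a * (1 - h ** 2))) / len(X)

final = np.mean((a * np.tanh(X @ w) - (y - F)) ** 2)
print(initial, final)
```

Repeating this independently for many fresh neurons, with no interaction between them, is what produces the orange cloud below.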

Note that this isn't a property that comes "for free". If we were to use the wrong background (for example the more Gaussian-process-like model here), then samples of the foreground would fail to align with the background.

Blue is background, orange is foreground (each orange neuron trained independently in reaction to background).

The case of 2-layer networks is special: neuron functions are particularly simple to characterize, and the mean field has better properties (it's not "coupled"). But we'll see that deeper nets can still be analyzed using this language, and even using empirical methods we can get cleaner pictures of how they learn and process representations.

In the next post, we will explain the physics behind these experiments and the experimental details of the models (github repo coming soon).


  1. ^

    Technically they differ on whether they use the "pizza" vs. "clock" mechanisms, but the two idealizations are related, and both the mean field and the realistic setting can be modified to make use of either.

  2. ^

    Below, f and b should generally be understood as "statistical" functions: job choice is, perhaps, a probabilistic function depending on the economy, which includes both demand/ markets but also supply/ people's interests; conversely "the economy" is the average of production over the distribution of jobs.

  3. ^

    Technicalities. Depending on the situation, the activation function can be viewed either as a function on a finite training set or on an infinite "set of all possible inputs", usually a large Euclidean space (example: an MNIST input is a vector of pixel values). Unless we're working with finite training data, this is a priori an infinite-dimensional gadget; and worse, the thing that is actually summed over neurons – the analog of the "market" or "background field" – is nonlinear in this object[5]. There is also a subtlety here about SGD vs. Bayesian learning which I won't get into. But in mean-field settings that admit generalization (or for a finite number of inputs), this background is effectively dominated by a small set of "relevant" directions.

  4. ^

    Technical note: each single-neuron model is trained on the difference where is the trained model.

  5. ^

    In fact it is quadratic: the thing that sums over neurons is the "external square" of the neuron function, which is a function of a pair of inputs: knowing this sum fully determines the dynamics up to rotational symmetry, even for a finite-width model (it's often called the "data kernel" but is used very differently from the Gaussian process kernels, which do depend on an infinite-width assumption and lose a lot of information in finite-width and mean-field contexts).


