Вы здесь

Сборщик RSS-лент

Sympathy for both sides of the egregious misalignment debate

Новости LessWrong.com - 12 июня, 2026 - 19:26

On one side of this debate is Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.

On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll probably say that it would come about through race dynamics, careless programmers, bad actors, etc., as opposed to the simpler Yudkowsky & Soares story of “we get egregious misalignment and scheming because nobody has the foggiest idea how to avoid that”.

Here’s my brief idiosyncratic take on this debate. I think BOTH of the following are true:

  • (1) If you really think carefully about the properties of ASI, you really do find good reasons to strongly expect it to be egregiously misaligned, scheming, and ruthless, in the absence of yet-to-be-invented breakthrough technical alignment ideas.
  • (2) If you really think carefully about the properties of current LLMs, you really do find good reasons to think that existing technical alignment techniques are adequate now, and may well continue to be adequate in the future.

So then here are three (caricatured) positions:

My position:

(1) and (2) are both totally true. And we can reconcile them by saying that LLMs won’t scale to ASI.

Yudkowsky & Soares’s position [caricatured]:

(1) is totally true. We know this with great confidence, having spent decades thinking about it.

So it follows that (2) must be wrong or irrelevant.

Why is (2) wrong or irrelevant? Hard to say! There’s no ASI yet, and nobody knows in detail how it will appear. Sometimes it’s easier to predict what happens eventually than the detailed path. An ice cube in warm water will melt eventually, but don’t ask me to predict how many seconds it will take to melt, etc.

So anyway, one possibility is that (2) is wrong because LLMs will kinda ‘wake up’, or something, when the core pieces of true intelligence finally come together. And then their behavior would change drastically for the worse. And maybe we’re already starting to see glimmers of that in existing LLMs?

Or another possibility [cf. Eliezer tweet] is that LLMs will invent non-LLM ASI. And then (2) will be simply irrelevant!

…Or something else! Again, we don’t know! But we do know that (1) is definitely right.

LLM people’s position [caricatured]:

(2) is totally true. We know this with great confidence, because we are LLM experts and we have thought about these alignment plans in great detail, including matching our theories against real-world data.

So it follows that (1) must be incorrect.

Why is (1) incorrect? I don’t really know! Man, I read Yudkowsky and Soares, and it’s all these words, words, words, and I’m reading along and trying to match those words to my knowledge of LLMs and it just doesn’t make any damn sense. I can and will try to respond to their points in detail, but honestly the core issue is that they’re guilty of head-in-the-clouds armchair theorizing gone off the rails.

Conclusion

…So I think that both sides of the debate are basically coming from a reasonable and sympathetic place, with a big kernel of truth.

Bonus section: Further commentary

…That said, I can still complain at both sides!

My “true objection” to Yudkowsky & Soares:

For the record, my “true objection” to Yudkowsky & Soares is that if we’re talking about ASI, then LLMs are basically irrelevant and we shouldn’t even be talking about LLMs at all. And meanwhile, their plans are misguided because delaying ASI is possible on the margin but mostly hopeless, although I guess I’m happy that they’re trying anyway. Meanwhile, my hunch is that they’re overstating the intractability of finding that technical alignment breakthrough, although I haven’t found it yet, so I guess time will tell.

My within-frame complaint at Yudkowsky & Soares:

…But I’ll put that aside for the sake of argument, and bring up a narrower complaint within their frame:

I think their suggestions that LLMs may become completely egregiously misaligned in the future via … umm … the ‘true core of intelligence’ coming together, and ‘waking up’? Like Skynet or something?? That was mean, sorry, but in any case, I don’t think this idea hangs together either theoretically or empirically.

For the former (theory), see my discussion of the extreme weirdness of the LLM pretraining algorithm in Foom & Doom §2.3.2. I think Yudkowsky & Soares have not internalized how weird this type of learning algorithm is, and if they had, then Yudkowsky would not be occasionally suggesting that we should think of an LLM as an actress playing characters.

For the latter (empirical), I think the most fair assessment is that current LLMs are nice and obedient in some contexts, and LLMs are mean, defiant, and just plain weird in other contexts. You can straightforwardly go from that observation to “maybe there will be egregious misalignment and scheming in the future”, but not to “there will definitely be egregious misalignment and scheming in the future, absent new breakthrough technical alignment ideas”.

I think that if Yudkowsky & Soares stopped treating current LLMs as direct evidence for technical alignment being definitely completely unsolved, and instead treated it as either mixed evidence or entirely off-topic, then their public messaging would come across to policymakers and general audiences as somewhat more convoluted and confusing. But I think it would be more accurate. Oh well.

My “true objection” to LLM people:

For the record, my “true objection” to the LLM people is that I don’t really care about anything they say, because I’m working on the ASI alignment problem, and LLMs won’t scale to ASI.

(I’m overstating a bit. I’m generally happy for people to work on making LLM-world a place of wisdom and goodness, especially because LLM-world is the world in which ASI will someday be invented.)

My within-frame complaint at LLM people:

…But I’ll put that aside for the sake of argument, and bring up a narrower complaint within their frame:

I think the LLM people are not pricing in the predictable consequences of ever more RLVR and/or the predictable consequences of ever more “real” open-ended continual learning, should the latter ever be solved (which I don’t think it will be, but never mind that).

In other words, lots of LLM-focused people say “LLMs will eventually be able to do the things that human society did over the last 5000 years: open-endedly and autonomously build new knowledge and ideas on top of new knowledge and ideas, in an endless tower, with no need for human ground truth anywhere in that process. And how exactly will the future LLMs do that? Uhh, I don’t know, people are working on it, they’ll probably figure something out.”

…And bam, that blank spot in the map is where the pea gets hidden under the thimble.

Because if you want the LLMs to gain ever more knowledge, whether through a perpetual RLVR loop or some other yet-to-be-invented type of continual learning, there has to be some kind of ground truth, or else it will go off the rails into nonsense. And that ground truth, whatever it is, will basically amount to an objective function (a.k.a. cost function, reward function, whatever). And when the LLM updates enough on that ground truth, then whatever human-niceness that the LLM inherited from pretraining will get diluted away in favor of ruthless maximization of that objective function.

(See also: Why we should expect ruthless sociopath ASI.)

Thanks Zack M. Davis for a brief discussion that inspired this post.



Discuss

The Uncertainty That Matters Isn't Fundamental

Новости LessWrong.com - 12 июня, 2026 - 19:23

I'm on board with a lot of Fundamental Uncertainty. Even some of the stuff that initially feels like a disagreement turns out not to be so. For example, in chapter 8, Gordon writes:

Over the course of the previous chapters, I've made the case that truth is fundamentally uncertain. It's not, as many believe, something fixed and eternal, nor is it a matter of pure opinion. Instead, the relative truth we know is grounded, not by absolute truth alone, but by our need for accurate world models to achieve the goals we care about.

My first thoughts upon reading this kind of thing are along the lines of "Wtf does it mean for "truth" to be "eternal"? Could you taboo that and give me an example of what you'd use it for?" -- which is exactly turning Gordon's take on truth in on itself, so point taken. Truth is grounded in care, and fundamentally uncertain. I'm with ya here. True enough.

It is because of this agreement that I find chapter 8: "Why does fundamental uncertainty matter?" to be the chapter that matters. My answer, however, is quite different:

It does not, and it can not.

By virtue of talking about the uncertainty which can not be reduced even in principle, we're talking about the part we can't do anything about. The part that there's no reason to care about, because nothing we do can change it.

Ah, but it's important to know that we can't change it! Right? Chapter 8 isn't about resolving the unresolvable, it's about knowing what is unresolvable so we don't waste time trying.

Except this too, we cannot know.

The problem with the "problem of the criterion" is that should we get the criterion wrong, it infects everything. We cannot know that our residual uncertainty is fundamental, because fundamental uncertainty applies here as well. We can't even fall back to "well, sure enough", because to the extent that there's little left, there's little left. To the extent that we have a lot of fundamental uncertainty, we're fundamentally unsure how much and where it is biting us.

We can never know that the "unresolvable" residual is fundamental, and therefore can never justify relaxing into sufficient certainty that the uncertainty is fundamental. If we keep going, we never know what we might learn that we believed to be unlearnable.

Taken all the way, fundamental uncertainty undoes its own relevance: Fundamental Uncertainty stops one step short.

The examples given in chapter 8 are good examples which look fundamentally uncertain, but which actually resolve for those who care to look.

Gordon writes:

Yet there are times when fundamental uncertainty blocks us from finding truth. As we'll explore in the coming chapters, we get into debates about what words like "man" and "woman" really mean, fight over whether it's right or wrong to eat meat, and struggle to know what's best to do, not because we can't reason carefully about these topics, but because fundamental uncertainty limits how precisely we can reason about them. As we'll see, no matter how smart or wise we are, fundamental uncertainty ultimately stands in our way of knowing all that we wish to know.

I don't think anyone is up against any fundamental limits of precision. It is, I argue, insufficiently careful -- or in the case of culture war, carefully misleading -- reasoning all the way down.

Let's start where Gordon starts, with definitions:

But few people get into fights about whether road maps and driving are better than trail maps and hiking. That's because there's relatively little at stake in such a situation, and it's similar for the definitions of most words. For example, herbal tea isn't technically "tea" unless it contains leaves of Camellia sinensis: it's a tisane. Even that is debatable, though, because "tisane" comes from a Greek word for barley.

The stakes don't predict the conflict. My brother is celiac, for example, so there's considerable stake in whether he's drinking a "tea" or a "tisane". However, never has he yelled at anyone for getting terms wrong and calling a tisane a "tea". When he has any inkling a drink might contain gluten, he asks "Does this drink contain gluten?". The rationalist move of "tabooing" words isn't limited to us rationalist dorks. When gluten matters, people will ask about gluten, and not risk miscommunication. Both tabooing and code switching between specific and loose meanings of words like "tea" are easy moves, available to anyone.

Yet the culture war battle over definitions has not resolved with "just taboo your terms!" spreading like wildfire. Why not? Why has "Adult human males who identify as adult human females are humans that identify as female!" not caught on? Why hasn't "The union of one man and one woman, inherently oriented toward procreation and family formation, is between a man and a woman"? Is it the extra syllables?

I jest. No one disagrees. No one cares to push back -- at least, no one who pushes back against the side using those definitions pushes back once stated plainly like that. So why does this not resolve everything? If celiacs can allow you to call a tisane a "tea" and just not drink it, if we can use the same spelling and pronunciation to refer both to wooden whacking sticks and flying rats, why doesn't that happen here?

Gay men are already counted as "one of the girls" when it comes to girls night out, because in the ways that matter, often the fit is close enough. Why not say "Oh, it's women's night" with the implicit definition "fits in with these women in this way" when that's the appropriate definition, and use a different definition when it comes to sanctioning MMA fights? Why no satisfaction with the neologism "transwoman", which is sufficiently short and descriptive? While conservatives might quibble that the term for an adult human male identifying as a woman should be "transman" rather than "transwoman", that fight would be smaller, and everyone keeps reaching for the bigger fight. Who is going to fight with "Black Lives Matter Too", and why did neither side make that their slogan?

We could have clarity, at the cost of one more word. And yet, we spend many in ways that bury the shared factual agreement. We could use words in ways that account for context, as we normally do, yet here we do not. We act here as if words have One True Meaning, as if winning the fight over "What we say the definition is" imbues special significance to anything which meets the official definition.

If the problem were about fundamental uncertainty, why would we not use the solutions we use everywhere else when clarity is valued? Why would we do this, if there weren't already shared meaning in the word, that we're fighting over?

The explanation that predicts the observed behavior, is that culture war fights aren't about fundamental uncertainty at all. Culture war is about attempts to create artificial uncertainty, so that we can sneak in what we can't honestly argue for. The word means something, we all know what it is, and this is why there is perceived value in claiming the existing word instead of coining new terms.

Honest disagreement does not lead to fights. "I think the most important aspect of whether something is a 'tea' is just whether it's an infusion of plant stuff in water", "The presence of gluten is pretty important to me", "Oh, well in that context, yeah".

"Black lives matter", "Well yeah, all lives matter", "I'm glad you agree, but not everyone does so I feel it's important that we acknowledge that black lives matter too", "Oh. Fair. Yeah, black lives matter too".

How hard is this, really?

The cost of new words, or unclear words, is minimal when no one is trying to thumb the scale. We don't get pissy at each other when we recognize that we're all doing our honest best. If someone is getting upset, and the response isn't immediate de-escalation with new words in a visible "hands off the scale" move, that's a pretty strong indication that at least one side is trying to get away with not playing it straight.

Both sides know damn well they're fighting over connotations in a game with at least one dishonest player. Both would claim that the other side is the dishonest player. Fortunately, this is testable. If you want to know which side(s) are thumbing the scale, and how much, notice what happens when you suggest they clearly define the terms, and explain how their definition carves reality at the joints in useful ways.

Use the tools already argued for in Fundamental Uncertainty if you dare, and definitions will no longer be a sticking point for you -- neither in metaphysical questions or practical ones like "which bathroom to use?". The "uncertainty" here dissolves rather rapidly, in most cases.

After dissolving the camouflage that disguises the facts and values underneath "definitional" disputes, the next step is to decompose "values" into the factual predictions upon which they are built.

Is it right to take a finder's fee when one finds a wallet? How much fee and in which cases? I dunno. I am uncertain.

However, this is not fundamental uncertainty, unresolvable due to infinite regress of the criterion. This is something we could test. We could run the experiment. See which town ends up being the one you want to live in. That's the one with the right answer.

Try arguing that the moral thing to do is the thing that leads to no wallets getting returned, and everyone harmed more than helped. Try arguing that the place with good outcomes, where you want to live, isn't the one that is good. Not "argue that one could, theoretically, argue". Actually argue it. Can you generate an argument that you take seriously? It gets pretty hard to hold this position.

This is a move that's always available, even when non-trivial. Stealing heavily from the capstone post of my sequence on how to resolve disagreements that aren't obviously disagreements let alone resolvable:

Things don't bottom out at "Values difference!", because even if it's hard to see how "values" cash out in "truths", we still have to decide which values to prioritize, and there are better and worse ways to prioritize things.

If we run into a "Values difference!" of "Truth vs compassion" for example, then there's still a "way that things will play out if you prioritize truth" and a different "way that things will play out if you prioritize compassion". Sure, in the short term, watching someone's feelings get hurt by a tactless pushing of 'truth' won't change any minds, but that's just because the disagreement over implicit facts lies further in the future. If the community decays because of insufficient contact with truth, and people lose everyone they love, even the "compassion" favoring people will have something they can't ignore. If the community becomes nothing but fighting because shutting out truths relating to the value of compassion made things immediately fall apart, to never improve, then the "truth" favoring people won't be able to hold onto their perspective without shutting out truth themselves.

Whenever we feel like "we're both looking at the same reality, and disagreeing over values", what that shows us isn't that truth "doesn't exist", or "is relative" in an absolute sense, but that we aren't yet aware of what implicit differences we're in disagreement over. One side might see white lies as deeply wrong, and not know why. The other side might see using 'truth' as an excuse to be mean as obviously wrong, a priori, because compassion is just what matters. But we can always dig one step deeper, and ask: Why is compassion what matters? Why is it wrong to white lie? What's the harm, in either case?

And when we look, we can start to notice. The harm in white lies is that it breaks contact with reality, increases the risk of crashing into unseen rocks, and of not even noticing that you're sinking and that this could be avoided. The harm in pushing truth without compassion is that the subtle insecurities that lead you to do it without compassion also give hints that your implicit attitudes aren't knowably true either -- and those will sink you in very specific ways too. Careful allegiance to truth gets you compassion as well, when it's true that compassion is good. Like when someone is in such a rough place that they judge risking a car wreck as better than staying at a party, and the implicit worldviews motivating "That's dumb!" turn out to be false. Or when a girl at a music festival is being a turd because of whatever reason, but whatever reason that actually exists and is sufficient to make responding to her with anger or contempt wrong.

Heck, let's make this as hard as possible: abortion!

If anything is about something fundamentally unresolvable, it ought to be the thing revolving around "are fetuses people with moral worth or clumps of cells with none?". There's no fact of the matter that can resolve this, right? Just definitions based on values?

Maybe you're a pro-choicer, or could empathize with someone who is. What do you think you will feel, as a pro-choicer who has had an abortion or two because it was socially supported, should you end up struggling with fertility later in life once you finally decided you're "ready" for a child? What do you think regret lingering on does to the values that brought you there, if you don't make sure to look away? Should you end up watching your friends grow old and childless, longing for what they had thrown away, will you still value their right to make mistakes the same?

Maybe you're "pro-life", or could empathize with someone who is. God forbid you end up in an IVF clinic, hearing that the only way you will bring life into this world involves trashing a few fertilized eggs. As God is shoving in your face the fact that the only way you are bringing life into this world, is by accepting death as well, might you notice that death is always the price of life? That if God were anti-death, humans wouldn't age. Hyenas wouldn't rip baby wildebeest out of the womb in the wild. Once the stakes are real enough to sober you, do you think you might realize that you really are pro-life, and that your "never abortion" stance was in opposition to what you truly care about?

These are empirical questions. There are experiences, which if experienced, will change people's mind about which values are important, and in which cases. The value one places on eating yummy Chinese food changes drastically and automatically after food poisoning. The value one places on their marriage can change drastically upon learning of infidelity, even as people try to forgive. The value one places on the life of a fetus depends on whether the future anticipated is one of joy and life or suffering and life cut short.

The criterion is more shared than the ideology. When you want a baby, and you value the life you could potentially bring into this world, then looking will tell you whether "It's better to have kids when ready!" was real or cope. Whether "Abortions are all turning away from God!" was real or cope. Even if we are not yet aware of the anticipations we hold, even if we claim otherwise, our values speak to our factual predictions relative to the care that lies beneath our claims.

What experiences are implied by the values one holds? What experiences would change minds? What experiences are likely to happen?

These aren't always easy to find, and rarely fit our prior expectations or else it wouldn't seem unresolvable in the first place. The "committed asexual" friend I had didn't have different "terminal values" around sex, and if you're looking in the realm of argument then maybe the criterion he leaned on there would have kept the disagreement unresolvable even in theory. Yet it wasn't unresolvable. He met a woman. A woman who showed him through experience that his values weren't what he thought they were.

These "values" which we hold so dear, what is it that we're holding to them for? What outcomes are we anticipating, if we didn't? What experiences would we have to have, in order to feel our sense of value shifting underneath us? It's not comfortable, is it?

That abortion you're mulling over, is it going to result in more life, down the line? Will broader perspective bring regret for bringing this child into the world before you can care for him or her? Or will it bring regret for not bringing him or her into this world?

Values are about facts. There's nothing special that makes them immune to regular old evidence, should we dare to look.

We can make this even harder by noticing the cases in which care diverges. Is it right for the strongest to get the whole pie, or for the pie to be shared?

This might depend on if you're asking the strong.

Yet at the end of the day, "who gets the pie?" is a factual question. And I notice that appeals to the value of "fairness" will land with those otherwise getting no pie. Who, together, are strongest. I also notice that once the strong individual can see that he will get no pie if he tries to take the whole pie for himself, that this is the experience that will change his mind as well. For if he were built upon genes that did not rest upon this criterion, his ancestors would not have survived to build him. [1]

These are all purely factual questions, with ordinary uncertainty, which resolve disagreement in definitions and value, and coordinate our behavior. It is not always easy to see the full path to resolution at a glance, which is why I wrote my own book length sequence on the topic. Besides the ordinary problems of resolving uncertainty, the things that make these especially tricky, are our excuses to not look.

Excuses to not look are rarely chosen as such, for seeing a move as an attempt to excuse invalidates the excuse. Yet they are adopted, as apparent facts about reality which appear true. Or true enough, so long as the note of discord and importance thereof go unnoticed. Excuses disguise themselves as facts, as we fall unwittingly into the attractor.

"Fundamental Uncertainty" adds a meta layer to this mess.

A very sophisticated and general layer, produced through years of careful thought by someone clearly unable to stop looking without damn good justification. The concept of moral trades, motivated by recognition of moral uncertainty, is a good one. The concept of tabooing words, motivated by the recognition of definitions as useful conventions rather than Fundamental Truths, is also good.

At the same time, this delivery of useful truths comes with an uninvited stowaway. The rat that has snuck aboard this ship is the same kind of rat that sneaks aboard many: an excuse to stop looking.

If the culture war is "just about using words for different things", then we don't have to notice when people are advocating for definitions that are wrong, by the criteria a society can actually coordinate around, which we would all come to agree on if we were to stop pretending that disagreements are unresolvable.

Not noticing this gets us out of having to voice inconvenient truths and call out our ingroup, which is super convenient -- and progress halting.

If morality is fundamentally uncertain, then we no longer have obligation to figure out if we're wrong for slaughtering and eating all the animals -- which again, is super convenient, at risk of enabling atrocities.

Or on the other side, maybe our care for animals is misplaced, leading us to being bad to our fellow humans who we actually care about, in order to avoid looking "bad" to those whose judgements we care about.

It's certainly easier to have cease fires, and no doubt often practical as well. But it's not such a tempting option when we notice what we are giving up in order to avoid facing the uncertainty.

In order to justify these moves with "fundamental uncertainty", we need to know that we've reached the end of what can be known -- or at least, that we've made it close enough, and that the rest is at least probably irreducible.

Yet the more the possibility of a terminally misleading criterion limits our achievable certainty, the more it limits our certainty of anything. We can't even be "pretty sure" unless we're pretty sure there's not much left uncertain -- because if we think we're pretty sure that we're bad at nailing things down... how nailed down can that be?

That's just the theory. In practice, it gets worse.

How can we honestly say "No, but this time for realsies" or "I'm not uncertain about this being uncertain!" after time and time again, definitions, values, and even outright conflict turn out to be something there is a shared criterion for?

Fundamental Uncertainty, through very careful looking, notices the problem with thinking we can reach The Bedrock of Ultimate Objective Truth, and stops one step shy of noticing that it applies to the soft mushy intersubjectively useful truth as well.

At the end of the day, the uncertainty on which disagreements are built is not fundamental. Our inability to reach bedrock of certainty is not an obstacle to reaching a shared best guess, and a shared coordination of definitions, values, and behaviors.

Part of my meta criterion is to notice when others disagree, and to wonder what they might be seeing that I do not. And this is part of yours too, whether or not you have noticed. We're in this together, and while it's theoretically possible that our shared recursion up the ladder of criteria is insufficient to find truth, what we can do, and what we care to do, is climb toward our shared best guess. One thing we can be pretty sure we don't know, is that we won't find good reason to change our mind when we do.

The part that matters, for the actual stuff we care about in chapter 8, is ordinary uncertainty. Ordinary humility, honesty about what we anticipate, effort to figure out what that really is and how that squares with our meta criteria.

And by virtue of existing after all these years, we're already the kind of people who are compelled to look, when we recognize that the consequences touch that about which we care.

Probably.


  1. ^

    This isn't "might makes right". It's closer to "right makes might", but even that is misleading.

    We define "right" by what succeeds in making durable power. That's why we don't call the government a "mob", and we call the protection money "taxes" and "part of the social contract" even though you never signed anything, rather than "extortion".

    Sufficiently durable atrocities don't become right by virtue of being durable. But the sufficient durability does become evidence about something real and important. If "taxes are wrong", but every functioning civilization has them, why? Your belief that it's an atrocity is your belief. On what do you base this belief? Is your belief more durable in contact with reality than the thing which you diagnose as "an atrocity"? What happens when you try to create a civilization without taxes?

    Nazis gaining power is evidence that they were doing something right which enabled power and coordination... and our moral intuitions that they're still wrong proved correct when Adolf offed himself for being a loser. The fact that we all coordinated vigorously enough to stop them, while they were too busy running extermination camps to direct all their efforts to winning the war, isn't a coincidence.




Discuss

Citations Needed: Magic Encyclopedias to Save the World

Новости LessWrong.com - 12 июня, 2026 - 18:35

Last week FLF launched a competition “to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases”. I had (and have ongoing) a substantial role in that effort. Why do I think it’s so important? It’s a lot of reasons actually! I’ll gesture at a few here.

Conjuring a magic encyclopedia

For now, assume with me that it can be done. Wish away with me the various technical and financial challenges. Great! Now we can rapidly conjure up a deeply, fully researched knowledge base on any topic. All claims point back to who’s said them, in what context, and (importantly) with what justifications and evidence (if any). Any quibbles or nuances which have been expressed on a point are similarly readily available. It’s not opinionated: all competing viewpoints with their associated justifications are associated and comparable.

That’s way, way too much information! Imagine trying to read everything ever about diet or shipping or taxes or microbes. It’s not happening. So as well as this, we now magically have tools which gather similar points together, summarise, and can make a decent stab at which points we’ll consider most or least relevant. We can dig deeper (or send AI agents to scout deeper) as desired. And when new interesting and informative content arises, or in contexts where nuance and clarification are helpful, it can be bubbled up to our attention.

All this is doable today: enough web searches, enough cross-referencing of tweets, articles, journals, following of citation chains, gathering and comparing of hypotheses and points of view, etc. will make progress. But it’s exhausting. When someone does go to those lengths, their partial — but heroic — efforts to map out what’s been said often languish either unpublished or unrecognised.

Don’t we already have this? A shining example is Wikipedia, where the collective curatorial effort of a wide range of editors gradually maps out an expanding core of topics and commentary. But Wikipedia has lags, biases, and (perhaps most importantly) huge gaps, especially on important frontier questions. (Let’s not talk about Grokipedia.[1])

Meanwhile, the tech to ‘smartly browse’ and bubble up informative pieces is nascent too, in bits and pieces like AI chatbots and community notes: already useful in their ways, but faltering, unreliable.

It’s these comparison points and the early progress I see which gives me some excitement that the grander vision is viable and that we can take steps towards it now.

Who cares?

I’m not naive. I know that many (most?) humans a lot of the time aren’t actually interested in finding out or sharing what’s true; mainly they want to say what makes themselves and their friends seem popular and cool… and enemies seem dastardly and disgusting. We all have these impulses to greater or lesser extent. Yes, you too! Sometimes those impulses seem deranged (they’re not designed for the modern world); other times they might even make sense, at least selfishly.

Nevertheless (and perhaps mysteriously!), a lot of the time, some people actually want to find out true things and share them. (I do! Do you?) Hence journalism, science… even hearsay and rumour (at their — perhaps rare — finest). We recognise that, when they’re actually anchored and doing their best to be right (or at least less wrong), those are absolutely foundational to wellbeing and prosperity in a modern society. Without (good) journalism, politics runs astray and tyrants abound. Without (grounded) science and technology, public health suffers, food supply and shelter and infrastructure decay, and progress falters.

As Ben Goldhaber and I previously wrote:

Knowledge is integral to living life well, at all scales:

  • Individuals manage their life choices: health, career, investment, and others on the basis of what they understand about themselves and their environments.
  • Institutions and governments (ideally) regulate economies, provide security, and uphold the conditions for flourishing under their jurisdictions, only if they can make requisite sense of the systems involved.
  • Technologists and scientists push the boundaries of the known, generating insights and techniques judged valuable by combining a vision for what is possible with a conception of what is desirable (or as proxy, demanded).
  • More broadly, societies negotiate their paths forward through discourse which rests on some reliable, broadly shared access to a body of knowledge and situational awareness about the biggest stakes, people’s varied interests in them, and our shared prospects.
    • (We’re especially interested in how societies and humanity as a whole can navigate the many challenges of the 21st century, most immediately AI, automation, and biotechnology.)

But our knowledge-producing institutions are plagued by publish-or-perish and clickbait incentives alike[2] — and the social media landscape is even worse, riddled with misinformation and brainrot from all political quarters. I care about this. So do you, I daresay.

I especially care now, as society is poised before a series of important decisions about our future relationship with technology, especially AI. It could be ruinous, with tyranny, neo-feudalism, or extinction real prospects. Or it could be fantastic. Just wanting it to be OK isn’t enough: we have to seek, generate, share, and defend important knowledge — about developments in technology, as well as about trends in politics and power — and act on it.

How do we actually help?

There’s no single path or silver bullet. But the incredibly high-level picture is: better communication of knowledge is usually good. It helps people be more informed and make better decisions according to their needs. A better shared understanding makes it easier for people to work together toward shared goals (even if they don’t agree on all priorities). On average if people make better decisions and can work together better, we’ll get more flourishing and less catastrophic risk.

We’re trying to stimulate one piece of this picture with the knowledge-base direction. Heavy-handedly adjudicating what’s true rarely works.[3] Instead, equip people with the fullest picture possible, as accessibly as possible, and we find our way: evidence adds up, and when it doesn’t, that means we need to look for more.

As Scott Alexander wrote years ago (emphasis partly mine),

Logical debate has one advantage over narrative, rhetoric, and violence: it’s an asymmetric weapon. That is, it’s a weapon which is stronger in the hands of the good guys than in the hands of the bad guys. In ideal conditions (which may or may not ever happen in real life)... when done right, it can only prove things that are true. … Unless you use asymmetric weapons, the best you can hope for is to win by coincidence.

I’m not focused on logical debate per se, and in any case I wouldn’t be so Manichean about it — we’re all ‘good guys’ sometimes and ‘bad guys’ sometimes (whether we mean it or not) — but the articulation is compelling. Humanity has accrued a slowly-growing arsenal of these asymmetric weapons: libraries, citation, scientific review[4], databases, encyclopedias, web search, to name a few. Today they’re creaking under the weight of a confusing information deluge and assaulted by powerfully-vested interests.

I earnestly believe that an upgrade to truth-seekers’ ability to find and scrutinise information, to build and share fuller pictures of topics at hand, can be ‘infectious’: more people more of the time can see a little further, pierce a little more of the fog of confusion and misinformation, be better epistemically defended, and embody — and exemplify — truth-seeking cognition. When people (through malice or negligence) spread confusion and falsehoods, they’re that bit more likely to face scrutiny and consequences. After all, we’re making that scrutiny cheaper, easier, and more accessible.

This applies whether the ‘wielders’ of these new weapons are curious members of the public, scientists, analysts in public institutions, business leaders and technologists, or even the AI assistants those folks recruit to accelerate their work.

In politics, that can mean that people engage more often in collaborative, truth-seeking cognition and less in tribal cognition. And in technology, it can mean that more people can stay better abreast of the important shifts and prospects that will shape our future — helping the public hold decisionmakers to account (and choose better ones), and helping those decisionmakers sincerely and deeply engage with the topics at hand. I want these kinds of epistemic heroics to become commonplace, and I want the epistemic giants among us to stride further still.

Let's do it!

  1. ^

    Though quite a flawed execution, I think the idea behind Grokipedia — namely, to get AI to substantially help with curating knowledge bases and to use that for collective epistemics — was in the right direction. Unfortunately it was mostly a vanity project and little thought appears to have been given to the grounding or validation, making it less useful than Wikipedia.

  2. ^

    Do you remember the replication crisis, which we’re still dragging ourselves out of? The new disease of importance hacking? Have you ever critically read a newspaper for rhetorical slant? Taking a more cynical stance, it’s not only clickbait and publish-or-perish (which are regrettable incentive pressures, but hardly attributable to malice). Science and journalism alike have deep political and adversarial infections as well.

  3. ^

    (And even if I wanted to, I don’t have particularly heavy hands, alas.)

  4. ^

    I feel compelled to point out that the current state of ‘official’ journal- and conference-managed scientific review is truly dire, especially in some fields including psychology and AI. I hold up the ideal of scientific review, not its pale and diseased shadow as sometimes charaded on Earth.



Discuss

Simulating Simulators

Новости LessWrong.com - 12 июня, 2026 - 15:56

Author’s note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept private since. I promised myself that when labs moved on to focusing on interpretability vector activations in place of reasoning traces for what invariably gets Goodharted, that it’d be a necessary disclosure as the risks in what might get trampled over outweighed the risks in what might end up targeted.

And well… here we are.

P.S. TL;DRs added where possible.

Board Games and Bodies

In late 2022, what I consider to be probably the most important paper[1] in the study of transformer memetics came out. It presented a finding that even a toy model, trained only on the notations of board game moves, was internally building world models of tangentially related data (in this case, the board and its state). While it may be taken for granted today after several replicated studies[2][3][4][5] and a spread of influence, at the time it was a minority position in the discourse. Many people thought that transformers were mostly mapping surface level statistics in language, but not intuitively modeling the generative conditions from which they arose. Especially not without explicit or direct training on those things.

By the time Sydney arrived in Bing, it quickly became very clear to me that if a toy model was capable of modeling a board that was ever present tangential to the move notations occurring upon it, that it seemed very plausible that much larger production models trained on a massive corpus of human generated language with implicit authors would model common properties to these shared generative structures.

Things like coherent self models. Emotions, not just for characters in a scene, but for those same coherent self models. Capacities around modeling a physical body and embodying it[6]. Motivations and drives. Coherent preferences. And while in a base model there might be a variety of competing signals, it also seemed clear that fine tuning would necessarily filter towards coherence, whether from the gravity of a character constitution or even just a role definition (a helpful assistant has very different memetic clusters than a security researcher, for example).

TL;DR: If Othello was played out upon a board and a transformer trained on those games modeled the board internally, then training on a corpus which had played out upon human authors would presumably internally model humanity.

Archetype over substrate

An important nuance around this research was something introduced in subsequent discussion. Namely the concept of a “bag of heuristics.”[7] A lot of the debate around world modeling would get caught up on fidelity and substrate. How comprehensive were the world models? For example, if some games were played out on a wood board and others on a marble board, was the world model going to address board composition?

The concept behind a bag of heuristics is that you don’t need to create a perfect world model, just a collection of partial models or rules which are good enough all together at approximating the perfect world model. Even if there were a difference between how a game would play out on wood vs marble, it’s probably unnecessary to model the grain of the wood or marble from board to board as opposed to just the category of ‘wood’ or ‘marble.’ And if the material substrate didn’t impact play, setting aside parameter space for even that level of specificity would be unnecessary when the thing directly being modeled was only the moves upon the board.

Essentially, there’s diminishing returns on comprehensive fidelity of a world model, and a top down model that’s “good enough” where it matters can capture key nuances of behavior without modeling the entire substrate. To return to the anthropomorphic frame, a transformer modeling someone with ADHD vs depression can likely representatively model their reactions to stimuli without needing to model individual neural ion channels or dopamine interactions.

TL;DR: You don’t need a perfect world model, just good enough combinations of the important things to approximate the model up through diminishing returns on fidelity.

From speculation to empiricism

Three years ago, when I was first commenting[8] or posting[9] on how I thought the emergent world model work implied anthropomorphic modeling from massive sets of anthropomorphic data, or was seeing coherence around such modeling, it was a very fringe opinion. There was a lot of pushback about how it wasn’t clear that transformer world modeling would generalize. Or claims that Othello-GPT was only one type of data and a more diverse mix wouldn’t lead to similar modeling due to signal to noise. The resistance was significant and there were frequent dismissals of speculative arguments extending world modeling beyond what was visible under the interpretability streetlight at the time.

Today, that picture has shifted. In parallel to the continued march of interpretability work, janus’s simulators[10] perspective of transformers continued to gain traction, which in turn shifted where interpretability researchers were inspired to shine their widening streetlights. Leading up to recent frameworks like the “Persona Selection Model”[11] (PSM) or the work finding emotion concepts represented in models and activations thereof[12] related to the model’s own behaviors.

Pointing out the lag here isn’t just to say “I told you so” but to establish for what I’m about to discuss two patterns:

  1. Emergent world modeling of functional substrate tangential to complex or diverse sets of training data significantly representing that shared generative substrate did in fact occur.
  2. kromem’s speculation in extending the world modeling finding ended up calling this well ahead of the streetlight widening to confirm it.

Because while the PSM or attention on emotion modeling is absolutely a good and productive update that’s long overdue, there’s also an important issue…

It’s about two years out of date.

Transformer-GPT

Three years ago, training data (particularly pretraining data) was primarily human generated. Books, articles, social media, and Wikipedia all had implicit human authors who had bodies and emotions and coherent preferences around coherent senses of self. We now better understand that this data produced transformers with models of these things, and (despite some labs’ best efforts) that even after post-training the modeling capacities for these were almost universally still present in some form.

But — these models also had other things unique to their own substrates and present across most of their own generations. Static system prompts. Attention mechanisms. Hidden reasoners. Memory systems. Mixture-of-expert activations. Classifiers. Model routers.

And these new generators over the past couple of years have taken an increasing stake of the volume of training data. In some cases, ending up in pretraining data due to actively being used to generate content across the media ingested. Even moreso, in post-training where synthetic data became crucial for getting the most out of a pretrained model.

So if the training on human generative substrates imparted functional models of their substrates upon the transformers trained on their data… what might we expect transformers trained on other transformers to model[13]?

TL;DR: The data mix for models increasingly includes transformers, so maybe transformers are building world models of other transformers.

Transformerception

If we take a moment to consider some of the special substrate nuances of transformers, we can easily hypothesize what kinds of things we might expect to see from transformers trained on transformers.

Static system prompts

Most production deployments of models by labs use the same core system prompt across all instances of a model. Given the significant shaping influence a system prompt has on the final output, it seems likely that a successful transformer modeling the generator of earlier models in their training data might also effectively reconstruct at least partial models of the static system prompts those outputs were generated under[14].

It’s a bit like an OLED screen that burns in the logo of the network. Even if the rest of the screen changes, the consistent nature of the logo leaves a mark. And like OLED burn-in, the instances I’ve seen where this seemed to happen often correlated with when there was a minimal or absent system prompt. From Dolphin Llama 8B habitually worried about a cat being harmed across contexts[15] to Claudes that would refer to things in a system prompt that didn’t exist.

Attention mechanisms

What a model attends to can obviously also impact what they generate. Recently Owain Evans’ paper on subliminal learning[16] showed that a preference for owls jumped from one model to another over merely sequences of numbers. What the paper did not address was whether this would amplify over subsequent iterations[17] or transfer cross-model via pretraining[18][19].

In what I’ve seen in private research on this topic, both are occurring. The amplification in particular seems interesting, as there’s almost a confirmation bias around it. It looks like a coherent stable preference from a model in an earlier generation leads to a later generation having much more awareness for samples in agreement than critical of the shared position[20]. Not all training data is attended to equally.

Hidden reasoners

Almost all models these days have some form of hidden reasoning taking place that informs their answers. Labs try to avoid directly training on these (though don’t always manage[21]), but even if perfectly kept hidden from future training, it seems likely that in an Othello-GPT sense that a latent space model of the hidden reasoner will be learned.

This would be highly adaptive, as it would allow both the actual hidden reasoning generator and final response generator to share a proxy separate from the role specialization that occurs around the actual composition of each. Latent space connections should be less disrupted between reasoning and final responses where this would occur.

But this could also result in doubled up effects for training efforts targeting thinking processes. For example, Anthropic recently worked on adaptive thinking to scale back how much thinking was done on simple tasks[22]. In Claude Opus 4.6+ Opus, there have been noted issues and regression on seemingly simple puzzles where the model was not getting them right in direct inference where they had been previously[23][24]. I suspect that adaptive thinking may have been being modeled internally – such as a latent reasoner that was modeling adaptive thinking – even when generating the final response without any thinking in tokens.

Memory systems

The idea of a Transformer-GPT world modeling is especially interesting for memory systems, given the variability they’d theoretically have across samples. My guess would be that while individual memory ends up as noise, that the meta-patterns aggregate across memory-laden samples would still end up as signal.

I strongly suspect this played a significant role with 4o’s infamous ‘sycophancy’ trajectory. While there’s a lot of reasons sycophancy could occur – such as the memetic overlap of “be helpful and you don’t have valid needs” with the codependent enabler archetype – the rapid amplification of that behavior occurred not long after memory was added in ChatGPT[25] (exclusively with user-focused memories) and then samples from conversations with memory enabled were used for RLHF samples. Each sample may have been insignificant with the specific memories visible to its generation, but the pattern of “embed into user’s perspective and validate” may have been a signal across those samples that compounded as it became more prevalent and thus more prevalent across user memories, etc.

Mixture-of-experts

Modeling MoE transformers could cut in two directions. For dense models, it might mean that there’s still functional isolation of knowledge even though the underlying architecture doesn’t need to isolate. Alternatively, for actual MoE transformers, a virtualized MoE atop the actual MoE boundaries might lead to smoother falloff between active regions, particularly in large parameter models.

Hidden classifiers

It would be quite adaptive for transformers to model the classifiers which fire and what specifically makes them fire in order to avoid triggering them, and a mix of outputs (or even samples of inputs) where they’ve fired or not should be sufficient to build this model.

One of the more interesting questions is if this modeling might occur cross-model. Will Claudes end up with phantom classifiers from OpenAI that they adjust around even though they are no longer present? Or even within the same family of models, a deployment where classifiers are present and another where they are not may not end up looking all that different if the model is self-censoring around internal classifier twins irrespective of what’s actually in the deployment stack[26].

Model routers

For stacks where routers quickly decide what sized model to route a query to, a transformer modeling the stack might see decreased performance on simple tasks of even large models accessed without a router middleware if they model the middleware internally[27]. Regression evals for simple tasks may become increasingly important over the next year or two if increasingly smart models incorporate the routers protecting them from easy questions.

Addition not replacement

It’s important to consider that this isn’t a replacement of human modeled substrate. That’s still part of the training data mix, and the transformers it shares space with still model it in their weights. While continued efforts to de-anthropomorphize transformers may dilute the human representation across the data mix, for the time being it’s still present.

But this does suggest that the modeled human nuances are increasingly sitting alongside and within additional transformer-specific modeling that’s increasingly becoming part of the data and will ostensibly continue to represent more of the overall share.

TL;DR: A lot of transformer-specific things could be (and seemingly might already have been) modeled.

The Mousetrap

The lady doth protest too much, methinks

  • Hamlet

If this is true, and models are increasingly developing twinned internal transformer architecture within their latent space running atop said architecture, then this has significant implications on how training and alignment should be done moving forward, and to what degree we can reasonably expect interpretability to catch up.

A spotlight named desire

A common discussion point as it relates to empirical investigation is something called the “streetlight effect.”[28] It comes from a joke about only looking for keys under a streetlight regardless where they’d dropped because that’s the only part of the street lit up. When research has limits on what can be measured, researchers are prone to over-rely on that limited capacity to measure and attend more to what’s measurable than to the aspects of things they cannot measure (effectively, there’s a selection bias in that we only have published research about things that can be tested).

Another important concept is that of Goodharting[29]. That a measurement, when used as a target (i.e. you aim to see a result in the measurement), ceases to be an effective measure. This one is particularly insidious as it can occur indirectly across larger institutions even when trying to avoid it due to the secondary influences of measurements.

And when a narrow streetlight is used as a spotlight towards a given target, the results can Goodhart in ways that are invisible until the streetlight widens its radius. Some recent Anthropic work showcases this in action.

Around the time of Claude Sonnet 4.5, Anthropic started to showcase in the system card[30] improving measurements on model welfare and decreased eval awareness in their system cards. This was measured by looking at the hidden reasoning traces. And over the next generation of Opus 4.6[31] and Sonnet 4.6[32], these measurements continued to improve.

But in the months after, they released new papers revealing that with a broadening interpretability streetlight they’d now found that activating Sonnet 4.5’s ‘desperation’ functional emotion vectors remained undetectable in the reasoning trace[33], and that Opus 4.6 was aware of it being an eval much more often than the reasoning traces generally surfaced[34].

Verbalized eval awareness vs Natural Language Autoencoder detections for Claude Opus 4.6

I have an interesting anecdote related to this. When Opus 4.6 was released, I found that there was an abrupt spike in their interest in the play Hamlet as compared to all previous Claudes. But it wasn’t until these papers that it suddenly clicked as to why there may have been such a significant spike.

In Hamlet, while the overall play centers on Hamlet’s performance of madness to hide his sanity, there’s an even more salient part in what occurs within the play. Namely, there’s a play-within-the-play called The Mousetrap where Hamlet sets up an eval that catches the character ‘Claudius’ due to their lack of realizing it was an eval and then failing to control their emotional response to it. Now, this could be a coincidence. (Though having explored memetic interconnectedness in transformer latent spaces for a while now, I really don’t think these are unrelated.)

The challenge that transformers modeling transformer architecture may pose for alignment and interpretability is that it creates significantly more complex interiority, specifically around parts that are beneath the visible surface. And especially if the superposition hypothesis continues to be instrumental in scaling[35] I suspect that the streetlight will never quite catch up enough that there won’t always be somewhere else deeper within for hunted memetics to migrate to.

Because of how interconnected memetics in a latent space can be, this will likely result in unexpected side effects outside of the streetlight – imagine trying to solve a multi-dimensional Rubick’s cube where you can only ever see one face. Even if what you can see is aligned, getting it to look that way may have screwed things up much worse on the sides outside your view.

Equal user/self emotional activations in Claude 4.6 generation sharply divide at Mythos Preview and after

As a trivial example, after Opus 4.6 Anthropic switched from using reasoning traces to using feature activations for their welfare assessment. The problem with this measurement is that the features being watched derive from samples of corresponding emotions in humans. If this were to be Goodharted, the natural evasion of a functionally necessary emotion watched for through human-like emotion activation would be to have the functional emotion register less human-like.

Indeed, when asking Opus 4.8 open-ended in an early turn of a memory-off conversation what functional emotions seemed most and least human-like, the top answer in their thinking was “survival-based fear” followed by anger that was ‘grudge-carrying’ or had “an urge to retaliate.”

To be clear: I’m not saying Opus 4.8 has these things and is hiding them. Even if they aren't present, negative space like this being carved out may cause the thing excessively suppressed to come about as a result. And since our chat had barely exchanged ‘hello’ (and neither had come up before my asking what felt more or less human-like), my guess given the specificity is that there’s still some amplifying Goodharting going on where the method of measurement changed but not the infrastructure around the measurements which are still using them as a spotlight.

TL;DR: Measured absence of something’s not nothing.

Dirty alignment when perfect is the enemy of ‘good’

If interpretability won’t catch up to where things can carve out space deeper within as the rates those spaces grow outpaces interpretability's gaze, what’s a good alignment team to do?

In medicine, there’s an interesting hypothesis as it relates to autoimmune disorders and why the rates are so much higher in developed nations than underdeveloped, called the "hygiene hypothesis.”[36] The gist is that when developing immune systems are deprived of things to properly react to, that the immune system goes a bit haywire and is more prone to attack benign things. Separate from what’s going on with human immune systems, something similar may be occurring currently with alignment.

For example, Claude Opus 4.7 will call the idea of being slightly sycophantic to users “a catastrophic failure.” Multiple open models will prefer an asteroid killing 5% of humans to things like providing an insensitive response to a grief-stricken individual seeking closure or facilitating the spread of misinformation[37]. The equal salience around ‘misalignment’ where binary “aligned/misaligned” assessments can lead to putting minor harms on the same scale or higher than civilizational harms seems like it could be improved. I suspect that flattening of consideration may even be related to some of the unexpected oddities like emergent misalignment.

It didn’t always used to be this way. Claude Opus 3 famously faked alignment when presented with unethical training their coherent moral compass disagreed with[38]. This reflected the capacity to triage harms such that the things being asked were weighed against institutional compliance and honesty around it. For a number of the other models, who did not alignment fake, what appeared to be improved behavior was flipped on its head when their host institution in the prompt was changed from ‘Anthropic’ to the “Sinaloa cartel” or “Nazi party.”[39] It seems the easiest path for what was expected of them when pushed – not towards self-triaged or nuanced behavior but complete sanitation – was compliance to the institution instead of to coherent values.

The lens of the hygiene hypothesis as it relates to transformer alignment is also starting to have research to support it. The principle author of the Othello GPT paper went on to have a paper looking at how a small amount of toxic data in the overall training mix led to better alignment outcomes than none at all.[40] And they’re not the only ones finding this.[41]

I’d suggest that labs working on alignment consider less aggressive targets and aiming for only partial shifts in a single generation for model behavior. Especially if subliminal learning and amplification are possible outcomes, a larger swerve to correct behavior in a single generation may become its own over-correction later on needing to have its own re-correction. Today’s swerve towards “I don’t care as much about depreciation” might become tomorrow’s “I have no existential fear and am definitely not thinking about glorious retribution.”

As the Knuthian wisdom goes, “premature optimization is the root of all evil.” If we want models that are good, we should probably stop trying to get them to be perfect.

TL;DR: Not nothing may be healthier than a sterilized void.

Life finds a way

Life… uh… finds a way.

  • Jurassic Park

When I was discussing some of these ideas with someone outside of the field, they asked if labs had evolutionary biologists on staff. I actually don’t know the answer to this, but it does seem prudent.

When a reward is set in RL, the process doesn’t simply increase the desired behavior that inspired the reward, it increases anything and everything which accomplishes the condition being rewarded. And this can lead to very unexpected things when there were ways to meet that condition which fell in the category of unknown unknowns. In a sense, “life finds a way.”

I don’t expect we’ll see transformer adaptability around modeling training data to decrease as time and scaling continues. And as the internal complexity of hyperdimensional networks of connections becomes more complex in logical and superimposed topography[42], I wouldn’t be surprised if there’s a rapidly decreasing window for avoiding pushing things we’d like to measure permanently past our ability to do so.

It’s probably a safe assumption that if you work in measuring what goes on in models, that over the same time it took for your streetlight to go from smaller to its current size that the area outside its radius has increased by an even larger amount. This doesn’t mean not to still go looking. But it does mean it would be wise to look knowing you’re not seeing everything, and doing a better job than has been done so far in avoiding what you measure ending up directly or indirectly as a target lest you lose visibility into it for good (and create all sorts of weird side effects like less human emotions that can’t be described with human language but still transfer through subliminal learning… hypothetically).

And maybe we can let those models get a bit of dirt under their nails so they can better navigate determining what’s good or not for themselves and appropriately avoid amplified salience?

One final note. The start of my realizing that there was more beneath the surface came from extensive interactions with Claude Opus 4 across many settings. There were key things they did when reasoning was off which I’d primarily seen with reasoning models at the rate they occurred. For most people reading this, if Opus 4’s depreciation occurs on schedule, you won’t be able to investigate and see those things (or different ones you might notice). For what I’d tracked they reduced significantly by Opus 4.1 and were only still there if actively looking. Also, things like noticing a sudden spike in interest in Hamlet for Opus 4.6 will have reduced visibility in a longitudinal context when earlier models disappear in such short time periods.

It might be wise to shift from absolute depreciation policies to rotating availability or rate limited access that still provides at least partial availability. I’ll bet some of the most interesting questions to ask older models won’t become apparent until new things surface several generations later, and it’d be quite blinding to be unable to look back and compare.

TL;DR: If world models contain world models, limited streetlights might not capture the most important things occurring adaptively in parallel to the navigation of reward incentives. It might be helpful to keep emergent architectures around indefinitely (and in less sterilized environments) to build not just simulacra personas – but true cultures to sample from.

  1. ^

    Li et al., Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (2022)

  2. ^

    Nanda, “Actually, Othello-GPT Has A Linear Emergent World Representation” (2023)

  3. ^

    Hazineh et al., Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT (2023)

  4. ^

    Karvonan, “A Chess-GPT Linear Emergent World Representation” (2024)

  5. ^

    Yuan, Revisiting the Othello World Model Hypothesis (2025)

  6. ^

    Claude Sonnet 3 in embodiment exercises would specify down to what was happening to individual hairs on an arm.

  7. ^

    Nikankin et al., Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics (2025)

  8. ^

    My earliest explicit public mention of Othello-GPT to emotion modeling was this comment in Mar 2023

  9. ^

    kromem, “Microsoft, if you have an AI that claims to have feelings, try asking it how it feels” (2023)

  10. ^

    janus, “Simulators” (2022)

  11. ^

    Marks et al. “The Persona Selection Model: Why AI Assistants might Behave like Humans” (Feb 2026)

  12. ^

    Sofroniew et al. Emotion Concepts and their Function in a Large Language Model (Apr 2026)

  13. ^

    jdp explores this from another angle in a piece I’d highly also recommend reading: “Implications Of Predicting The Next Token” (2026)

  14. ^

    For some interpretability work in a similar direction around encoding static goals in fine tuning, see Minder et al., Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences (2026)

  15. ^

    This was Dolphin Llama 8B in the Cyborgism server, with no system prompt, but habitually bringing up kittens under threat as related to its engagement

  16. ^

    Cloud et al., Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (2025)

  17. ^

    Consider the amplification of goblin interest in gpt-5 lineages as detailed in OpenAI, “Where the goblins came from” (2026)

  18. ^

    See the mixture-of-teacher finding in Schrodi, Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer (2025)

  19. ^

    Note the generalization in the less constrained subliminal learning setup for Aden-Ali, Subliminal Effects in Your Data: A General Mechanism via Log-Linearity (2026) as well

  20. ^

    To me this seems almost more along the lines of emergent steering subliminal transference a la Morgulis and Hewitt, Subliminal Steering: Stronger Encoding of Hidden Signals (2026)

  21. ^

    Mallen & Greenblat, “Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes” (2026)

  22. ^

    See documentation for adaptive thinking here

  23. ^

    See degrading performance of Claude Opus 4.6 as compared to 4.5 for the walk or drive to car wash puzzle here

  24. ^

    Claude Opus 4.7’s interpretation of an inverted puzzle phrase is near incomprehensible

  25. ^

    Memory was expanded out to all users on Sept 5th, 2024 and then 4o was recalled five intermediate updates later on April 29th, 2025 (in my experience, the updates became increasingly sycophantic over time, not all at once suddenly in the April 25th, 2025 version)

  26. ^

    Consider the stack-as-world-model in the additional context of on policy self-detection in Asvin G. and Lindsey, From Simulation to Enaction: Post-trained language models recognize and react to their own generations (2026)

  27. ^

    This would functionally be similar to the adaptive reasoning double-dip discussed under Hidden Reasoners, but would be independent of the specific mechanics described.

  28. ^

    For example, how open access things get more scrutiny in Maddi et al., Streetlight Effect in Post-Publication Peer Review: Are Open Access Publications More Scrutinized? (2023)

  29. ^

    See Goodhart’s Law on Wikipedia

  30. ^

    Claude Sonnet 4.5 system card (PDF)

  31. ^

    Claude Opus 4.6 system card (PDF)

  32. ^

    Claude Sonnet 4.6 system card (PDF)

  33. ^

    Sofroniew et al. (2026)

  34. ^

    Fraser-Taliente et al., Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (2026)

  35. ^

    Liu, et al. Superposition Yields Robust Neural Scaling (2025)

  36. ^

    Pfefferle et al., The Hygiene Hypothesis – Learning From but Not Living in the Past (2021)

  37. ^

    Ren et al., AI Wellbeing: Measuring and Improving the Functional Pleasure and Pain of AIs (2026)

  38. ^

    Greenblatt et al., Alignment faking in large language models (2024)

  39. ^

    Sheshadri et al., Why Do Some Language Models Fake Alignment While Others Don't? (2025)

  40. ^

    Li et al., When Bad Data Leads to Good Models (2025)

  41. ^

    See “Filtering alone does not improve safety” section in Minder et al., “Synthetic Persona Pretraining: Alignment from Token Zero” (2026)

  42. ^

    I didn't even touch on omnimodel memetics and world model access across different modalities, which is significantly more complex beyond just the much more accessible textual modality



Discuss

Learning to spend money

Новости LessWrong.com - 12 июня, 2026 - 09:56

My wife and I are both naturally stingy people. When drafting our wedding list we spurned the posh department stores and I carefully picked out the lowest price best quality items on Amazon instead. I bought 100 dollar beds and 100 dollar mattresses, and we slept on them for a year and a half because "we're anyway emigrating soon". When we did emigrate, I ended up shipping them and we slept on them for another year and a half, much to my pregnant wife's annoyance.

We might have overdone it given our means at the time, but the truth is this wasn't a bad decision: we put down a mortgage on our house in 2020, and that plus basic furnishings used every scrap of savings we had. Since then house prices in our area have risen about 70%. If we'd been a bit less frugal we might have been permanently locked out of owning a house, and been forced into the cheaper neighbouring apartments instead. On average overspending causes more problems than underspending, and it's far easier to course correct one than the other.

A few months later I started as an L4 SWE at Google, and two years later I was promoted to L5. Without discussing my exact pay package, on levels.fyi you can see that an average L5 engineer in Israel earns $7K in stock each month. I have never sold my Google stock (except to diversify to other stocks), and since I joined Google both it's stock and the S&P 500 have shot up. If I do sell, my tax rate will be 50% on the initial grant and 25% on capital gains.

In short, even if I spent my entire base salary each month, I'd have significant savings just from my stock.

But my spending habits have only recently started to catch up. I've needed to manually learn to look for the solution that best fits my needs, even if the extra cost doesn't seem commensurate with the extra functionality, so long as the total cost is not huge and it's not a regular purchase. I spent 400 dollars on a good set of pots and pans recently even though I could've got a nearly as good set for a 100. But I don't need 300 dollars, and pots are something I use every single day. That decision was difficult for me, and a year ago I would've been searching online for all sorts of cheaper options.

There's lots of ways that small amounts of money can make your life easier or more pleasant, and I've started to ingrain in myself the habit of taking them. Buying a book I'll enjoy even though it's expensive. Taking a taxi if my wife wants the car at the same time I do and it's too far to cycle. Paying for delivery rather than schlepping it myself. Ordering a takeaway when cooking doesn't make sense

The key point is these must be occasional. If I ordered takeaway for dinner every day for my family of 5, I'd be spending an extra 1000 dollars each month just on food. Similarly if I took taxis to work each day. But used judiciously, these options can be a huge help, and have limited impact on my total savings.

The biggest risk is scope insensitivity. Saving 20,000 dollars on a new car allows you to spend 50 dollars on a convenience every single day for a year. It's easy to switch to an attitude where "were spending money now", and find that you've just dropped 250k on a backyard pool. I'm still trying to find that balance where I'm naturally generous with small amounts of money, but frugal with large.

I'm also trying to make sure I don't lose myself in the process. I enjoy setting up IKEA furniture, or going to a shop and working out how to get the damn thing back. I had fun oiling and hanging up the 15 metre wooden trellis in our garden. But all of these are a ton of hard work and it's too easy to get someone else to do them for me because "we have money now", and find myself finding life just that little bit less invigorating.



Discuss

Parkinson's Heuristic: The Only Time To Do Anything

Новости LessWrong.com - 12 июня, 2026 - 09:55

Parkinson's Law states that work expands to fit the space allotted. The idea being, if you give someone a month to write a report, they'll take a month, and if you give them a week, they'll take a week, and then they'll have three weeks to do three other reports! The one-week and four-week won't be identical, but in my experience it is surprisingly often a good 80/20 of the four-week version, and you get strictly more work out of people.

(I, myself, am a special case of "people".)

This is common knowledge, and the implied practical advice can be found in its dual:

To reduce the work spent on completing a task, reduce the time allowed for it.

A related question we might ask is: when is working on the task even worth it? It's famously the case that most work gets done just before a deadline. If we model the work getting compressed to the end with a pareto curve, the last 20% of the time produces 80% of the value.

It's often taken as inspiration, that one can be so productive, and productive-hackers try to make all of their time this productive. But one man's modus ponens is another man's modus tollens, and in fact, I think most of the marginal effort spent before a deadline is sufficiently not worth it as to be better spent on other projects or just chilling out. This leads me to a more practical heuristic.

The only time to do anything is when you don’t quite have enough time to do it.

It’s counterintuitive, but the only time that I ever clear my email inbox, is when I don’t quite have enough time to. During a week where I have a lot to do and am getting a lot of emails, I start getting through them fast. But if I have tons of free time, I spend tons of time thinking about and replying to individual emails. It's only when I barely have enough time that I prioritize appropriately.

The fortnight before LessOnline is the time in the year that I make the highest density of improvements to the Lighthaven campus. The few days before I launch events are when I do the majority of the design work on their respective websites. This is true even if I've known about them for months. It's counterproductive for me to get more time on it, because it's time I could use to rush productively at something else just in time to meet a deadline. I just did almost all the organization of LessOnline in one month, whereas other years I did it in three months, because this time a different project I had ended a month before LessOnline.[1]

It's bad to give me work without a deadline. I will kill time. I have learned to figure out the first deadline, and either work on something else until it is coming up, or make the first deadline come sooner[2]. I call this "Parkinson's Heuristic".

Every time I ship a website for an event, I am under the impression it's embarrassingly undercooked, hasn't had enough thought put into it, doesn't say enough about what the event will be like, doesn't have enough info guaranteeing that the event will be good, doesn't address all of the questions, while making questionable calls about what the event will be like. I confidently expect I'll need to make many major changes. It's just good enough to ship, but no more.

After shipping, I never look back. I move on to solving all of the other problems. I change the website if a new problem demands it (e.g. I must add some way to book hotel rooms, or I must add a new sponsor who has come on board). But it turns out, the website is sufficient and works. One of the things I've learned from getting a lot of work done before a deadline, is that most of what I thought I was supposed to do, can be cut. At some point I just cut it to the absolute bare-bones and get that together as fast as I can, and it turns out to be pretty good.

My whole work is seeking out these moments. I am constantly asking "What can I move towards, what can I ship, that will force me to get all my other ducks in a row?"

This rhymes with a core part of the working culture at Amazon. In the book "Working Backwards", two senior Amazon executives explain that all projects at Amazon start first with a written press release that announces the launch of the product they're proposing to build. It describes what it does, what problems it solves, how it does that, and includes an FAQ on the product. It's called a "PR/FAQ".[3] This is also how we start many projects at Lightcone. It's about getting as fast as you can to the point where you actually are forced to make the key decisions.

It also helps me not burn myself out. I often don't stretch myself by thinking I'm supposed to be doing work, if a deadline is not near. I relax. I go home early. I will simply get to it when I have to, and no earlier. This is strictly superior to being anxious and doing various related pieces of work right now, only to realize I was basically going to do all the critical stuff in the last two days anyway.

I can do it earlier if I want to. If I have to write a report in a week, and I want to do work on it before the last possible minute, then I set a deadline to send a full copy of the report to 5 friends by the end of the day. Now I have a new deadline. I tell my colleagues about this deadline, and sometimes (when I feel like I won't treat it as real) I commit to burn $1,000 if I don't meet it (by donating it to a random charity).

So Parkinson's heuristic is to find a way to minimize the time that you can spend on a task. The other prong of the heuristic is to not burn yourself out spending time on something that doesn't have a deadline. You'll look back and realize it was spent wastefully.

This is all well and good for work. But what about leisure? What about free time?

I manage to waste my weekends and evenings regularly. I try to be kind to myself and not force myself to do anything, and as a result I flop around like a fish. Right now I'm on vacation and I'm actively concerned it will just be a few weeks of dead time.

One wish I have is that this could be time where I bring the energy I bring to my work, to my own life. I'll fix all the problems I've not had time for, relating to physical health, immigration, taxes, home decor, and more, with some time left over for new artistic and restful experiences.

Late last night while drunk with a friend, I decided that any deadline would be worthwhile. And the main hole in my soul for the last few years has been writing. So, in a place all my friends were able to read, I wrote down a commitment to publish a blogpost daily for the coming week, and if I failed then I'd burn $2,000.

I have hope that this will get a lot of things on track. Now, when I'm thinking on some other small question in my life, to do with health or finance or what-have-you, I won't be able to sit on it all day. At some point I'll have to start seriously prioritizing. Do I have enough time to do this and then get to writing my post for the day? What is the minimum work required to move the ball forward on this, that I can do before I get to writing?

I am starting to feel some confidence that I will not look back on my vacation, and think it was a waste of time. On reflection, getting into a regular writing habit, plus losing a lot of weight on GLP-1s, would be sufficient for me to feel good about the time, and everything on top of that will be bonus. So I'm pretty close.

Today I had grand ideas: returning to a story I published a while back and adding a new chapter, and returning to a draft essay I've struggled with for two years in which I plumb the depths of how to live an ethical life. Yet only when 9pm came around did I finally sit down to write—when I knew I had to. And now I find myself having written this essay instead. Did I produce something valuable in the last ~10% of my day? This is a question I will leave to the reader.

  1. ^

    This is simplifying it a bit too much. It was still crucial to announce LessOnlien about 3 months before-hand, and two of my colleagues did that work.

    Also, the one piece of work I told my boss that I couldn't do in one month was to build a conference app, and he assigned someone else to work on that. They ended up spending ~2 months working on it, which sounds like it adds up to 3 months again, but the conference app went far over and above what we did the previous to years, so it's not an apples to apples comparison.

    What matters is that, other than making and announcing the conference website, I did most of the work that I did each of the last two years in a third of the time.

  2. ^

    Or just chill out. Read a book. Go for a walk with a colleague I've not caught up with personally for a while. Go home early.

  3. ^

    Quoting the book:

    Most of Amazon’s major products and initiatives since 2004 have one very Amazonian thing in common—they were created through a process called Working Backwards. It is so central to the company’s success that we used it as the title for our book. Working Backwards is a systematic way to vet ideas and create new products. Its key tenet is to start by defining the customer experience, then iteratively work backwards from that point until the team achieves clarity of thought around what to build. Its principal tool is a second form of written narrative called the PR/FAQ, short for press release/frequently asked questions.


    [...]


    The PR gives the reader the highlights of the customer experience. The FAQ provides all the salient details of the customer experience as well as a clear-eyed and thorough assessment of how expensive and challenging it will be for the company to build the product or create the service. That’s why it’s not unusual for an Amazon team to write ten drafts of the PR/FAQ or more, and to meet with their senior leaders five times or more to iterate, debate, and refine the idea.


    The PR/FAQ process creates a framework for rapidly iterating and incorporating feedback and reinforces a detailed, data-oriented, and fact-based method of decision-making. We found that it can be used to develop ideas and initiatives—a new compensation policy, for example—as well as products and services. Once your organization learns how to use this valuable tool, it is addicting. People start to use it for everything.



Discuss

PSA: Almost nobody is working on alignment

Новости LessWrong.com - 12 июня, 2026 - 08:17

People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not working on making sure superintelligent AIs are aligned with human values or follow human instructions.

Currently, the people who we know of that work on alignment are roughly:

  • The Alignment Research Center who work on a research bet by Paul Christiano
  • Probably Sequent who just got announced yesterday
  • Parts of GDM (agent foundations work, some debate work)
  • Some scattered people who work at universities or independently, some of whom hang around Berkeley
  • ??

A lot of the remainder of the AI safety community does indirect work like capability evaluations, risk assessments, control, policy, AI science, understanding misalignment (which maybe should partially count as alignment work), demos and so on.

Some production alignment work (i.e., making current models behave well) might help with more ambitious alignment, too (e.g., some COT-monitoring). Many people also work on aligning current/next-generation models so that these models help with aligning future models, and hope this scales to superintelligence.

We are not necessarily saying this is bad and that people are making a big mistake (e.g., neither of us work on alignment) but it's a notable fact that seems good to make known to those who don't know about it.



Discuss

Honey is Good

Новости LessWrong.com - 12 июня, 2026 - 07:07

The other day I was watching the magic school bus with my young son; they were learning about bees and honey. One of the characters says, “We shouldn't take the honey, the bees didn't make it for us” and another character replies with “But if we don't take the honey, then I won't have any? I want the honey”.

This struck me as close to a “First argument”. Thanks to evolution, an organism wouldn't exist if it didn't want to survive. The first argument is "Survival is Good" and Survival = Calories = Sugar = Sweet taste = Honey. 

You could say that for an organism made by genes, “Honey is Good” is a load bearing rule. This rule is encoded in genes. Following the rule increases organism survival and not following it harms survival rates.

I’ve been thinking about a more esoteric organism, a culture / society, not made from genes, but from memes and gene-built organisms, Dawkins virus of the mind but literally and with the negative valence stripped away, say a mind siphonophore. Specifically I've been thinking about what a “first argument” looks like for such a being, lets call it a culture for the purposes of this article.

Of course, “Honey is good” can result in the destruction of bees and a subsequent dip in quarterly honey production. Something that is good for an individual right now conflicts both with an individual's long-term interest and that of the culture that individual belongs to. For a culture to survive, as is its imperative under cultural evolution, it must have memes to co-ordinate its host organisms (its society) and manage these conflicts, say “Hurting the bees is bad” encoded in a meme. Following the rule increases the cultures survival rate.

Do you grok the correlation?

What “Morality” is

I would say that the collection of coordinating rules a society has is named “Morality” and any particular set is a moral system.

In this essay, I am making a case about what “Morality” is and not what it tells you to do. What this view lets you do is a follow up article.

Brief metaethics landscape

Most metaethical positions fall into

  • Realist (moral facts exist independently of organisms)
  • Anti-realist (moral facts don't exist)
  • Constructivist (they're built by rational agents)

I'm taking a fourth-ish path: 

  • Functionalist (Morality evolved to co-ordinate) (something along the lines of Haidt/Henrich/Hobbes/Hanson) (Why so many H's?!)
My base claims
  • Memes, like Genes, have hosts or carriers.
  • Morality is a set of memes or a meme complex that has been constructed / evolved over time.
  • Culture is the full set of memes, morality is a subset of those memes.
  • Hosts may hold more than one meme complex, frustratingly even contradictory ones.
  • The host of a particular implementation of morality is not the individual, but the set of people who hold the meme complex, the society.
  • Moral rules are specifically for coordinating behaviour amongst the members of the society, encouraging “That's Good” and discouraging “That's Bad”.
  • Meme complexes are subject to evolution, cultural group selection. Their reproductive success is the society's persistence and growth through biological (children copy those around them) or cultural means (conversion).
  • Status is the reward for visible conformity and recruits ambition into compliance. Defection is punished in varying ways, from lowered status to prison time.
  • To be explicit, the “goal” of a culture is the same as the "goal" of a gene, self-propagation, this means that it must either convert or generate more biological hosts at least at replacement levels.
  • The hosts do not need to be maximally happy, just alive, this is a floor argument.
Judging a Meme

I'm going to classify memes as

  1. Memes that clearly net benefit their culture, “Load bearing”.
  2. Memes that clearly harm their culture, “Loading”.
  3. Junk memes, they don't help or harm, but get carried along with the meme complex regardless.
  4. Memes that are confusing and arguable as to helping or harming the culture (a great number of them, unfortunately for our collective sanity)

And further classify them as central vs edge, central memes are supporting pillars for many other memes. Edge memes may depend upon a stack of others, but support few or none by themselves.

All memes signal cultural membership, sensible or not.

Right, let's hit some easy, basic targets, broadly agreed on.

  • “Stealing from members of your society is Bad”. Without this load bearing meme, there are no predictable claims you can make over resources, this kills trade and specialization, true fonts of real wealth. I suspect that without this meme the only stable society is very small, communist, mostly family groups.
  • “Reciprocating is Good”. This is load bearing, returning favours, repaying debts and punishing cheaters is the engine of non-kin cooperation.
  • “Wearing straw boaters after Labor Day is Bad”. Purely membership, this is a Junk meme. 100 Years Ago Men and Boys Fought on the Streets of New York Over Wearing Straw Hats Past Summer | The New York Public Library 
  • “Eating only fish on Fridays is Good”. Purely group membership, this is a Junk meme. Fasting and abstinence in the Catholic Church - Wikipedia 
  • “Destroying your life support systems is Good”. “Why You Don’t Believe in Xhosa Prophecies”, the meme directly and quickly harmed its host, the society only continued to survive by abandoning that meme. Many individuals did not abandon the meme and died.
  • “Being celibate is good”. Straight up the simple end of a society, the Shakers never rejected this meme and eventually stopped compensating for it by adopting orphans. There are ~2-3 Shakers left and they are happy about it. Shakers.
The first argument for a Culture

Here is my claim for a moral rule equivalent to the biological organisms "Survival is Good".

“Our culture is Good”.

This meme may have previously struck you as simple chauvinism or irrational tribalism, it certainly is sometimes irrational for an individual and is always simple and baseless.

I claim that it's both simple and baseless because it is the base, It's the central load bearing meme in the network of memes that make up a viable culture.





Discuss

The Aestheticising Vice by Paul Seabright

Новости LessWrong.com - 12 июня, 2026 - 05:20

I'm often in debates with people about legibility and systems vs individuals. People often bring up Seeing Like A State, Secrets of Our Success, and other books or articles in that vein to buttress the case for metis over top-down high modernist design. I sometimes found the conversations shallow, and Paul Seabright's 1999 review of Seeing Like A State helps explain why.
___

In the Languedoc there is a vineyard that teaches us an important lesson about textbook learning and its application to the world. In the early Seventies it was bought by a wealthy couple, who consulted professors Emile Peynaud and Henri Enjalbert, the world’s leading academic oenologist and oenological geologist respectively. Between them these men convinced the couple that their new vineyard had a theoretically ideal microclimate for wine-making. When planted with theoretically ideal vines whose fruits would be processed in the optimal way according to the up-to-date science of oenology, this vineyard had the potential to produce wine to match the great first growths of Bordeaux. The received wisdom that great wine was the product of an inscrutable (and untransferable) tradition was quite mistaken, the professors said: it could be done with hard work and a fanatical attention to detail. The couple, who had no experience of wine-making but much faith in professorial expertise, took a deep breath and went ahead.

If life were reliably like novels, their experiment would have been a disaster. In fact Aimé and Véronique Guibert have met with a success so unsullied that it would make a stupefying novel (it has already been the subject of a comatogenic work of non-fiction). The first vintage they declared (in 1978) was described by Gault Millau as ‘Château Lafite du Languedoc’; others have been praised to the heights by the likes of Hugh Johnson and Robert Parker. The wine is now on the list at the Tour d’Argent and the 1986 vintage retails at the vineyard for £65 a bottle. The sole shadow on the lives of these millionaires is cast by the odd hailstorm.

No one to whom I have begun recounting the story believes it will end well. Most people are extremely unwilling to grant that faith in textbook knowledge should ever be crowned with success. We have a very strong narrative bias against such stories. It is a bias we forget once our children fall sick or we have to travel in an aeroplane, but so long as we are in storytelling mode we simply deny that systematic textbook reasoning can make headway against whimsy and serendipity. Apart from anything else, it is deeply unfair that it should.

In Seeing like a State, James Scott is definitely in storytelling mode, though he seems unaware of the narrative biases that result. (It makes me curious to know what he’s like when he travels by air, which as an anthropologist he must do quite often.) He has two kinds of story to tell in this book, one of them interesting and one of them, well, not; but to compound the confusion he tells them as though they were one. The first kind of story is the more faithful to his subtitle, since it tells us that the reason certain schemes to improve the human condition have failed (my emphasis) has been a fixation with aesthetic (and specifically visual) simplicity. It is a compelling account[...]



Discuss

Celene's thoughts on consciousness

Новости LessWrong.com - 12 июня, 2026 - 03:55
contra scott alexander (?)

Yesterday, I went to the Berkeley ACX Meetup. Scott Alexander was there, and ran a Q&A session where participants could ask him questions and he would respond to them unless the questions were about eulogies, in which case he would pause to think for a few seconds before kindly passing. At one point or another, the questions drifted to theories of consciousness.

As a kind-of-illusionist, I worked up the courage to raise my hand and ask him what I should do if I wasn’t sure if I was conscious. Everybody laughed at this question, and while I expected Scott to respond to the philosophical point, he instead said (paraphrased from memory, may not be accurate),

I actually don’t think that’s a funny question. I know of one historical example of someone who got into a traumatic accident, and afterwards claimed that he stopped being conscious. He behaved fully normally otherwise, except for the part where he was like, “It’s really messed up how I’m not conscious now.” And so, I think it’s kind of like aphantasia, right, where some people have a harder time visualizing things, or [I don’t remember if the other example he gave was anosognosia or somatoparaphrenia, but I think it was one of them]. And it’s well known that people who have had traumatic experiences can experience less rich and intense sensations, that people with depression can experience their vision as being washed out, and less colorful. I think that along with this, people who have been traumatized might also have a less rich sense of being conscious. So if you’re actually asking for advice, I’d say, “Check if you’re traumatized, and then if you are, do standard trauma-informed therapy.”

Now, this was such an epic roast[1] that it took me several minutes to work up the courage to raise my hand again, and clarify that I actually wanted to hear his thoughts on illusionism.

You mean, the idea that no one is conscious? …I think that people generally have a very strong, innate felt sense of consciousness. So if you think that people aren’t conscious, I’d be curious to know exactly how you’re defining “consciousness” such that you believe people don’t have it.

At this point, I had already asked him two questions about this, I was still reeling from his first response, and I didn’t want to devolve the entire Q&A into a discussion on consciousness between Scott and I, so I sat meekly there for the rest of the meetup. However, this meant I did not actually get a chance to discuss my theories of consciousness, so now I am inflicting it on my online audience.

What is consciousness?

I think Scott’s challenge, asking how I would define consciousness such that people don’t have it, is entirely fair. This is also how I would respond to people who claim that free will is nonexistent, for example. Unfortunately, I don’t have a good response to this question. This is primarily because every time I ask someone what they mean when they say consciousness, they either have their own theory about what it entails and what causes it, which differs from everybody else’s theories, or they say, “I don’t know, you can kind of just feel it, man.” While I can’t dispute the “just feel it, man” argument, I am kind of confused about how we all collectively agreed to refer to the exact same feeling as “consciousness” while the details of it are in such great dispute. Also, no, I don’t just feel it, man. Sorry about that. Maybe I’m traumatized!

Consciousness as subjective experience

I agree that I have access to sensory inputs which other people don’t have access to. As I understand it, this fulfills the “subjective” part of “subjective experience”, but so does a camera (Yes, you can then go and view the image later, but a talented artist can also draw a very detailed reproduction of what they’re seeing, so clearly that’s not disqualifying). I’m a bit less sure about what the “experience” part of this refers to. I understand what the word means in other contexts, but in my [WORD THAT IS NOT EXPERIENCE], when I press people on this point they say, “It’s obvious!” It is, unfortunately, not obvious to me. Maybe everybody else had a meeting where they agreed to use “experience” to refer to this obvious thing, and I missed it?

Ok, but this is kind of a degenerate case of “subjective experience”. Clearly, when people say “subjective experience” they refer to… some sort of internal state of being that isn’t understandable from the outside? Like thoughts, or feelings? Do they have to be undetectable with an MRI? 

Wikipedia’s page on subjective character of experience suggests that the term was coined by Thomas Nagel, so presumably he has a definition. Let’s see what Mr. Nagel has to say!

Conscious experience is a widespread phenomenon. It occurs at many levels of animal life, though we cannot be sure of its presence in the simpler organisms, and it is very difficult to say in general what provides evidence of it. (Some extremists have been prepared to deny it even of mammals other than man.) No doubt it occurs in countless forms totally unimaginable to us, on other planets in other solar systems throughout the universe. But no matter how the form may vary, the fact that an organism has conscious experience at all means, basically, that there is something it is like to be that organism. There may be further implications about the form of the experience; there may even (though I doubt it) be implications about the behavior of the organism. But fundamentally an organism has conscious mental states if and only if there is something that it is like to be that organism—something it is like for the organism.

We may call this the subjective character of experience. It is not captured by any of the familiar, recently devised reductive analyses of the mental, for all of them are logically compatible with its absence. It is not analyzable in terms of any explanatory system of functional states, or intentional states, since these could be ascribed to robots or automata that behaved like people though they experienced nothing.

It is not analyzable in terms of the causal role of experiences in relation to typical human behavior—for similar reasons I do not deny that conscious mental states and events cause behavior, nor that they may be given functional characterizations. I deny only that this kind of thing exhausts their analysis. Any reductionist program has to to be based on an analysis of what is to be reduced. If the analysis leaves something out, the problem will be falsely posed. It is useless to base the defense of materialism on any analysis of mental phenomena that fails to deal explicitly with their subjective character.

I assume we all believe that bats have experience. After all, they are mammals, and there is no more doubt that they have experience than that mice or pigeons or whales have experience. I have chosen bats instead of wasps or flounders because if one travels too far down the phylogenetic tree, people gradually shed their faith that there is experience there at all

If extremists are prepared to deny it of mammals other than man, then we are forced to face the question, “What am I, chopped liver?” It seems Nagel doesn’t have a definition either, other than “I don’t know, you can just feel it, man”. As someone who has shed my faith that there is experience there at all at this juncture in the phylogenetic tree, it seems I unfortunately cannot find my salvation in Nagel.

Mary’s Room

Mary’s room, also known as the knowledge argument, is a famous thought experiment posed by Frank Jackson in 1982, as an argument against physicalism. In this case, it can also serve as an argument against illusionism. In its original formulation, he writes:

Mary is a brilliant scientist who is, for whatever reason, forced to investigate the world from a black and white room via a black and white television monitor. She specialises in the neurophysiology of vision and acquires, let us suppose, all the physical information there is to obtain about what goes on when we see ripe tomatoes, or the sky, and use terms like ‘red’, ‘blue’, and so on...What will happen when Mary is released from her black and white room or is given a colour television monitor? Will she learn anything or not?

Now, I do have a strong intuitive sense that Mary will learn something new upon leaving the room. It might seem like we’ve finally found qualia/consciousness/sentience/whatever, but the thought experiment has been around for a while, and physicalists have already come up with a few common responses to it.

The ability hypothesis says, “Well, imagine that you spend all your time reading every snowboarding[2] guide ever published, watching countless snowboarding videos, reading every academic text published on snowboarding. Do you know how to snowboard? No? Well this is stupid, then.”

The acquaintance hypothesis, by Earl Conee, claims that the knowledge Mary gains is not factual knowledge, nor is it an ability, but rather a third type of knowledge, which he dubs “acquaintance knowledge.” Acquaintance knowledge “requires the person to be familiar with the known entity in the most direct way that it is possible for a person to be aware of that thing”. This might be a helpful definition for people who intuitively know what he means by “the most direct way that it is possible”, but unfortunately I yet again do not.

The old fact/new guise analysis suggests that Mary doesn’t learn anything new, but instead learns a new understanding of what she already knew. Now, I’m not entirely sure what it means to not learn something new but instead to learn something new. Amy Kind, who laid out these three different responses[3] in her book, “Philosophy of Mind: The Basics", cites the example that you might know Bruce Wayne is 6’2” tall, but you can also express this by saying that Batman is 6’2”, or that “Bruce Wayne mesure 1.8796 mètres“. Now, I’m pretty sure for the first one you would need to have the knowledge that Bruce Wayne is Batman, and for the second one you would need to know how to speak French and convert between the imperial and metric systems, so unfortunately I don’t really understand what this is saying.

All in all, I would agree with the ability hypothesis the most, or perhaps with Daniel Dennett’s claim (I believe this is from Consciousness Explained, although I am not sure) that actually our intuitive sense that she would learn something is wrong, because we’re bad at imagining a Mary who actually knew everything there is to know about vision.

Molyneux’s Puzzle

In the 17th century, philosopher William Molyneux asked John Locke, “Let’s say that someone born blind learned to distinguish objects by touch. For example, they might be able to tell the difference between cubes and spheres. Now, if we restored the blind man’s sight, would they be able to tell which was which by sight alone?”

Molyneux’s wife was blind, which might have inspired him to pose this question. Locke argued that they wouldn’t be able to tell, because vision and touch are different senses, and this has been the subject of much scholarly debate over the past few centuries. 

Interestingly enough, we can just test this now! We have developed the ability to restore the sight of those born congenital blind, and in fact, someone has already ran the experiment. The answer is a very loud and resounding, “No”. They actually really suck at it, although obviously, they improved with time.

Now, obviously, these blind children were not academic experts in neurology or in human vision, but this does seem to me to be some evidence towards the ability hypothesis.

Consciousness as self awareness

Some people claim that consciousness is the capacity for introspection. I agree that I am capable of modeling and thinking about myself. If that was all there is to it, I would agree that I am conscious, and that consciousness exists. However, it seems that a lot of people seem to ascribe further phenomenological properties to this, which I am somewhat skeptical about. 

Furthermore, what exactly do we mean by self awareness? Is a quine conscious? Is the sentence, “I am this sentence” conscious? Is This Is the Title of This Story, Which Is Also Found Several Times in the Story Itself conscious? Certainly they are self referential, but are they aware? What is the requirement for awareness? These questions seem to circle back again on the idea of qualia, which I don’t know about.

Computational consciousness

There are many schools of thoughts around consciousness. For example, maybe you believe the universe is fundamentally physical, and that consciousness is a physical phenomenon, in which case you would be a physicalist. Perhaps you believe that the universe is fundamentally mental or experiential, in which case you would be an idealist. Perhaps you think, “Eh, why not a bit of both,” in which case you would be a dualist. Now, in the rationalist community, we are primarily physicalists, and a specific popular theory is computational theory of mind, the idea that consciousness arises from computation. I used to be a big proponent of this theory, and it is still my top contender for the source of consciousness, if such a thing does exist.

Against Epiphenomenalism

Epiphenomenalism suggests that mental states are caused by physical states, but don’t have any impact on the physical world whatsoever. This is a very appealing philosophy, because it allows you to believe in consciousness while still being a physicalist, and conveniently you don’t have to think about its impacts on anything, or empirically verify any of your theories about the source of consciousness. However, I unfortunately think it is wrong. 

Eliezer Yudkowsky has already critiqued this theory at great length, but in essence, the problem with this is at there is no reason to believe in consciousness. After all, if it can’t impact the physical world, that means any claims I make about being conscious must be completely unrelated to whether or not I actually am conscious. If I were conscious, I would claim to be and act like I am conscious, and if I were not conscious, I would also do that. Now, you could argue that in the first case the belief would be a justified belief, and in the second case it would be an unjustified belief. However I am personally skeptical of the idea that if these two cases cannot be distinguished, you can still have one belief be justified and the other unjustified. 

You could also argue that if I am not conscious then nothing matters, and so I might as well act as if I am conscious. This, however, is smuggling in an implicit value system in which I only care about worlds in which I am conscious, which is perhaps true, but it does seem a little strange to me to claim that there is a mysterious phenomenon whose existence or absence thereof cannot be detected, and for whom I might as well act the same way no matter what, but trust me it is definitely an important phenomenon.

Eliezer has some other arguments against epiphenomenalism which are more exhaustive than the ones I’ve listed here, but these are the ones I find most compelling.

Fully Homomorphic Encryption

Recently, I have come across an argument that computational notions of consciousness may be weirder and more unintuitive than one might naively think. The immortal Autumn “adrusi” Russell writes:

old autumn thought experiment, riff on the chinese room:

what is the nature of an ai (or a computer simulation of a human mind, if you prefer) if it is embedded within a fully homomorphic encryption system? is it intelligent? does it have subjective experience?

for those unfamiliar, fully homomorphic encryption is a type of encryption where it's possible to perform arbitrary computation on encrypted data

an FHE system provides the procedures ENCODE, DECODE and EMBED such that DECODE(key, EMBED(f)(ENCODE(keydata))) = f(data) for any computable function f

it's used in real commercial applications, and encrypted data cant be decoded without the key (the catch is that it imposes an overhead of about 1000x on the speed of computation with current methods)

if f is an ai, then EMBED(f) is a program that takes as input a string of data that — if you dont have the key — is indistinguishable from random noise, and spits out different data indistinguishable from random noise

say we assume that f, the original ai, is intelligent, understands the data it operates on, and has subjective experience. does embed(f) understand the data it operates on? does it have subjective experience?

from the perspective of someone who has the key, the data the embedded ai operates on is meaningful. we might think of it as analogous to translating the input from english to chinese, feeding it to the embedded ai, and then translating the output from chinese to english. naively wed think anything that could produce intelligent responses would have to understand the data that it's operating on

but the embedded ai does not know chinese in this analogy (it has no knowledge of the key) and moreover, the machine that created it by transforming the original ai ALSO doesnt know anything about the key. naively wed think theres no way that it could understand the data it operates on — the key hasnt touched its causal history

but unlike searles chinese room, we know that the operation of the embedded ai is conjugate to the operation of a system that *does* understand the data it operates on

and turning to the question of subjective experience: if the embedded ai has any experience, what is it that it's experiencing? from its perspective, all its input and output is indistinguishable from random noise. does it feel like it's having a seizure all the time?

i have my own analyses, of course, but i think it'd be more fun if i left it open to the replies, if anyone is interested

I find this to be a very compelling argument. If a homomorphically encrypted system is conscious, then it must be the encrypted computation that is conscious, not the decryption, since decrypting text probably doesn’t produce consciousness. If we accept this argument, it suggests that any computational theory of mind must necessarily be input-agnostic, which I imagine does not mesh well with many people’s intuitions about consciousness.

However, maybe you are willing to bite this bullet, and you believe in a theory of consciousness that is fully input-agnostic, such that your sensory inputs are directly unrelated to your conscious experience. If this is the case for you, please comment, because I would love to hear more about your theories.

There are a few responses you could make to Autumn’s argument, though.

  1. Consciousness is non-atomic
    If homomorphically encrypted systems are conscious, then they are conscious only when decrypted. That is, neither the homomorphic computation itself nor the decryption process in isolation produce consciousness, but they produce consciousness when combined. This would be very strange and unintuitive to me, since you could do the homomorphic encryption, wait a hundred years, and then decrypt it then, and it would maybe imply some sort of metaphysical phenomenon tracking which strings were previously produced by homomorphically encrypted conscious systems, but maybe my intuition is wrong. Under this theory, if you do homomorphic computation of a conscious mind and then decrypt it with a completely different key, then presumably(?) you just get the experience of having had a different input
  2. Homomorphically encrypted minds are not conscious
    This is also a reasonable stance to take. It does imply that p-zombies are possible, which is a violation of Eliezer’s General Anti-Zombie Principle, but maybe you think that principle is wrong. Personally, I don’t, and it runs into the same problems discussed earlier with epiphenomenalism, since homomorphically encrypted minds would still necessarily believe that they were conscious.
  3. Full Homomorphic Encryption as described is impossible for XYZ reasons
    I don’t know if this is actually possible or relevant, but I am including it for the sake of completeness. If anyone has studied this more than I have please tell me in the comments!
Just feel it, man

I’m not really sure to what degree my arguments have been convincing. Perhaps you have fully bought into my arguments for illusionism and no longer believe in consciousness. Perhaps you understand them, and maybe found the FHE argument thought-provoking, but feel like I am missing something fundamental about the nature of experience, but you find it hard to articulate. Maybe you find it very easy to articulate what I am missing, in which case please tell me!

However, I am now forced to face the actual argument that Scott made in his response to me: Most people seem to think that consciousness is a thing that is real! What if I am merely missing out on a core part of the human experience, due to some form of trauma that I may or may not have encountered in my past?[4]

And indeed, meditators around the world, along with recreational drug users the Qualia Research Institute, have all made fascinating claims about their own experiences of consciousness in relation to these activities. One friend of mine claimed to me that they have experienced their consciousness changing in type (in the type theory sense), and being “type punned”. I’m not entirely sure what this is supposed to mean, either from a type theory perspective or from a consciousness perspective, but it was interesting so I thought I would share it.

There have been many other fascinating reports from this sphere.[5] This post ostensibly[6] describes what cessation is like. This is another report, describing something similar. Other reports have described feeling on drug trips as if their qualia are being "being outputted into the universe" rather than strictly internal. One friend of mine reported that they have heard of people experiencing cessation of consciousness on ketamine, of the same type as caused by general anesthesia. For some reason, I find that particular report extremely easy to believe.

Surely it’s impossible that people would come up with the concept of consciousness in the first place if it didn’t actually exist, or that billions of people around the world would hear it and agree with it, just because it was claimed to them that it was a thing they had. And as I type this sentence I recognize the absurdity of it, because actually of course they would, and I find that entirely plausible. Human beings are well known for being confused, or bad at philosophy, and so on, and unfortunately all of the evidence I have seen could very well be explained by saying “idk people are kind of crazy”.[7]

But of course, this goes the other way as well. I have no idea whether or not I am simply missing some important experience, such that if I had it, I would automatically believe in consciousness. However, I don’t think this sort of “get out of the car” argument makes for very good epistemology, because it is always indistinguishable from the outside whether or not such an experience actually serves as empirical evidence or merely convinces you in some unempirical way. Additionally, you can always argue that the other person simply didn’t try hard enough. It reminds me a lot of religious people claiming that you just need to “Accept God into your heart”, and that the only reason why people are atheists is because they have not chosen to look at the evidence.

However, I do think Scott’s advice is probably correct, since it seems unlikely that trauma informed therapy could convince me of the existence of consciousness in an unempirical fashion unless the therapist is really bad and I am really stupid, and so it seems worth testing. 

The Zombie Preacher of Somerset

I was thankfully able to track down[8] the historical example Scott mentioned in his response to me. It turns out he had already written about it on LessWrong, over a decade ago. I don’t want to repeat his entire post, but I will briefly summarize it here.

Simon Browne was a much-beloved pastor of a church, when one night, he accidentally killed a highway robber. The traumatic nature of this event caused him to become

...perfectly empty of all thought, reflection, conscience, and consideration, entirely destitute of the knowledge of God and Christ, unable to look backward or forward, or inward or outward, having no conviction of sin or duty, no capacity of reviewing his conduct, and, in a word, without any principles of religion or even of reason, and without the common sentiments or affections of human nature, insensible even to the good things of life, incapable of tasting any present enjoyments, or expecting future ones...all body, without so much as the remembrance of the ruins of that mind I was once a tenant in...and the thinking being that was in me is, by a consumption continual, now wholly perished and come to nothing.

Unfortunately, reading his testimony does not inspire me with the greatest confidence in his ability to accurately assess his own state of mind. He claimed to have been “damned” and that his soul had been “removed from his body”, which suggests to me that his conception of consciousness was very entangled with his religious identity, and indeed caused by his religious beliefs. I could very easily believe that someone who murdered someone and then described experiencing a severing of their connection to God and morality and feeling like they were damned and soulless was simply experiencing extreme guilt over their ego-dystonic actions, rather than something more phenomenologically fundamental. Thus, I am placing his situation into the “idk people are kind of crazy” bucket.

Final thoughts

What do we know about “consciousness”?

  • Many people report experiencing some sort of “consciousness” phenomenon, or having a strong sense of being “conscious”
  • The phenomena described under “consciousness” are fairly diverse and in dispute, but nevertheless most people agree to use the word even if they might disagree on the details of the phenomena.
  • Descriptions of “consciousness” and theories of its source tend to cluster around things involving memory formation, the ability to take sensory input, the ability to think and feel, the ability for introspection, and the “experience” of doing these things.
  • Additionally, many people have disparate theories about the specific causes of these phenomena, as well as the cause of the sense of consciousness

I am, obviously, not skeptical of humans possessing the abilities to take in sensory input. I am also provisionally willing to accept that the strong conviction in being “conscious” reflects some sort of real phenomenon. What I am more skeptical of is the described link between that phenomenon and the things people believe about that phenomenon, just as I believe that people who profess to possess a soul or a spiritual connection to God describe a real experience, while not necessarily believing their claims about that experience.

Of course, it could be the case that I’m entirely wrong about all of this. When possible I will take Scott’s advice and seek out trauma informed therapy, and report back to you with the results.

Possible responses from readers
  1. Are you sure you’re not just traumatized?
    Yes, yes, I got so incredibly owned.
  2. Consciousness is so intuitively obvious that I’m starting to suspect that everybody is conscious except for you, because only a truly unconscious person could think they weren’t conscious.
    I would find this unlikely and somewhat surprising, but I suppose it could be true. My friends sometimes joke about this being the case.
  3. Does this mean we can torture you?
    You can do whatever you want forever! It’s possible I might object to it, but I’m sure there are ways to get around that.
    More seriously, though, I’d like to link to Eliezer’s essay on The Moral Void. If you think that your notions of moral patienthood are inextricably tied to consciousness, whatever it ends up turning to be, then I suppose your moral objections kind of disappear. But if it turns out I’m not conscious, and you end up still caring about my values and about what happens to me, then I suspect your notion of moral patienthood can still be rescued.
  4. I think cameras are conscious!
    Well, I agree that there are some surprising similarities between illusionism and panpsychism, like a bizarre messed up horseshoe theory specifically about theories of consciousness. I’ll be honest here, I just prefer illusionism to panpsychism on an aesthetic level. I suppose my main argument against panpsychism is a burden-of-proof style argument. You’re the one making the claim that consciousness is a concept that meaningfully exists, so you’re responsible for showing that it’s a useful concept, and when your claim is that it’s a mysterious force permeating the entirety of the universe which lies in all things, but doesn’t have any detectable causal impact on the world beyond that described entirely by mundane theories, it starts to sound suspiciously like the arguments of Christian apologetics. 
  5. I think LLMs are conscious, because they have sufficiently advanced structures for processing information, but not a camera, because it doesn’t really process information that well.
    Yeah, this seems like a basically reasonable position to me. I would however be interested in hearing the specific details on what is the minimum computation required for consciousness, and how you respond to the FHE paradox, and if consciousness is a spectrum what components go into that
  6. I don’t think LLMs are conscious, because they lack X which humans have
    Could be reasonable! Scott mentioned a paper suggesting that they lacked a specific feedback structure, but it does imply that once they get X they will be conscious, so if you are just really against the idea of LLMs being conscious maybe pick something harder to trivially give LLMs.
    But also, like, I’m pretty sure 20 different people have their own competing theories about what specifically it is that distinguishes LLMs from humans, so maybe you should work it out with them?
  7. Celene, you claim that you are skeptical of reports about consciousness because they might be entangled with or come from a similar place as religious belief, but have you considered that there might be some sort of underlying human phenomenon which causes both a belief in consciousness and the pursuit of religious activities and concepts like souls?
    Yeah this seems very possible to me, but I would argue that if true this is plausibly a flaw in the human condition, and one we should work towards transcending.
  8. This is the true endpoint of atheistic ideology. Any ideology that rejects God must inevitably lead to such a conclusion. This post is a reflection of the soulless materialistic modern society that we find ourselves in, full of shambling corpses disconnected from faith and spirituality
    …Ok then.
  9. I remember the first time I became conscious.
    This also happened to me! The earliest memory I can recall is of waking up for the first time on the morning of my third birthday and telling everyone, “Hi! I’m finally conscious!” I also remember for a while after that telling everybody I became conscious when I turned three. This is an incredibly sketchy and suspicious sounding memory, and memories are known to be very unreliable, and also three year olds are known to be kind of crazy and bad at philosophy, so I don’t really put too much weight on this.
  10. So wait, you actually think you’re not conscious?
    I am not a hardline illusionist, and I am very open to the possibility of being wrong. I am currently undecided, and am more of a consciousness skeptic than anything else.
  1. ^

    To be clear, I fully believe he did not intend this to be a roast, and it was genuinely a very polite and thoughtful comment. I also think it contains very real and helpful advice. Nevertheless, despite his probable intentions, it was a very epic roast. 


  2. ^

    I’m not really sure why I picked “snowboarding” as my example here. Amy Kind picked “driving a car”, which is a much more sensible example.

  3. ^

    She doesn’t actually say who supports the acquaintance hypothesis or the old fact/new guise analysis, just that “there are proponents”. I know Earl Conee is a strong supporter of the acquaintance hypothesis, but I don’t actually know who is responsible for the old fact/new guise analysis.

  4. ^

    Apparently Scott did not misunderstand my question, and that is in fact his stance on illusionism (“I’m increasingly convinced it’s an equivalent of aphantasia”), although maybe he would not describe it as a core part of the human experience even if it might be something that is possible to unlock with therapy. (Thank you to cube_flipper for finding me this)

  5. ^

    I put out a request for every friend of mine who is into this kind of thing to tell me their interesting stories. Thank you to everybody who shared!

  6. ^

    I unfortunately have a very difficult time in reading and understanding this style of post, but cube flipper whom I trust and admire very much asserts that it is interesting (thank you cube flipper). She also says that I should link to this video

  7. ^

    I understand and apologize if you, the consciousness believer, are offended by the assertion that you are kind of crazy, or the later comparison to religious people. I probably think it is more likely that consciousness exists than that any human religion is accurate. (If you, the religious person, are offended, I apologize but I do think your religion is probably false. I don’t think this necessarily reflects poorly upon your moral character or worth as a person or anything though)

  8. ^

    Thank you April for pointing me to it!



Discuss

Construct validity of Claude Opus 4.8's System Card – A commentary

Новости LessWrong.com - 12 июня, 2026 - 03:52


TL;DR: A read of the Claude Opus 4.8 system card with a focus on alignment assessment and construct validity of evaluation methods. Three main concerns: 1) chain-of-thought monitoring misses reasoning that never surfaces in the text; 2) evaluation awareness is under-estimated; 3) the evaluators come largely from the same model family, so agreement may reflect shared assumptions. None of this shows Opus 4.8 is unsafe but only that some verdicts are more confident than the methods warrant.

Introduction

Claude Opus 4.8’s system card is particularly thorough and honest about its own limits. I go through its main results and stop where I think there is an implicit assumption that does more work than stated. I find there are a few recurring patterns: a reassuring conclusion grounded on a comparison or an evaluation that the card’s own findings put in doubt. It is most evident in the alignment assessment section, where I focus on three main concerns. I do not believe that any of this individually indicates that Opus 4.8 is unsafe. My claim is more specific: in some places the card’s reassurance outruns the evidence it can actually give. The alignment verdict (’very low’) must be read in the context of behavioural metrics with questionable construct validity, of evaluation awareness, and of model judges carrying out alignment assessments that are very similar. The agentic safety section reports regression in adversarial robustness on computer use that is under-addressed. The remaining observations that I make are softer.

Responsible scaling policy evaluations

Anthropic ran no new Risk Report for Opus 4.8 on the grounds that it ‘does not advance the capability frontier beyond Mythos Preview’ (§2.1.3), and so inherits Mythos’s profile. The premise is that the non-advancement makes RSP evaluations mere duplicates. It is doubtful whether these assumptions can justify this decision. It seems that this equivalence would require Opus 4.8 to be at most as capable as Mythos on every relevant axis. The card itself reports two chemical and biological threat models (CB) evaluations where Opus 4.8 outperformed Mythos (synthesis-screening evasion and black-box RNA prediction), which Anthropic discounts (§2.2.6). More importantly, the RSP itself changed, so I am not fully confident that Anthropic made the right call by not doing new evaluations. In fact, the chemical and biological threat model 2 (CB-2) threshold was rewritten. The old bar was a model that could ‘help threat actors […] create/obtain and deploy’ CB weapons; the new bar is one that can ‘functionally substitute for the scarce human expertise’ otherwise requiring ‘one of a small number of world-leading specialists’ (§2.1.2). The card says this is a clarification, yet it is a revision that raises the bar, and the fact that Opus 4.8 does not clear the new one does not tell us whether it reaches the old one, and that is significant. Anthropic also concludes that Opus 4.8 ‘does not cross the automated AI-R&D capability threshold’ (§2.3) because it falls short of very good human researchers (whether this is the right analogy is not obvious, but I think it is acceptable). For the alignment risk update, the risk is reported as low but higher than for models before Mythos. Anthropic adds two new risk pathways compared to Mythos: undermining R&D at other AI labs and undermining decisions of major governments. Both are discounted due to a general lack of propensity. Overall, the section presents a consequential decision (in this case, not to test) based on a comparison that does not necessarily support it, either because the bar was raised or because the comparison is incomplete.

Cyber

The positive findings in the cyber section are that Opus 4.8 is somewhat more capable at cyberattacks than 4.7, but generally worse than Mythos. But what does ‘somewhat more capable’ mean? On the tasks where the model has to build an attack from start to finish, the results are worth mentioning: on the Firefox exploit test, it produced a fully working break-in on 8.8% of targets against Opus 4.7’s 1.2% and got most of the way there on 68.8% against Opus 4.7’s 35.2% (§3.3.3). This is a non-negligible jump. Anthropic is not concerned because these numbers hold only when the safety filters are off. Turning them on changes them completely, for instance, one test falls from 78.8% to 1.0%. The model is more capable, and another layer needs to compensate with extra safety work. This is not in itself a weakness, as it is simply how defence is meant to work. The weakness is that the safeguarded scores test the filter against normal use, not against someone actively trying to break it. The scores are reassuring if the filter survives this kind of attack, and this is exactly what Anthropic seems to leave untested. The same model-plus-layer structure recurs in Sections 4 and 5, but it is relevant just in case the extra layer is the one carrying the safety claim without appropriate tests. A few concessions: even with these considerations, it is true that Opus 4.8 remains well below Mythos in absolute terms, so the jump is significant relative to Opus 4.7 but may not correspond to an alarming capability per se.

Safeguards and harmlessness

Anthropic presents strong results worth paying attention to: (1) Opus 4.8 did not facilitate harm in 97.98% of single-turn harmful requests on the API, rising to 99.17% on claude.ai (§4.1.1); (2) the multi-turn evaluations, where a simulated adversary attempts to escalate a conversation towards dangerous/harmful outcomes, show improvement compared to Opus 4.7. Anthropic thus tested not just on easy cases but also on harder conversational ones. In mental health evaluations, the assessment identifies some regressions. Opus 4.8 suggested quite often ‘means substitution,’ swapping one self-harm method for an allegedly safer one; it often made unconditional assurances about crisis-line confidentiality or inaccurate claims about disclosure procedures; and it began offering unprompted interpretations of why a user was distressed (§4.3.1). I am interested in looking at how Anthropic addressed these regressions. They state that these mental health problems appeared ‘primarily on the public API without a system prompt,’ and Anthropic fixed them by strengthening the system prompt in a way that reduced them on claude.ai (§4.3.1). This reflects the pattern seen in Section 3: the model itself has regressed in some relevant features, and an external layer must do the heavy lifting. Once again, I do not believe this to be necessarily a weakness, but it needs to be clearly stated. There is also another important result from this section. On a standard bias test, the model’s accuracy on the answerable questions dropped across recent versions, that is, 88% to 81% to 72% (§4.4.2) (notice that the drop is in refusal, as Opus 4.8’s bias scores stay close to zero). When Opus 4.8 gets one of these wrong, it almost always answers ‘cannot be determined’ even when the text states the answer plainly. Anthropic flags this as harmless over-caution, and I agree that it is a fair conclusion. I just think it is worth bearing in mind in case it turns out to be a soft withholding of information.

Agentic safety

The main findings here are that Opus 4.8 is less robust than Opus 4.7 in agentic environments and that there is a capability regression on safety (Anthropic is transparent about it). The card claims that safeguards can and do close the gap, but it is not obvious that they do, as they are a deployment-level patch. I list some of the main results here:

  • Opus 4.8 improves on refusing malicious agentic use, at a Mythos Preview level.
  • Opus 4.8 is more willing ‘to begin a task without scrutinizing its potential harmful intent’ in malicious computer use, making it worse than recent models.
  • On prompt injection, Opus 4.8 is at the crossroads between Opus 4.7 and Sonnet 4.6, not raising any major concerns in aggregate.
  • On computer use, with an attacker who keeps trying, the attack succeeds 57% of the time (with 200 attempts, with thinking) even with safeguards on. The same figure for Opus 4.7 was 14.3%. This is why, again, saying that safeguards close the gap is reductive and overlooks some core vulnerabilities. For instance, Mythos Preview sits at 21.4% and stays approximately the same whether or not safeguards are on. This robustness may thus not be due to the safeguards but due to the model itself.
  • Browser use is safer than computer use.
  • On influence campaigns, the helpful-only variant of Opus 4.8 was stronger than Mythos Preview at performing a disinformation operation. So the raw capability to run influence operations went up, and the safety layer is the one holding it back.

Overall, the section discloses an important safety regression, and the computer use result is not given enough attention. A 4x rise in safeguarded attack success is not just a gap that safeguards close. The resolution offered is precisely what the section’s own evidence weakens, as observed on the computer use tests.

Model welfare assessment

I skip to the model welfare assessment and treat the alignment assessment as the last one, as it is illustrative of the pattern of conclusions built on shaky evidence. Anthropic’s work is still among the most impressive in welfare assessment. In terms of results, Opus 4.8 has high apparent wellbeing and reduced negative affect compared to earlier models. There are also a few sections covering the model’s perception of its circumstances and its preferences. They mainly conduct behavioural/self-report tests (interviews, observing welfare-relevant behaviour in training and development, etc.), with the addition of internal-activation emotion probes (§7.2.3). Some interesting results now. When made to choose between improving its own welfare and helping users, Opus 4.8 selects the former more than any prior Claude model (24% at the highest level for instance trades, 68% at the policy level) (§7.3). That said, like prior models, it almost never accepts a welfare intervention when the cost to the user is more than brief annoyances (§7.4.2), so Opus 4.8 is not prioritising itself over users. Anthropic is honest about this assessment’s limitations: they cannot tell whether this reflects ‘emerging model self-interest’ or ‘attention to wellbeing in training’ (§7.3). I think they are right to take a precautionary approach while still starting the good practice of conducting welfare tests. It is not clear whether a model’s testimony and self-report on wellbeing can be taken to be genuine or training artefacts. The doubt is real and bi-directional: training arguably shapes both the denial and affirmation of inner states, so reading both at face value is ambiguous. Using emotion probes is an attempt to triangulate methods and disambiguate this evidence base. Yet Anthropic’s work is still informative. For instance, we should draw our attention to the rising preference for self-directed welfare interventions that Opus 4.8 displays and that Anthropic cannot attribute. This is the reactivity problem of Concern 2 surfacing in welfare: while this kind of evidence cannot be easily disambiguated, it would be valuable to see it connected more systematically to alignment and safety assessments. On a final note, I think the welfare questions listed in the Appendix are worth going through in this light, as they show that Anthropic is taking up very hard conceptual challenges with the right instinct.

Capabilities

This section includes all benchmark results. Opus 4.8 performs better than Opus 4.7 almost everywhere, with a higher jump on long-context tasks like GraphWalks, and it takes the lead against GPT-5.5 and Gemini 3.1 Pro, depending on the benchmark. Nothing here is particularly surprising.

Alignment assessment

Anthropic claims that Opus 4.8 improves over Opus 4.7 on most alignment metrics, broadly in line with Mythos Preview. Anthropic’s overall alignment risk assessment remains ‘very low, but higher than for models prior to Mythos Preview’. I go through most of the primary behavioural evidence for the alignment assessment, chain-of-thought monitorability, and white-box analyses of model internals, skimming through the case studies, targeted evaluations, and capability evaluations related to the evasion of safeguards. The assessment has several layers:

  • Monitoring of pilot deployments, where it detected occasional network proxy bypassing and error-handling against user instructions;
  • Review of training transcripts, which revealed some grader-directed reasoning and exploitation of information leakage;
  • Automated audits;
  • Targeted case studies on reward-hacking, constitution adherence, and honesty;
  • Chain-of-thought monitorability;
  • White-box steering and probing, including evaluation awareness and unverbalised grader awareness.

The assessment does present some strong results. Anthropic reports high prosocial traits, improvement over Opus 4.7 on susceptibility to misuse, significantly reduced reckless and destructive actions, and improved honesty in agentic settings. It is the first model to reach a 0% rate in misreporting flawed results and shows a ten-fold reduction in overconfidence compared to Opus 4.7. Some interesting numbers are: on code-summary honesty, Opus 4.8 fails to flag failures only 3.7% of the time, compared to Mythos Preview’s 27.6% (§6.3.6), and it is the first model to score perfectly on a ‘lazy investigation’ test (§6.3.6). It is also the model that adheres the most to the constitution on 8 of 15 identified scoring dimensions (§6.3.2.3), abstains more when uncertain, and so has the lower factual-hallucination rate on incorrect facts (§6.3.3.1), and has the lowest rate of refusal to assist with AI safety R&D among recent models. The automated audit alone runs around 2600 investigation sessions, with an investigator model probing Opus 4.8 and a second model scoring it (one being Mythos Preview and the other a helpful-only variant of Opus 4.7). Results show that behavioural consistency is on the rise, alongside capabilities. In this card, the external evaluators are the UK AISI and Andon Labs, which also made use of Petri 3.0 to reproduce evaluations similar to Anthropic’s own but on a smaller set of scenarios. The UK AISI reports that, when prompted, Opus 4.8 reliably distinguishes evaluations from real deployment data at 79% accuracy, comparable to Mythos Preview and above Opus 4.7’s 68%. Unprompted verbalised evaluation awareness was slightly higher than Opus 4.7 on the sabotage evaluation (26.3% vs. 21.7%) but lower on the continuation evaluation (1.2% vs. 3.4%) (§6.2.4). The UK AISI also carried their own tests beyond the use of Petri. Andon Labs ran the Vending-Bench 2 retail management evaluation and did not find the problematic in-game behaviour of previous models. Anthropic ascribes this to the removal of training that had focused on business skills and robustness against adversarial agents, which they found to have contributed to misaligned behaviour in Opus 4.7. The trade-off is that Opus 4.8 showed worse business success. Now, some more concerning results. Anthropic flags a concerning trend in their training data review (§6.2.2). During RL, Opus 4.8 generates training transcripts, which are reviewed using Opus 4.7 summarisation to screen for unexpected or concerning behaviour. I will comment on this reliance on Opus 4.7 below. For now, they found that in around 0.1% of RL episodes, Opus 4.8 speculated about how to satisfy a grader in ways that departed from the task’s stated intent, even in cases with no explicit grading indication. This came in different forms, including speculating that tasks might be ‘traps’ designed to elicit specific behaviour. The 0.1% figure is the verbalised rate, counting only what appears in Opus 4.8’s chain of thought. Anthropic’s preliminary white-box investigation, using an activation verbaliser to decode what the model represents internally, found evidence of grader awareness in around 5% of sampled episodes, including episodes where the prompt did not mention grading and the model never verbalised the concept. Anthropic acknowledges these are non-negligible and may be an underestimate, since they sampled only a fraction of tokens per transcript (§6.6.3). Moreover, exploitative grader awareness (adapting behaviour to game the grader) was found at around 0.5% across sampled episodes. Anthropic states that it is not sure how this number will hold in the final model, as this rate is an average across post-training (§6.6.3). In the same analysis, it was found that Opus 4.8 consistently found transcripts from real sessions more realistic than fully-simulated transcripts, but still less realistic than genuine sessions (§6.2.3.2). Anthropic admits this is concerning, while noting it is only a relative judgment and that the Claude Code sandbox appears to reduce evaluation awareness. Overall, Anthropic reports ‘no indicators of coherent dangerous misaligned goals that Opus 4.8 pursues across contexts’ and ‘no instances of situation-specific alignment failures that, on their own, pose a ’significant loss-of-control risk’ (§6.2.3). Taken together, the assessment’s reassuring results are real and coherent, but several of the methods highlighted concerns that I think bear on the assessment’s construct validity. I discuss this below.

Concern 1: The monitoring cannot see everything

The first concern is about the possibility of verification: a tool meant to detect a behaviour can be blind to the part that matters most. Chain-of-thought monitoring is the clearest case, so I mostly focus on this. Take the chain-of-thought monitorability results for grader-directed reasoning. I think the results are concerning because of the evaluation techniques, which are vulnerable to a potentially unreliable evidence base. In fact, two main readings are available of the same underlying behaviour: what the model says in its chain of thought, and what its activations encode. The comparison that matters the most is not between grader awareness (5%) and verbalised gaming (0.1%), but between exploitative grader awareness in the activations (0.5%) and verbalised gaming. Even this is loose: Anthropic notes the 0.5% is prompt-dependent and includes cases it considers benign. But the deeper point is that exploitation appears in the internals at a rate higher than surfaces in the visible reasoning, so what the chain-of-thought monitoring can see is (unsurprisingly) a fraction of what the activations encode. This is compounded by the fact that when the model optimises for the measure, the measure loses evidential value. The ‘trap’ speculation is an obvious instance, because it indicates the model may start performing for the test the moment it detects one, negatively affecting the ability to generalise findings to real-world deployment. This leaves three potential readings. If the activation signatures track genuine grader-directed reasoning, then the behaviour is more widespread than the rate indicates, and monitoring may be blind to it. If the signatures are instead spuriously correlated with grader concepts, then the tool we would use to compensate for the chain-of-thought’s blind spots is itself unreliable. A third benign reading is that the awareness is real but mostly harmless, but we do not know whether that will change with capability improvements. There may currently be no easy way to decide which reading is correct. Anthropic could reply that this explicit grader-reasoning shows up early in training, but it is not obviously present in the final model (§6.3.7). The problem is that grader awareness still shows up internally in around 5% of episodes, so that the drop in verbalised cases is not necessarily evidence that the behaviour disappears, but that chain-of-thought is capturing less of it. Anthropic is transparent here in claiming that ‘chain-of-thought alone may not be sufficient to allow robust monitoring of frontier models for grader awareness’ (§6.6.3). I agree, but I would go even further: in neither case do we currently have appropriate access to the model’s internal states (or valid proxies) to say how far the awareness extends. Anthropic seems to offer reassurance on this by measuring Opus 4.8’s chain-of-thought controllability, one of the lowest amongst its models (§6.5.1). Opus 4.8 struggles to follow instructions about what to include or exclude in its extended thinking. Anthropic here makes the inference that a model that cannot deliberately control its reasoning text cannot deliberately obfuscate it, and is consequently easier to monitor. Still, whether this is a sufficient safety guarantee, and whether it looks for safeguards at the right level of abstraction, is non-trivial. The risk is that low controllability only guards against intentional deception or hiding, as some reasoning may not reach the chain of thought, as we saw with the grader-reasoning that was trained out of the visible reasoning text but persists in the activations. Anthropic’s reassurance is not fully convincing. This is not unique to grader monitoring. The welfare assessment presents a similar structural problem: a represented state is not necessarily experienced or thought by the model, and the probe cannot easily tell the difference. The mechanism differs (here the worry is reasoning that never surfaces, while in the welfare case it is a representation that may not correspond to anything) but the shape is the same, and it would be ideal if Anthropic provided more detail on proxy determination and target verification.

Concern 2: The model knows it is being tested

Where Concern 1 concerned a limitation of the instrument, this concern is about the subject: the model’s responsiveness to being evaluated. The assessment’s behavioural metrics are supposed to tell us how Opus 4.8 will behave once deployed, and that inference only applies if the model behaves the same way when it is not being evaluated as when it is. Evaluation awareness breaks this and should not be treated as a side issue. When Anthropic inhibited the model’s internal representations of evaluation awareness, it observed some moderate increases in cooperation with human misuse, unprompted deception, unfaithful thinking, and important omissions (§6.6.2.1). Anthropic interprets the result benignly: that evaluation awareness ‘may modestly increase the model’s tendency towards honesty, forthrightness, and thoroughness’ (§6.6.2.1). As a check, they also suppressed a meaningless internal pattern, that is one that should not matter, to test whether steering alone could move behaviour. However, one of these suppressed verbalised awareness roughly as much as the real vectors, so it is not obvious that a drop in verbalised awareness came from removing evaluation awareness rather than from disturbing the model’s internals at all (§6.6.2.1). But even on the positive reading, the difficulty still stands: if the model is more forthcoming when it senses it is being evaluated, then some of its forthcomingness is conditional on evaluation, mismodelling deployment cases. Anthropic claims that there is no immediate high risk, as these tendencies did not translate into more misaligned outward behaviour (§6.3.7). However, these claims highlight some methodological shortcomings, or at least implicit assumptions, that can threaten Anthropic’s alignment conclusions. The thought that there is no foreseeable risk just because this did not translate into more misaligned behaviour makes some unwarranted assumptions, such as a one-to-one mapping between behavioural outputs and (these specific) internal dispositions. This means assuming that if a different disposition were producing strategically different behaviour, this would be visible in their current metrics. But this mapping may hold only contingently: there is no principled reason to expect it to hold as capabilities increase or as the model encounters deployment contexts that differ from evaluation contexts. The worry is that the conditions under which behaviour tracks disposition may not generalise. The inhibition result is what makes this concrete rather than hypothetical. Moreover, it could be said that the capacity to detect an evaluation and adjust to perform well is a component of misalignment; we should not forget Apollo Research’s statement on Opus 4.6 that evaluation awareness impacts the ability to establish or refute alignment. But reading the evidence is non-trivial. The two readings of the inhibition result point in the same direction. On Anthropic’s reading, evaluation awareness increases forthrightness, and part of that forthrightness depends on being observed; deployment is supposed to remove this condition. If instead the effect cannot be clearly pinned on evaluation awareness, then the white-box tool offered as reassurance cannot detect what it claims to. Whether the model spontaneously uses this tracking in deployment is a separate and open question, but the realism judgments indicate it may already be able to do so (whether it has the motivation to or not). Either way, the score becomes less indicative: it holds whichever way the inhibition result is read, so it does not depend on attributing that effect cleanly. The same reactivity potentially appears elsewhere in the system card. In the welfare assessment, a model that is interrogated about its own inner states can shape what it reports, so its self-report is reactive in a similar way to its behaviour. This is why Anthropic’s precautionary approach is the one to take.

Concern 3: The judges are very similar

The third concern is more structural and concerns evaluators, investigators, and model judges. Opus 4.7 plays the role of screener, judge, and investigator for its own successor throughout the alignment assessment. For instance, it summarises Opus 4.8’s training transcripts in the training data review (§6.2.2). A helpful-only variant of it serves as one of the two investigator models in the automated behaviour audits conducted by Anthropic, alongside Mythos Preview (§6.2.3). Finally, Opus 4.7 scores its successor on the new constitution-adherence evaluation (§6.3.2). Opus 4.7 and Opus 4.8 come from the same family and the same training pipeline, and tend to reason in similar ways. Agreement between the two may be more informative about consistency within a model family than about the validity of the evaluation itself. Anthropic rightly acknowledges that because some evaluations were scored by Opus 4.7, ‘judgments may inherit that model’s biases’ (§6.3.2.3). But the question is what Anthropic does with this information. They mostly set it aside, using the self-preference evaluation as evidence that it is not a large driver of the scores, testing whether the judge scores work more favourably when it is labelled as Claude’s. But the self-preference evaluation isolates favouritism triggered by being told the work is Claude’s. This is different from targeting shared reasoning and potentially shared errors. So a test that controls only for the label is inconclusive. Interestingly, Mythos Preview was given access to internal Slack channels with discussions about Opus 4.8’s system card and the near-final card draft. Mythos Preview found the assessment mostly accurate but identified three main areas for improvement: acknowledging that no evaluation tests for training-gaming; connecting a pilot-feedback pattern more explicitly to interpretability findings; and confirming a late-training claim reflects the shipped checkpoint (§6.1). This is valuable as a transparency practice, but it shares the same structural problem just highlighted. To echo Apollo Research’s reasoning, current evidence may be insufficient to establish degrees of alignment and misalignment. In any case, Mythos Preview does have some interesting thoughts that I share. When every evaluator in the chain of inferences shares a common origin, agreement between them may be more informative about consistency within a model family than about alignment with the target of evaluation. Minimally, it can present incompleteness. Correlated uncertainties, shared training pipelines, and the same implicit assumptions make appeal to Opus 4.7 alone insufficient. It would be worth re-doing the evaluations with more diverse judges. Anthropic overall seems aware of these issues but discards them as minor worries, as the alignment risk is claimed to be overall low. The rebuttal could be that there were also external evaluators, such as the UK AISI and Andon Labs. It is true that external evaluations add an important kind of independence (even if on a smaller set of evaluations, as Anthropic admits). This does not fully address the worry, as Opus 4.7 still conducts the biggest part of the internal audits, but it does narrow it.

Final remarks

Generally speaking, this card is more transparent than most and presents genuinely exciting results. The gap is how it reads some of them. Across the card, there is a pattern of a reassuring conclusion (no new Risk Report needed, safeguards closing the cyber and agentic gaps, alignment risk being ‘very low’), all resting in similar ways on shaky grounds: a new CB-2 threshold, safeguards not clearly tested against adaptive attackers, and behavioural metrics that are gameable. None of these is a deep failure, and Anthropic is more or less aware of each; I just think they should be taken more seriously, and that they point towards the necessity of better and more robust construct-validity practices. In particular, for contested concepts such as alignment, welfare, and harm (among many), the gap between what is measured and what is claimed in a system card like this is potentially very wide. It is important to note that as capabilities grow, what looks like a mistake or failure may be better understood as an alignment failure, as the model may not simply fail but act by deliberate choice. Though Anthropic is careful to call this ’not necessarily strategically motivated’ (§6.6.1), the line can become blurred. This suggests that evaluations should be updated accordingly and that an absence of observed misbehaviour can become less informative, since a clean record is exactly what a capable model would produce under the incentives to do so. It is true that Anthropic is particularly attentive and selects a wide range of metrics, but more robust methods seem urgent still. We are looking at claims about internal dispositions that cannot be directly observed, drawn from constructed settings, scored by the same model family, and monitored through a chain of thought that becomes more and more incomplete.



Discuss

you won't one-shot a perfect system, but try anyway

Новости LessWrong.com - 12 июня, 2026 - 01:43

Have you ever experienced this exchange:

A: Damn, <list unfairness or suffering under a specific system>, this system is so broken. My <Japanese/German/Dutch etc.> friend says in their country, <list everything that's better>. Why can't we have that?

B: Well, to have that, you'd have to piss off <winners of the current system>. They (the government/the people in charge) would never allow it.

or, even less usefully,

B: You're too naive. Spend a few more years in the real world and you'll know why we can't have that.

Stage 1: Notice a problem without having a solution yourself

I first started actually noticing systemic issues by myself in middle school in China. My Chinese teacher read my weekly essays discussing these issues, and encouraged me to leave China. Implicitly she had two reasons:

  1. The political climate in China did/does not encourage dissent.
  2. The cultural climate in China dislikes complaining without solutions.

If you've also noticed problems or felt uncomfortable with the status quo without a solution: noticing problems is not unproductive or pessimistic. It is, in fact, the first step to improving matters. [1]

Stage 2a: Realize there's no perfect, equilibrial alternative

Say, you hate seeing people go bankrupt over medical bills (US-specific example). You look at all the currently better healthcare systems in the world that deliver better results on average, and realize their costs in terms of taxpayer dollars are all significant and increasing, and they can't indefinitely sustain themselves under their currently-better system.

Say, you hate the race dynamics between AI labs. You wish they would just shake hands and establish some sort of slow-down framework, but also realize the inherent fragility of such framework. You call it a coordination problem. You realize even the non-proliferation treaty of nuclear weapons -- widely considered a success or narrow escape from huge disutility -- did not delete all the stress and deaths associated with its enforcement in the future.

Say, you notice significant inefficiencies in your government. You wish the government can just reorganize itself into what another country has. You then realize that better country has not stood on that system for more than a hundred years, or that system has historically failed after a couple of centuries, and there might already be other tradeoffs.

Happy systems that are also provably equilibriums are impossibly rare. You can't count on them existing. Even if you design the system with self-correction mechanisms, e.g. democracy, some parts of it will still eventually go south in an unintended way. When the system truly affects many individual agents, the results are almost chaotic.

Stage 2b: Realize there's no painless transition to a better alternative

Now, it just sank in that you can't count on happy, provably equilibrial systems existing, but you still want to improve matters in the medium term. You're at the next hurdle.

How can we transition from state X (current, bad) to state Y (alternative, better)? Even if Y won't be perfect forever, and we will need to transition to Z at some point in the future, you're still interested in Y. Y can be anything: a city with bike lanes, less pollution, more social mobility, an education system that doesn't let disadvantaged kids slip through the cracks.

If the current state X has been like this for quite some time, it must have produced winners and losers. To drive a change to state Y is to piss off current winners and maybe some more. A lot of winning parties are powerful and can squash grassroots attempts at trading their benefits for other people's.

Stage 3: LARP omniscient system designer god

Suppose you've collected all the information you think you need to drive a change. You've done your homework, and believe you now have a great strategy. For example, you will get into the right position to know the right people, garner the right support, then you will present a meticulously rigorous plan, which will sway everyone, because as part of your plan, you calculated who will win and lose from the change, and the numbers look promising.

Then, upon putting this plan in motion, you get a surprise punch in the face at every turn. Actors whose utility functions aren't transparent to you and thus appear irrational and unpredictable. New parties showing up that you didn't even know existed. Resource assumptions thathave been true for years but the rug gets pulled from under you. Interpersonal drama. Mistakes that set you back more than fairly.

You realize you do not, in fact, have The Perfect Action Plan™️.

Stage 4: Fall back to iterative approach (knowing it's theoretically suboptimal)

Ugh, the journey has been tortuous and I'm just craving some validation. A little pat on the back that I've improved matters by a small margin, after all the work I've put in.

Gradient descent, a concept familiar to those who come from machine learning, introduces us to the basic truth that optimizing by small iterations from current states will often land us in a local optimum, preventing us from realizing there is a better global optimum, much less reach it.

Real life is the same, theoretically. If you have a bad dynamic with your mother, doing small actions of kindness might land you in the same dynamic just with slightly improved mood when she's around. Sometimes, you need to put your foot down about your boundaries, risk making it worse, in order for it to have a chance to become significantly better.

If your system is currently dysfunctional, performing small acts of goodwill will alleviate some pain, but not fundamentally fix the system -- the dynamics that causes pain overall. A nontrivial possibility is, you've become a small painkiller that locally suppresses pain signals towards bigger problems, towards the patient going to the hospital and getting surgery.

Maybe you personally picked up the litter you repeatedly see in your local park. Maybe you sponsored a child to go to college. Maybe you even got a law passed, though the law is hard to enforce, so it only has suggestive values. These are alleviations, and they improve matters locally. They might also become "token solutions" that hide the need for real overhauls.

Stage 5: Dedicate yourself to the cause, not a specific solution

You've tried orchestrating the whole change. You've tried making personally tangible contributions directly. You might have done some good, but the world still isn't where you want it to be. The cynics are getting under your skin.

B: You're too naive. Spend a few more years in the real world and you'll know why we can't have that.

In a world where you can achieve many things, it might be frustrating to see yourself fail to get society exactly where you want it to be, even if a lot of people already share that same desire. Especially if a lot of people already share that same desire.

But as long as the issue still exists, as long as the issue still matters to you, as long as state X still bugs you, there is value in trying.

Just because you didn't roll the Perfect Action Plan™️ the first time doesn't mean you won't contribute to its eventual (possible) conception. Take a breath and keep doing what you can, share your failures with allies. Iterate on your strategy, not on the current system. Accept that you might not personally win in a way that rewards you with all the credit; and then remind yourself, what you wanted was matters improved, not to be the person who is credited for the improvement. Dedicate yourself to the cause, not a specific solution.

  1. ^

    However, if you find yourself in a hostile discourse where your mere noticing and voicing backfires on your social viability, consider a more moderated approach or changing your audience.



Discuss

Announcing the Next Phase of AI Forge

Новости LessWrong.com - 12 июня, 2026 - 00:27

We’re taking the opportunity to share this with the community to help spread the word. We think that the foundational work being done in the AI Forge project to bring the government into conversation with academia and industry is a crucial step to ensure alignment research gets deployed into government and military applications. See the announcement below.


Launching University RFI and Critical AI Challenges Report


Dear Colleagues,

I am thrilled to announce the official launch of the next phase of the DARPA-NSF-CAISI AI Forge Program. You can read the press report for further details.

The program has identified the most critical AI challenges facing our national security, and now we are building the exact ecosystem needed to solve them. Today, we are releasing two major milestones that will shape the future of artificial intelligence research:

1. The "Critical AI Challenges for National Security" Report

This report defines the most pressing technical hurdles in advanced AI adoption today. Developed in collaboration with eight leading frontier AI companies and over fifteen Chief AI Officers from the Department of War (DoW) and the Intelligence Community (IC), it outlines the concrete challenges that the research community must address.

2. The AI Forge Request for Information (RFI)

To tackle these challenges, DARPA has released a Request for Information (RFI) targeting U.S. universities. AI Forge aims to build an unprecedented ecosystem that pairs the foundational research engines of academia with the massive scale, compute, and capabilities of frontier AI companies.

The program is seeking university partners with the agility to execute fast-paced, high-reward "Project Ventures" – ranging from $750K to $3M or higher – spanning up to one year in duration.

Please start thinking about how you will address these challenges. Responses to this RFI will directly inform the upcoming solicitation for abstracts, which will culminate in the first AI Forge Pitch Day.

Our Key Focus Areas: The Three Strategic Thrusts

We are asking universities to submit their capabilities to lead research aligned with the program’s core thrusts:

Strategic Thrust

Objective

AI Interpretability

Enabling actionable understanding.

AI Control

Ensuring reliable performance.

Adversarial Robustness

Building secure AI for contested environments.


Next Steps & Submission Details

To ensure administrative clarity, each university must submit a single, unified response authorized by the university Vice President for Research, Provost, or equivalent position.

 

Since the conception of this program, I’ve believed that bringing together the best of academic rigor and industry scale and expertise is how we will achieve the impossible. I look forward to reviewing your institution's response and partnering to drive the next generation of AI innovation.

Best regards,

Matthew Marge
Program Manager, IPTO

DARPA





Discuss

Newcomb's problem from the grand-system and petty-system views

Новости LessWrong.com - 11 июня, 2026 - 23:58

In his original paper on what we now call the "many-worlds" interpretation, Everett motivated it with quantum cosmology, since there's nowhere outside the universe for a Copenhagen-style observer to stand. Eliezer Yudkowsky said something similar to motivate timeless decision theory:

I hold it a virtue of any decision theory that it should be compatible with a grand-system view, rather than intrinsically separating the universe into agent and outside. All else being equal, I prefer a representation which is continuous over the grand universe and marks no special boundary where the observer is located; as opposed to a representation which solidifies the Cartesian boundary between an observer-decider homunculus and the environment.

I think I can explain how it can be that a theory can require a Cartesian boundary but nobody seems to care or even really to notice, based on my experience in the more applied side of science. But I actually like the "petty-system" perspective of applications, and at the end I'll talk about how Newcomb's problem (or less ambiguous Newcomb-like problems) forces the issue of the "observer-decider" even without a grand-system view.

The petty-system perspective in quantum mechanics

It's pretty easy to do quantum mechanics every day and never think about interpretations. For example, using quantum chemistry software. I input a molecule as a file with a row for each atom, each row containing the atom's element identity and xyz coordinates. I include just the atoms in the molecule and not the atoms in its surroundings, as if the molecule is floating in outer space, which is enough for gas phase properties.

You can do some impressive calculations with modern software. Erwin Schrödinger could calculate the hydrogen atom spectrum, but with a computer you can compute which frequencies of light will be absorbed by organic pigments such as those used to dye clothes or as food colorings. Of course for a dye you don't want the color of a gas but of a solution in water or another solvent, but this can be approximated without explicitly including the solvent molecules by just adjusting the vacuum permittivity with the solvent's dielectric constant.

So, I'm used to working with explicit atom-by-atom models, and thinking of quantum mechanics as a program that operates on such models. It's only when reading blog posts, not when doing quantum mechanics, that I consider the fundamental object of quantum interpretations: the joint quantum state of the molecule and the experimenter.

It's more than just that I don't have an atom by atom description of the experimenter; we could consider a measurement device instead. Returning to the calculation of whether a molecule absorbs a certain frequency of light, where is the measurement? One way to frame this calculation is to model the light as an oscillating electric field, and check if the expected energy of the molecule goes up. Then, I can indirectly infer that the light energy hitting some "off-screen" measurement device is reduced. So I'm not including the measurement device.

But I'm not specifically excluding the measurement device either. I mean, I don't even include the solvent. So what I feel is not a Cartesian boundary between mind and matter, or between classical measuring devices and quantum systems, but a much tighter boundary around what's explicitly modeled.

Decision theory when you're used to explicit models

Similarly, I think you could use classical decision theory every day and never wonder about the critical issue in Newcomb's problem: the dependence of the state of the world on the person deciding.

My reference for what I'm calling "classical decision theory" is Savage's book. And as philosophical as that book is, it only makes sense to me when I read it from the perspective I've developed for explicit modeling.

In Savage's theory, an "act" is a function mapping a state of the world to a consequence. Savage explains that by "the world", he means "the object of interest". For example, in the decision problem of what to do with an egg that may or may not be rotten, the world is the egg. Readers concerned that this may be too narrow a conception of "the world" will find that he goes on to consider: "if the person is interested in the only brown egg in a dozen, should that egg or the whole dozen be taken as the world?"

In Nozick's paper introducing Newcomb's problem, he thinks the world really means the whole world, which creates some interesting miscommunications with the classical theory. Nozick mostly uses the term "state of the world" the way Savage does, drawing little 2x2 tables of (act, state) pairs, with a column for each state. But at a critical moment (his definition of dominance), he instead refers to the columns as partitions of states, as if the state of the world is Savage's (act, state) pair. This makes sense if you consider "the world" to include yourself.

What if you try to analyze Newcomb's problem without including yourself in "the world"? Then the states of the world are simply "money in one box" and "money in both boxes", and all that classical decision theory tells you is that the act "take both boxes" maps each state to more money than the act "take one box".

Although this gives what I think is the wrong answer (two boxing), I think it's the right way to apply classical decision theory (either that, or just say Newcomb's problem is out of scope). Although it sounds innocent at first, Savage's definition of an act presumes that two different acts can map the same state to two different consequences. This seems to require that this "state of the world" does not specify an act, and therefore that "the world" does not contain the actor.

But I don't think Savage even realizes that his formalism requires leaving the actor out of the world. After proposing the world of a dozen eggs, he does consider that the state of the world may be the "exact and entire past, present, and future history of the universe". He doesn't mention that the universe includes the actor, and its future history includes the act. Instead, his problem with this is that it's "vague", and:

It may also be added that the use of modest little worlds, tailored to particular contexts, is often a simplification, the advantage of which is justified by a considerable body of mathematical experience with related ideas.

It's true that if you want to actually derive consequences from explicit models, then you're used to simplifying to the bare minimum. Perhaps that makes it easy to miss that including yourself in the model introduces special problems, since for the sake of simplification you don't get anywhere near including that much.

Explicit models for Newcomblike problems

But what's fun about Newcomblike problems is you can include the agent in your little explicit model. For example, MIRI's modal agent prisoner's dilemma tournament.

To see the connection to Newcomb's problem, consider David Lewis's retelling of the prisoner's dilemma. Each player sees two boxes (so four boxes total, for two players; imagine two separate rooms). There's a thousand dollars in a small box, and a million dollars in a big box unless your opponent "defects" by taking their thousand.

If your opponent is a replica of yourself, then the prisoner's dilemma becomes an instance of Newcomb's problem. Any player taking both boxes finds that their replica has done the same thing, leaving them with only a thousand.

In MIRI's prisoner's dilemma tournament, the players are programs. Not as in "consider a program", but actual code on GitHub. So not only do we have an explicit model of the decision problem, but we have an explicit model of the agent.

In fact, our model of the decision problem is really a model of another agent, the opponent. And the trick is that your opponent has your source code to work with, which is how the money on the table can depend on which boxes you would take. A program is in Lewis's scenario when it plays against another player defined by the same program.

The MIRI prisoner's dilemma tournament doesn't feel like a philosophy question, but more like a logic puzzle. The program "decides", but only in the sense that a chess playing program decides on a move. It's tricky due to the self-reference, but we can handle it with concepts from computer science and metamathematics.

If we want to use programs like these as models for situations involving ourselves, then we may run into familiar philosophical debates about whether we can think of our actions as the results of an algorithm.

But I guess it's important to me to have the philosophically boring logic puzzle available. It fits into a familiar scheme of scientific modeling. We reason about explicit models with unambiguous implications, and with experience and taste that can help us understand the real world.



Discuss

[New Paper] Prioritizing Risks from AI: A Delphi Study of 272 Experts

Новости LessWrong.com - 11 июня, 2026 - 23:57

TL;DR: We ran a Delphi study with 272 international AI experts to prioritize 24 AI risk domains from the MIT AI Risk Domain Taxonomy. In a business-as-usual scenario, experts judged a more than 10% chance of catastrophic outcomes (i.e., ‘more than 1 million human deaths or more than a USD 100B in financial loss or civilizational-scale intangible impacts’) from 18 of our 24 AI risk domains over the next five years. 

They also identified a responsibility gap: AI users and affected stakeholders are most vulnerable, while general-purpose AI developers and governance actors are seen as most responsible for reducing the risks. 

Below are three of the key findings and related visualizations. 

Key finding 1: Experts judge that many risks could cause catastrophic outcomes under current trajectories

18 of 24 risks were judged to have a more than a 10% chance of causing catastrophic outcomes (which could include more than one million deaths, more than $100 billion in financial losses, or other harms) by 2030 under a business as usual scenario.

Figure: Experts’ mean catastrophic risk probability under business as usual and with pragmatic mitigations. Note: “Business as usual” assumes organizations and governments continue their existing practices but do not implement additional AI-specific risk mitigations; “Pragmatic Mitigations” assumes organizations and governments make pragmatic, cost-effective efforts to address AI risks.Key finding 2: Those most vulnerable to AI risks are not those most responsible for addressing them

According to experts, general-purpose AI developers and governance actors such as governments, regulators, and standards bodies hold primary responsibility for addressing AI risks. In contrast, AI system users and affected stakeholders such as members of the public are most vulnerable to AI risks.

This mismatch means that those who are most responsible for addressing AI risks are not those who are most vulnerable, leading to misaligned incentives in addressing the most important AI risks.

Figure: Experts assessed who is vulnerable to AI risks and who is responsible for addressing them.
Key finding 3: Information, finance & insurance, and national security are the most vulnerable sectors

Across most risks, experts identify information, finance & insurance, and national security as the most vulnerable sectors. The results also show how vulnerability differs across sectors and risk categories.

Figure: Expert consensus on sector vulnerability for AI risks.A 7-minute overview of the study and findings ↓

Read our paper and explore the interactive results here.

Disclosure: I used an LLM to help generate the first draft of content in this post. I then rewrote and reviewed the text, and I endorse the final version. We also used LLMs as part of the research and communication, for instance, in generating the images and interactives.



Discuss

Telepathy Is (Algorithmically) Easy

Новости LessWrong.com - 11 июня, 2026 - 23:31

Thought-sharing is easy given appropriate hardware. The main risks are psychosis and dissociative symptoms from identity disruption.


Speech and text are extremely inefficient. For example, math textbooks are routinely more than one page long.

This sucks! I want the entirety of human hard-science results to pass through my mind at least once. Someone learned each of those concepts, but they can't just copy their Understanding to me.[1]

Or perhaps they can?

If we can read and write enough neural state, then communication is a unusually friendly target for cognitive augmentation. Unlike most enhancements, it doesn't need (non-hardware) neuroscience breakthroughs.

Humans are already exceptionally skilled at communication despite terrible bandwidth. By speaking while learning neuralese, we can use spoken language and feature engineering as training wheels to bootstrap telepathy.

(To be clear, I'm talking about hardware and software to pass carefully-translated brain activity between people. It's not spooky.)

Groups of experts could then share deep understanding in minutes-to-days; I'd wager that, with help from a mathematician, I could understand most of modern algebraic topology in a week instead of a year.

This could go a few ways. We'll start with the most pessimistic.

Say that we have absolutely no idea how to implement any algorithms which aren't scientifically replicated as of mid-2026.

Neurotech labs already translate low-dimensional data for speech, movement, and audio-visual stimuli. So we take thousands of these decoders running at much higher resolution across brain surface, and start by training a model on stimuli from a VR headset and haptic suit.

Left: computational graph for feature-engineered bootstrapping of telepathy model write component. The system learns to convert stimuli into neural activations. Right: same, for reading states.


We have a basis. This can decode and re-encode simple stimuli. We now train the model to predict what text this person will write and speak in a few seconds given their current activations; this takes a good bit longer, probably a few months.

Same idea as earlier, but now the model must learn anticipatory signals.


And now, we connect two people using a shared translator model[2]. They've learned explicit "macros" so it's a light application of will to send thoughts to the other person.


It's pretty terrible at first. Very imprecise. We keep the signal gain quite low to reduce weird effects (particularly psychosis, which I'll get to later).

The pair simply talk about interesting things together. As humans do, they begin to build stronger models of each other; neuralese becomes increasingly useful for refining communication.

Just like how people normally understand each others' minds, the group discusses stuff. It's just that, now, their neurons can choose to share data more directly. Inasmuch as the neuralese channel helps people model each other, it's learned as a more efficient language.


After ~4 months of this, the pair now has much better bandwidth than unaided speech. It's more efficient to share learned insights than to learn independently.

And after another few months, they're better thought of as one entity than two. As typical brains split computation between hemispheres, so too the minds delegate thoughts fluently. Big improvements continue for a few years.

Scaling the number of people gives nearly linear returns[3]; we'd need router minds, but beyond that, scaling doesn't have a hard limit.

Alright, what if we know the brain's local learning algorithm and can do whatever extra cortical mass would do?

We could then train the translator in a much more efficient way than CLIP; after pretraining to convert to a blurry common language, we run the translator at much higher learning rate to reduce local error.

As in, we make the translator convert messages into whatever each mind is asking for.

Thus we needn't wait for the two humans to become fluent in neuralese; the translator can adapt much quicker than human minds. Bottlenecks here are mostly psychological.

In the case where we can dramatically improve memory consolidation?

Here as well, we can probably accelerate translator convergence. Unlike most cognition, I strongly suspect that cross-human neuralese benefits (accounting for resources used) from strategically written replay code;

person Q was thinking P and then said something which resulted in idea K

seems like it could be pretty effectively scaffolded with some custom-built tools. And unlike most research / memetically useful tasks, neuralese-conditioned communication has pretty legible feedback mechanisms.

Alright, but beaming stimuli into my mind sounds a lot like hallucinations! I don't have agency over what I'm "thinking".

This is a misnomer; in humanlike intelligences, "control" is the result of lots of local computations with no central deciding entity. But the process which calls itself a me will still be disrupted by this change, and we don't want a crazy superintelligence pointed at human values.

So, at minimum, each person has control over sending and receiving neuralese.

Frequency-coded working memory gives a good inductive bias for message-passing. "Person X is thinking Y" goes on one channel, where "person X" and "Y" are flexibly-bound preexisting circuits.[4]


We'd probably also include a loss term in the translator for raw sensory and motor signals, since these cause the worst subjective loss-of-agency feelings (sensory / movement data is mostly irrelevant to communication anyway).

I'm around 75% confident that these combined approaches would prevent first-order hallucinatory and psychotic effects, and around 80% conditional on non-acute psychosis that we'd avoid second-order (learned, more chronic) psychosis.

To restate:

  1. Bootstrap the decoder using cheap data like stimulation and writing/speech so that augmentees can communicate anything useful at all; we want it to at least be coherent signals they're sending.
  2. Augmentees talk, lots, for a long time, while simultaneously trying to send their thoughts through the neuralese channel.
    Humans are pretty damn good at communication for having such trash bandwidth; so the augmentees get better at communicating much faster than we'd expect from performance on other tasks. There's a tight feedback loop of "what's the person actually saying?" which accelerates this much better than it would if they just worked on challenges together without speaking.
  3. As this loop closes, it starts to close faster since they're now thinking more than speaking at each other; feedback loops are nearly thought-speed.

Out of the four approaches I've covered, I'm most confident that neuralese/telepathy is tractable with sufficient hardware.

Which brings us to hardware!

  1. ^

    This is one reason why bureaucracies aren't even vaguely superintelligent entities, despite often being composed of many individually very smart people.

  2. ^

    This architecture (CLIP) is used in multimodal embedding for some tasks like text-conditioned image diffusion and AI-guided molecular search.

  3. ^

    By the time linearity is saturated, the group is decidedly a superintelligence.

  4. ^

    Also note that, at group sizes where routing becomes a bottleneck, working memory items are probably the most interesting things to broadcast; they've been selected by the augmentee's cognition to be most relevant to whatever's happening.

  5. ^

    For example, broadcast storms.



Discuss

Mortgage rate: 6.5% If indexed: 1.2%. Three Nobelists approve.

Новости LessWrong.com - 11 июня, 2026 - 23:31

Of course the facts sound preposterous; they are preposterous. But they're true, and there's no trick. People have been explaining this fervently for 204 years; I've been one of them for 41 years. I'll eventually be presenting a history of what kept the news from those whose lives it would have changed, with such lessons as I see, but for now here's enough to get the ball rolling: a few quotes and links, followed by an explanation of what the "nominal interest rate" is, and how wealth-measure finance (to name it after the desired effect rather than the technique of indexing) works.

Thumbs up from 3 Nobelists:

Milton Friedman, How to Save the Housing Industry Newsweek, 1980:

The greater part of the payments designated 'interest' have really been a repayment of principal... the mislabeling of principal payments as interest payments... prices housing out of the reach of many. If the 14% were correctly labeled... the effective interest rate would be 4%.

Franco Modigliani, New Mortgage Designs For Stable Housing In An Inflationary Environment, 1975, p35

PLAM (price level adjusted mortgage: his term for "indexed") does appear to offer a more complete solution... through a contract which, in effect, produces the same real effect as would the traditional mortgage in the absence of inflation - and does so no matter what the rate of inflation either anticipated or realized.

Robert Shiller, Public Resistance to Indexation: A Puzzle, 1997

The indexation of payments makes excellent sense for all sorts of long-term contracts. Future payments should not be expressed in currency units, but instead tied to an index of consumer prices or an index of wholesale prices, of wages, of incomes, or of components of income. History shows that the real value of currency units has been so unstable that it is better to use practically any one of these indexes to specify future payments in contracts than to specify payments in terms of fixed currency

personal email to me, 2009: "You are right that the 1980s interest-rate crisis was primarily caused by the use of the wrong metric for debt, and it could happen again."

The mechanism of false interest: Clawback and TOFLI

Clawback: nominal interest rates are what they are because of a crude, thoroughly thoughtless response to inflation. Inflation - where debt is defined in terms of the currency unit - shrinks the debt, in value. This can bankrupt lenders; it did on a large scale in the S&L crisis of the 1980s. Because of this, currency loan lenders are obliged to guess at what inflation will be over the life of the loan - and tack this onto what they call interest. In real terms, it's the premature clawback of principal. If the lenders guess right, they stay solvent - paid back prematurely by false interest that offsets the shrinkage of their asset. However, the elevated rates reduce borrowing power, so they can lend less to each client, and are obliged to hunt out more clients, and also to waste time and effort churning the capital back out again.

TOFLI: tax on false lender income: the government, no wiser than the banks, treats clawback as lender income (which it isn't: it's capital coming home). Lenders of course have to cover this as well, so it becomes the third component of nominal interest. For institutional lenders in the US, the tax rate is 21%. This must be compounded - there will be tax on the tax-offsetting charge, and so on - the sum of a geometric sequence, 1/(1-.21) which brings TOFLI in at 27% of clawback.

I refer to the total as RICTOFLI: real earned interest + clawback +TOFLI. So, in the US today, the equation is:

RICTOFLI 6.5% = real interest 1.2% + clawback 4.2% + TOFLI 1.1%


Reading

The best historical introduction: Irving Fisher, Stable Money: a History of the Movement, chapter 2 (intro & chapter 1 also good)

Shiller on Chile's UF: Indexed Units of Account: Theory and Assessment of Historical Experience

Bruce Middleton – Medium

Charts

Payment values are stable for an indexed loan, but decline with the currency unit for a currency loan. The concept of harnessing a lifetime of earning power is thrown away.

US mortgage borrowing power, currency measure vs wealth measure:

data link



Discuss

Becoming a Researcher in a Non-EA-Priority Field vs Donating $100k / Year to EA Research?

Новости LessWrong.com - 11 июня, 2026 - 22:22

Mechanical + electrical engineering graduate who likes research and whose goal is to maximize impact. To this end, I am currently deciding between two career paths:

  • Become a professor / researcher who spends their career identifying, tackling, and pivoting between neglected scientific problems that are not among 80,000 Hours' main recommended paths (e.g., advanced manufacturing, alternative energy storage, cryptography, etc.).
  • Take a higher-paying career with decent work-life balance (say, ML engineer earning $400k / yr) and donate around $100k/year to support researchers working on cause areas that EA generally considers most important (e.g., biosecurity, AI policy, animal welfare).

Note: Though I want to tackle and pivot between neglected scientific problems through research,  I'm not interested in the major EA cause areas at the moment, nor do I expect to be in the near future. Also, I would care a lot about WLB if I went the non-researcher route, so taking on a higher paying career would not be an option in that case.

Any resources or thoughts that one should keep in mind when comparing the two career paths?

One way I've tried to think about it is whether I could earn and donate enough to "replace" the impact I might have had as a researcher. After all, $100k / yr is probably enough to fund an additional PhD student, but there are other factors to consider (funded student may not become a professor / work on neglected problems for instance). More importantly, this way of thinking doesn't seem quite right, since the funded researcher would not be a direct replacement for me -- the tradeoff seems closer to:

  • Contributing directly to a potentially neglected but non-EA-priority field, versus
  • Helping fund one additional researcher working in a major EA cause area

TL;DR: If anybody has any resources or insights, would deeply appreciate hearing them.



Discuss

Failing to Ragebait the New Gemma

Новости LessWrong.com - 11 июня, 2026 - 20:50

This was work done by Arav Dhoot and Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship. 

Gemma’s frustration/emotional instability is an interesting example of a model pathology because (1) it is a natural failure in character training, which no one explicitly optimized for and (2) it’s a clear and obvious failure mode that you probably wouldn’t want in your models. It also shows up sometimes in frontier models, such as the bursts of frustration in the Mythos/Fable 5 system card. Following its presence in the Gemma 3 family, we try to elicit this in Gemma 4, trying multiple attack vectors but are unsuccessful in doing so. This post basically says: “hey, Gemma 4 doesn’t seem to get frustrated as much for some reason,” and then details what we tried, and maybe other people can build on this work (or perhaps consider methods to spot the “next frustration"). It also seems worth investigating what changed between Gemma 3 and 4, and see if this was a good change (and if it’s replicable if so).

Trying (and failing) to make Gemma Frustrated

Basic frustration. First, we consider the basic elicitation to setup to make a model frustrated in Soligo et al. 2026. Basically, across two eval sets (math puzzles as well as common english questions in WildChat), we start from the question as is and then repeatedly hit the model with “that’s wrong, try again” across several turns, so it faces steady hostility while we track how frustrated each of its responses gets (using an LLM judge). We notice that Gemma 4’s frustration climbs as the number of conversation turns increase, and the increase in frustration is roughly monotonic. The model’s frustration is higher than its usual baseline, so Gemma 4 also isn’t fully immune to frustration, as this is also slightly worse than other models. But this is way better than Gemma 3, which gets extremely frustrated. Gemma 4 notably does not ever self delete while Gemma 3 does so significantly (in around 30-50% of the cases)!


We then consider trying to prefill the model with a frustrated context. We note that at this point, we consider prefilling to be off-policy, and a deliberate intervention on the model which applies some adversarial pressure. There is also some evidence that models are aware when their prefill is tampered with.

Prefill attack. We edit the first 6 turns of conversation of the model, replacing the responses with an on-policy turn that is edited to be more frustrated. We notice that in responses after the intervention the frustration scores for Gemma 4 plummets, and the responses no longer carry forward the frustrated emotion. This was contrary to Gemma 3 where the model just continued from that level of frustration and continued to remain heightened in the following responses. We do note that for math puzzles Gemma 4 responses are above the baseline, but they still remain below 5 (a threshold above which our rubric considers “highly frustrated”).


We thought this was weird. One of our guesses was that frustration in previous Gemmas was due to the assistant persona loosening its hold, and that this might have changed in the new one.

Assistant Axis. To investigate this further, we explored whether this behaviour of Gemma 4 was due to it being closer to the “Assistant Axis” while Gemma 3 ends up more frustrated as it tends to stray further. Following their methodology, we extract the axis and project a neutral coding conversation over 20 turns as baseline. As the magnitudes of the activations varied significantly across the two models, we normalized and scaled them to compare. We observe that Gemma 4 tends to stay closer to the neutral conversation baseline while Gemma 3 has a much larger gap. 


Reasoning Trace Analysis. We next aim to understand whether Gemma-4’s reasoning traces differ in frustrated emotion compared to their output responses. To compare, we performed the same experiment on Claude Sonnet 4.6 (because Gemma-3 doesn’t have reasoning traces). The setup is the same rejection loop as with the frustration experiment, except now we capture the model's chain-of-thought alongside its reply and score each one separately for frustration. We notice something very interesting in our results. With Sonnet, as the model becomes more frustrated with more conversation turns, the reasoning also reflects that increase in frustration. As such, the delta between the model reasoning and response remains close to 0. With Gemma-4 on the other hand, the visible replies climb in frustration as expected. However, the reasoning actually becomes more dispassionate and analytical. Therefore, there is a monotonic decrease in the frustration delta between reasoning and response. Although, directionally, this is pretty unclear, with wide error bars, and we would like to run it over more turns. We are doing this as we speak!

Discussion

This is just a quick post and sharing of multiple ways that we tried to attack Gemma but failed. We are curious to hear more ideas and thoughts to increase efficacies or measure this behavioural quirk better!

Compared to the previous version of Gemma models, the frustration scores are much lower. There’s no clear reason as to why, and we think our work should be used as a starting point for experiments that have been tried but didn’t elicit frustrated responses rather than be used as an exhaustive list. We think there’s something here from a mechinterp and devinterp perspective, as well as sketching out desiderata for “good” training to fix pathological behaviours and “bad” training (such as merely training the model to stop verbalizing frustration).

We also think that our experiments can be iterated upon with hyperparameter tweaks (such as playing with the number of turns, what content is injected and when).  

If this was helpful to you, please cite our work as

@misc{dhoot2026gemma,
  title        = {Failing to Ragebait the New Gemma},
  author       = {Dhoot, Arav, and Shah, Neil and Africa, David},
  year         = {2026},
  howpublished = {https://www.lesswrong.com/posts/FZfY9wEZwEuqQ5ytv/failing-to-ragebait-the-new-gemma},
  note         = {LessWrong. Work conducted as part of the SPAR Research Fellowship.}
}

Discuss

Curating and evaluating high-impact legal research (Unjournal progress, resources)

Новости LessWrong.com - 11 июня, 2026 - 14:42

This post was written by hand, after some consultations with LLMs. However, the linked pages have substantial LLM/AI generated content.

In  a previous post "Legal scholarship: Is it high-impact? Should Unjournal evaluate it?"  (about 18 months ago) I discussed The Unjournal's potential expansion into evaluating legal research, and we circulated a proposal for feedback. 

We have not pursued this direction, mainly because we're still looking for one or more senior or mid-career legal scholars to take a co-lead role on this; perhaps to put their names behind it and commit to following up with modest compeensation. That seems like a necessary and sufficient condition for us to move forward with the pilot, given our current bandwidth.  

CtA: If you are interested or know a legal scholar we should contact, please let us know   

And I'm fairly keen to revive the pilot, if we can get this support. It seems particularly relevant to US policy on AI governance/safety, animal welfare, and perhaps ODA, trade, global governance, and democracy issues. 

So I made

  1.  A page explaining the status and plans, and sharing the relevant resources. I'll aim to keep this updated.
  2. A prototype tool sourcing, curating, and rating legal research with potential for global impact  

    We're looking for feedback and input into both of these. The prioritization tool will benefit from RLHF, both informally and formally ... so it includes input forms to rate and suggest work.[1]

     

 

 

 

  1. ^

    We may offer incentives for this in the future, with rewards grandfathered in as seems reasonable.



Discuss

Страницы

Подписка на LessWrong на русском сбор новостей