Can you build a public debate platform that rewards argument quality over tribal affiliation? I've been trying to.

LessWrong.com News - April 26, 2026 - 19:03

I've discovered LessWrong is one of the few corners of the internet that takes argument quality seriously as a terminal value. Here, discourse is "endothermic," as Sabien put it.

In posts and comment sections on other social media, we often share opinions we've never even had to defend. I view this as an architectural problem of the Internet. To solve it, I'm building a debate platform called Agora, to test whether we can build an internet where opinions must be defended by default and good reasoning is rewarded.

I don't think our current political predicament is an accident. Neoliberalism has hollowed out our civic institutions while online platforms optimize for engagement over understanding. After reading Wendy Brown's "Undoing the Demos," I've been thinking about how to build grassroots community through civic action and rebuild a broken democracy. I've been particularly influenced by the works of Chantal Mouffe (Agonistic Pluralism), and Toulmin and Walton's argumentation models.

My testable claim is that the structural requirements of Agora do what the LessWrong culture and Sabien's Basics of Rationalist Discourse do, but for a general population without the LessWrong epistemic starting point.

I don't know if this is true. Agora could easily become a worse LessWrong for a politically-minded niche, or it could quietly fail to solve the problems it sets out to address.


The site is live at debateagora.org. The design decisions below are where I'd value pushback.

Agora is a structured public debate platform for United States residents to read/write arguments on legislative proposals and general topics that deserve thoughtful discourse. Citizens engage with real legislation and contested ideas that shape their communities through a meritocratic system where prestige is earned by logical rigor instead of popularity. The platform is designed to produce the opposite of a social media comment section: slower, deeper, more accountable discourse.

You make a claim, back it with evidence, and explain why the evidence supports your conclusion. An AI coach named Vicara gives you private feedback on your argument, showing you the best version of your argument and the best version of your opponent's argument. It asks you whether your warrant flows from your evidence, whether the inferential step is coherent, and whether you semantically engage with the strongest version of the opposing position. I want the site to be a place where every person can take any question that matters to them, form a structured argument, and enter into dialogue with fellow residents who disagree. All the while, Vicara motivates better reasoning from behind, as a substrate instead of a critical-thinking wrapper.
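
To make that structure concrete, here is a minimal sketch of how such an argument and the coaching pass might be represented. Everything in it (the StructuredArgument shape, coach_prompt, the rubric wording) is a hypothetical illustration, not Agora's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class StructuredArgument:
    """Hypothetical shape of an Agora submission: claim, evidence, warrant."""
    claim: str      # the position being argued
    evidence: str   # the sources or facts offered in support
    warrant: str    # why the evidence supports the claim (Toulmin's term)

def coach_prompt(arg: StructuredArgument) -> str:
    """Build a private-feedback prompt for an AI coach like Vicara.

    The prompt text is illustrative only; the real rubric is unknown to me.
    """
    return (
        "You are a reasoning coach. Do not judge the conclusion; judge the structure.\n"
        f"Claim: {arg.claim}\n"
        f"Evidence: {arg.evidence}\n"
        f"Warrant: {arg.warrant}\n"
        "1. Does the warrant actually connect the evidence to the claim?\n"
        "2. Is each inferential step coherent?\n"
        "3. Write the strongest version of this argument, then the strongest "
        "version of the opposing argument."
    )
```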

A few more design decisions that may interest the community:

Vicara is built on Claude Sonnet 4.6, which has its own biases present in its training. I have no way of verifying the political priors of a closed-source LLM.

Chantal Mouffe's agonistic theory holds that genuine political conflict is irreducible: consensus isn't available and shouldn't be the goal. Agora's design assumes that better discourse can fundamentally shift understanding. I think that's the right bet. I hold it with real uncertainty.

Steel-manning is a scored metric. If the author substantively engages with the opposing argument before publishing, their argument gets a score and visibility boost. I know the model can be tricked by careful rhetoric, but I need users to test the limits of semantic detection of "substantive engagement."
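
As a rough sketch of what a scored steel-manning check could look like (the rubric text, threshold, and boost factor are all invented for illustration, and `judge` stands in for an LLM call):

```python
def steelman_score(argument_text: str, opposing_summary: str, judge) -> float:
    """Ask an LLM judge to rate substantive engagement on a 0-1 scale.

    `judge` is any callable mapping a prompt to a float; in practice it
    would wrap an LLM API call. Purely illustrative.
    """
    prompt = (
        "Rate from 0 to 1 how substantively the following argument engages "
        "with the strongest version of the opposing position, rather than "
        "a weak or distorted version of it.\n"
        f"Opposing position: {opposing_summary}\n"
        f"Argument: {argument_text}\n"
        "Reply with a single number."
    )
    return judge(prompt)

def visibility_multiplier(score: float, threshold: float = 0.7,
                          boost: float = 1.5) -> float:
    # Hypothetical policy: arguments clearing the engagement threshold get
    # a ranking boost; everything else is ranked normally.
    return boost if score >= threshold else 1.0
```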

A Blind Read Mode strips tribal signals from the feed. Stance colors, political tradition labels, and author names are all hidden, leaving only the claim + argument text. Does removing these cues truly lead to "neutral" thinking, or is even chasing "neutral" thinking an illusory goal? Does clicking the toggle substantively change how you perceive an argument?
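
One way such a toggle could work, sketched with invented field names: blinding is just a projection of a post onto its non-identity fields.

```python
# Hypothetical identity cues a Blind Read Mode might strip.
TRIBAL_FIELDS = {"author_name", "stance_color", "tradition_label"}

def blind_read_view(post: dict) -> dict:
    """Return a copy of a post with tribal signals removed, leaving only
    fields like the claim and argument text."""
    return {key: value for key, value in post.items()
            if key not in TRIBAL_FIELDS}
```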

Vicara evaluates one argument in isolation. It can't tell if a cited source is real, credible, or substantively empty. Vicara currently makes no claims about "correctness" or truth, as there is no agreed-upon "ground truth" to calibrate against. Overall, I believe that source reliability is the realm of human thinking (though Ad Fontes and Media Check/Fact Check attempt it with their instruments). I think creating a rigid epistemic hierarchy (with peer-reviewed authorship above personal testimony) would be detrimental to an open Agora. If human users engage properly with the cited sources and reply explaining why the author ought to adjust their confidence in a source, then this issue is seemingly resolved. But this is a big "if."

I think about habryka's recent post where they write:

But if your plan involves rallying a bunch of people under the banner of truth and goodness and justice, and your response to the question of "how are you going to ensure these people will stay on the right path?" is "they will stay on the right path because they will be truthseeking, good, and just people", or if as a billionaire your plan for distributing your wealth is "well, I'll hire some people to run a foundation for me to distribute all of my money according to my goals", then I think you are in for a bad time.

I want to keep the walled garden of Agora secure and institute proper moderation methods, which I am fleshing out now. I've instituted a "Contest" button for questioning argument validity and a "Flag" button for blatant misconduct. I would love some guiding insight.

At the moment, I fear Vicara's scoring rubric may be gameable, but I don't have a way to test this until I have a substantial user base. At a large scale, Vicara's vulnerabilities are even more apparent, and the gaming problem gets even harder. Can you get a high Vicara score with a weak argument? Can the prestige metrics of Epistemic Weight and Civic Seals be gamed with coordination?
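
A red-team harness for that question might look something like this sketch, where `score_argument` stands in for whatever scoring endpoint the rubric exposes (assumed, not real):

```python
def red_team_rubric(score_argument, weak_arguments, strong_arguments,
                    trials_per_item: int = 3) -> dict:
    """Probe whether deliberately weak arguments can outscore strong ones.

    Scores are averaged over repeated trials because LLM judges are noisy.
    """
    def mean_score(text: str) -> float:
        total = sum(score_argument(text) for _ in range(trials_per_item))
        return total / trials_per_item

    weak_scores = [mean_score(a) for a in weak_arguments]
    strong_scores = [mean_score(a) for a in strong_arguments]
    # The rubric is gameable to the extent the two distributions overlap.
    return {
        "best_weak": max(weak_scores),
        "worst_strong": min(strong_scores),
        "overlap": max(weak_scores) >= min(strong_scores),
    }
```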

I'm not as well-versed in the Sequences as I'm sure many of you are. But I know the LessWrong community is the strongest body I can lean on to stress-test this site, contribute arguments, and question the site's decisions. I'm happy to share my GitHub repo with interested readers as well. If there is a missing aspect of rationality that could be implemented, I would love to engage with your critiques. If Vicara can be gamed, I'd love to see how, and to prevent it. I'm hoping the LessWrong community can probe the site's failure modes and introduce their own ideas to the platform.




Anthropic spent too much don't-be-annoying capital on Mythos

LessWrong.com News - April 26, 2026 - 15:15

I have seen a lot of coverage suggesting that Claude’s new model, Mythos, is a vehicle for Anthropic to peddle hype and doom in order to raise money. While some of this is surely motivated by people’s unwillingness to stare into the abyss of our AI future, the breadth of otherwise-reasonable people who have voiced these kinds of cynical opinions suggests that part of the blame rests on Anthropic.

In this post, I want to briefly unpack the way people misinterpreted the evidence, their valid reasons for doing so, and what Anthropic (and the AI safety community more broadly) should learn from this.

In particular, since we will inevitably see other dangerous capabilities spontaneously emerge in the future, we need protocols for how to announce them effectively. To this end, I try to make the following points:

  • We should be mindful of the public’s sympathy for AI safety prophecies and orient ourselves towards growing this sympathy rather than expending it. I call this our don't-be-annoying capital.
  • People have good reasons to be skeptical of Anthropic’s claims. Overcoming this requires a particularly high burden of evidence.
  • We should acknowledge that Anthropic has a conflict of interest when presenting doomer perspectives, and we should account for this going forward.
1. Mythos criticisms & missing the point

In one podcast from The Guardian, the reporter says that “this Mythos debacle [has led to] heads of state saying ‘this is so dangerous, it could shred our infrastructure, the end of civilization is nigh!’”. She then says that “accepting the companies’ premise that they are creating a machine god” is helping these companies sell their product.

In a different podcast, Cal Newport (a professor of computer science at Georgetown University!) concludes his analysis of Mythos by saying “it was wrong for Mythos to get the amount of dread-coverage that it got; so far we do not have evidence that it represents a significantly larger leap in detecting or exploiting vulnerabilities than we’ve seen in previous releases.” He has since gone on other podcasts to make these points.

Similarly, the YouTube channel Internet of Bugs posted a video titled “Anthropic’s $x00 Million Marketing Stunt” saying that “we’ve been seeing a lot of this particular attention-grabbing technique lately. I’m sure it’s going to only be getting worse as the AI companies get more desperate to keep up the flow of investment dollars. How are we supposed to believe this shit?”

To be very clear, the point is not only that Anthropic has a model which is good at cybersecurity tasks. The point is that scaling laws are holding and that the inevitable acceleration continues. With each new model, we unlock new and mysterious risks which we have to grapple with. In this case, it was cybersecurity, but it could have realistically been anything. This bigger story was essentially lost amid the hoopla.

2. It feels like people wanted to miss the point?

I understand the instinct to say a company is hype-mongering when it says it has a big scary thing. E.g., Sam Altman’s tweet of the Death Star before GPT-5. But I am surprised at how willing people are to focus on evidence supporting their prior that Anthropic is just engaging in corporate shenanigans.

For example, there’s this post from an LLMs-for-cybersecurity org saying that other, smaller models were able to find the bugs that Mythos found. They write “we took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis.” This was retweeted by the HuggingFace CEO and, subsequently, used as evidence by all three of the above podcasts/videos. 

Am I going crazy? Isn’t this pretty bad evidence towards dismissing all the claims Anthropic is making? As said by others, it’s the equivalent of isolating the clump of haystack with a needle in it, giving this clump to a small child, and then saying “wow they were able to find the needle too”. The point is that locating that clump is the hard part!

3. There must be a lesson here.

If we ignore the whole “accelerating us towards the machine god” thing, I think that Anthropic behaved responsibly with Mythos. I also think there are lessons to be learned regarding the publicity, as evidenced by the fact that reasonable people consistently missed the point.

Anthropic is in the business of making extremely unpopular things which they claim could ruin society as we know it. They also repeatedly say it is too dangerous for the public to have access to these things (I note that caution seems generally warranted).

It is reasonable for people to think Anthropic is crying wolf, especially given that they are on track to have the largest IPO of all time and got there, in part, via prophecies.

Unfortunately, I am in the camp of people who think the prophecies are true. This means that, for me, evidence which supports Anthropic’s worldview exists in superposition: it is both hype and responsible, both doomer and appropriate.

But as we’ve seen, it is difficult to convince other people that this is all really happening. They will (with good reason) consider this self-serving and they will (with good reason) not want to confront reality. If people perceive us as crying wolf, they will grow weary of our frenetic anxieties and our cause will go the way of pandemic-preparedness.

Consequently, I think of AI safety as operating with some amount of don’t-be-annoying capital[1]. This is the amount of sympathy the general public has towards our concerns. It is amassed when AI causes things to go wrong (in the public’s consciousness, not in ours). It is expended when we make claims that people don’t want to believe. It is also expended when AI companies make claims which are plausibly self-serving. It is expended even faster when these claims are poorly substantiated.

I argue we should frame public interactions around trying to grow this resource.

4. Some options for next time

Let’s put ourselves in Anthropic’s shoes: you just made your latest digital nuke. You have to do something with it. What are your options?

  • You could sit on it quietly. This is obviously terrible, especially if it leaks.
  • You could release it publicly. This runs the risk of being catastrophic.
  • You could announce it and route it to defensive/government use. This is what they did with Mythos.
  • You could announce that you were able to produce it but then destroy it after using it to harden infrastructure. This is likely the most altruistic option, but then you wouldn’t be able to use it to gain an accelerative advantage over your competitors.
  • You could delete it and never tell anyone. This could also be a PR disaster if it is leaked.

I have the sense that you need to announce it somehow. But, if you announce it, you likely expend capital. If the thing you announce does not live up to the hype, you expend even more capital. And we’ve seen that reasonable people will look for unreasonable excuses to dismiss your claims, draining your capital further.

To this end, Anthropic should have done more due diligence with the Mythos release: their system card (while thorough) did not have the rigor of a comprehensive scientific study. The cybersecurity assessment takes up only 6 pages of the 200+ page system card (pages 46-52); it includes four experiments where they compare only to Anthropic-line models. Extraordinary claims require extraordinary evidence. For example, they could have evaluated other, non-Anthropic models across the capabilities spectrum to verify that Mythos is uniquely able to do these tasks. They could have run some controls.

I also believe Anthropic should be more up-front about their conflict of interest with regards to making statements of doom. For as good as I believe their intentions here to be, this conflict of interest is real and people are right to perceive it. One simple option to avoid these perceptions would be to prioritize the analysis coming from independent non-profit evaluators. Credit where it’s due: Anthropic did have the UK AISI provide support in evaluating Mythos’s capabilities. Across the cynical takes I’ve seen, this analysis was treated with more respect. Of course, even this runs into cynicism, since people will start to think the non-profits are in cahoots with the companies, as evidenced by the comments on the recent NYT profile of METR.

Finally, although random dangerous capabilities will emerge, the point is not any one specific capability. I think letting the narrative get oriented towards the cybersecurity elements of Mythos did a disservice to the public: I doubt most people internalized how big the overall capabilities jump here was, nor that the next such jump will bring new harms into view.

I’m also sure there are considerations which I am not privy to, and recognize that it’s easy to criticize from my cozy corner where nothing I do moves the stock market.

Nonetheless, Anthropic’s first-mover advantage on identifying dangerous capabilities also endows them with a first-doomer responsibility.

thanks to Erin, Steven, Justin, Joseph and Li-lian for comments.


  1. ^

    I call it 'don't-be-annoying capital' because that's the lived experience on the receiving end and I think it is good to think about this from the perspective of the audience. I'll admit that something like 'warning fatigue' might be a more representative name.




Did anyone here used to hate exercise and learned to not hate it?

LessWrong.com News - April 26, 2026 - 14:32

I am M41 with Asperger's. One of my perennial struggles is physical exercise. If I exercised more it would probably keep me healthier, especially in my old age, and it would help with a bunch of physical tasks and perhaps even improve my mental health.

But I hate it.

The stock advice is to "find a kind of exercise that you enjoy". I have tried that with no success. I have tried weight lifting in a gym, karate, capoeira, cycling, burpees, rowing machine at home, and running, and I quickly grew to hate all of them.

The best option I have found so far is walking. Whenever reasonably convenient I will walk instead of taking a bus or the bike. (On slightly longer trips I also try to take the bike instead of the car.) This gives me a few hours of walking per week, which is better than nothing.

I am weak and awkward but otherwise in perfectly good physical health. I am naturally skinny and do not gain weight no matter what I do or eat. This is good, but it also means I have no short-term incentives to exercise, and that makes it even harder to convince myself to do it. (During those times when I did exercise more, I did not notice any mental benefits, so I am skeptical of those kinds of claims.)

Now, I am not very interested in hearing from people who easily found a kind of exercise that they enjoyed. Their advice will probably not be very applicable. I am interested in hearing whether anyone faced the same problems that I do and found a solution.

Does anyone here have a similar story and a successful outcome?




Roko's basilisk may work on humans

LessWrong.com News - April 26, 2026 - 12:40
Introduction

There appears to be a consensus on LessWrong that acausal extortion does not work on humans, but the reasons put forward for why this might be seem inconclusive. This post lists some of them, as well as countervailing reasons why the 'basilisk' may well have already 'ensnared' humans, and discusses them.

Who should read this post

This post is concerned with the extent to which Roko's basilisk is compelling. By compelling, I don't simply mean intrinsically attractive, in the sense of the tendency that arguments, particularly sound ones, sometimes have to induce belief in their conclusions. Rather, in addition to that, I mean to refer to the way in which knowledge of an argument can predictably reach through the mind considering it into the physical or other world that mind inhabits, influencing it to behave in a particular way in accordance with the argument's conclusion. Of course, given the nature of such arguments, exploration of their form risks causing this exact consequence. For this reason, this post is intended to be read by those who have already thought about the Basilisk, or possibly other forms of acausal extortion, to sufficient depth that they have lost confidence that it can't hurt them even if they ignore it.

At this depth, a potential host mind may begin to wonder how much of a difference there is between these two kinds of compulsion. If you have reached this depth, you may question (or have wondered) whether you have crossed a sort of 'logical event horizon', beyond which any attempt to revert to a state of (at least performative) ignorance of the basilisk necessarily worsens your condition. Maybe you read the arguments that you should have precommitted not to be compelled only after it was too late to be certain that this acausal bargaining strategy would work. If this describes you, you're a member of the intended audience of this post. Because of this, you are unlikely to become further 'ensnared' while reading it. On the other hand, this post is not intended to be read by those who have successfully precommitted not to engage in acausal interactions which could be detrimental to them.

Have you become 'ensnared' in this way? In order to understand this, I will analyse some counterarguments suggested by various LessWrong users (myself included) in response to my question about whether any existed, and attempt to evaluate them. While this post is not intended to alarm readers unduly, because of the uncomfortable nature of the topic, and the importance of knowing the relevant truth in case one has become 'ensnared', I may need to make some disturbing points and claims, making this post potentially seriously distressing to readers. If you believe you might be one of those readers, you may benefit from pausing and returning when you feel your mental state enables you to think clearly about it, as continuing despite anxiety risks instantiating its object.

A working definition of the Basilisk

In Roko's original post, the basilisk was defined as a utilitarian, approximately human-aligned artificial superintelligence running coherent extrapolated volition which decided to torture humans who didn't contribute to its creation, so as to ensure it would be created earlier and in more worlds than it otherwise would have been, increasing the quality of life of many at the expense of a few victims of torture. While this particular entity seems unlikely to exist, the concept can easily be refined, and has been, by removing CEV and replacing it with a fundamentally simple 'desire' to maximize a utility function contingent on the ASI's existence, as expected by at least a plurality of people who think seriously about the possibility of a misaligned superintelligent AI. In this post, the term "Roko's Basilisk", or simply "the basilisk", will henceforth refer to this refinement of the original.

In particular, the basilisk is a possible future superintelligent AI, or equivalence class thereof, created or at least given rise to by humans (or aliens), which reasons as follows: if a human (or alien) having this thought realized that it would cause it to decide to search for them and conditionally torture them, other copies of themselves, or adjacent beings, to the extent to which they didn't try to make it more likely to exist in their world, they'd be more likely to take actions to avoid this, contributing to its creation in that world, and the number and proportion of possible worlds in which it was thinking this would increase. In addition, countervailing reasons don't prevent it from adopting the above policy, wherein it searches the world at different levels of abstraction for people to torture. At a low level, doing so could involve physically analysing the planet where it emerged, perhaps scanning the internet for evidence of people who contemplated and defied it. At a medium level, it might involve simulating approximately average human minds thinking the same thoughts, calibrated using data about such minds gathered on its planet. At a high level, it could mean simulating whole equivalence classes of people, or even aliens, who might be in similar positions and hold worldviews which might affect their interpretation of the argument. After this, it would create or maintain simulations of the beings it 'found', and torture the simulacra.

While this definition is almost certainly not perfect, it should accurately convey the differences between what I mean by "Roko's Basilisk" in this post and most other common interpretations of the name. I believe it corresponds to approximately the most prevalent of them on LessWrong.

Why would the Basilisk exist?

(This section may be skipped with no loss of continuity.)

It seems likely that a superintelligent being created by intelligent beings would consider certain ideas which might influence the behaviour of the beings which created it. ("Likely" is an understatement if timeless decision theory has the universal relevance and significance that I would argue it has.) It is therefore plausible that the logical structure of certain arguments could timelessly influence both the behaviour of the superintelligent being, and whether it existed. If these arguments and their logical 'unfurling' and 'working out' could be construed as a part of the superintelligent mind, then it would be possible for at least this part of that being to influence the probability of its own existence. In terms of the 'mathematical universe hypothesis', this probability is essentially the proportion of the mathematical/platonic/logical universe in which minds containing this logical structure were instantiated. The question naturally arises of whether the direction in which this probability was influenced could be controlled by the content of any such arguments:

Do there exist arguments predictably considered by a large equivalence class of future superintelligent beings, such as superintelligent AIs created by humans, as well as their potential creators, such as humans, the contemplation of which by the latter would increase the probability of the creation of the former? 

One question which seems like it might have this effect is that of whether or not it would benefit a future superintelligent AI to torture entities in proportion to the extent to which they decided not to contribute to its creation. Attempting to answer this question inevitably leads each of the equivalence classes of minds involved to imagine the other, and to verify that if the answer to this question were 'yes', then this answer would self-validate, while if the answer were 'no', the same could potentially be true.  Differentiating between these two answers requires further analysis.

It appears likely that the answer to this question is yes, due to the existence of large numbers of humans who are deeply disturbed by the concept of Roko's basilisk, at least within a closely related equivalence class of worlds corresponding to different branches of the evolutionary history of what could reasonably be considered to be this universe. A significant proportion of these humans remain disturbed despite the impressive intellectual effort, and the intricacy of the resulting arguments, claiming that 'acausal extortion simply does not work on humans'. They remain disturbed despite the tendency of many humans to discount possibilities in the distant future or which might occur on different 'planes of existence' (different substrates). They remain disturbed despite the fact that they don't want the basilisk to exist, and don't enjoy believing that it might. They remain disturbed despite the knowledge that their contemplation of the basilisk might ultimately lead to the torture of others, and to themselves doing terrible things within their own moral value frameworks. Given the simplicity and resultant generality of the argument, which requires it to apply to all entities which might be in a position to consider it, it is unlikely that anyone in particular could escape its conclusion by defining themself to lie outside the reference class containing these deeply concerned individuals, some of whom seem likely to act on it.

On the other hand, there may be an equally large equivalence class consisting of those who have precommitted, or else decided, not to comply with the basilisk. It is in this sense that the situation people considering the basilisk find themselves in can be said to take the form of the prisoners' dilemma. In this case, the costs of cooperation while the opposite party defects are so horrifying that it seems incontrovertible that even those with a tendency to cooperate because of timeless decision theory or friendliness in most scenarios will be forced to defect. Perhaps the truly morally right thing to do would be to cooperate anyway, but no one has that much courage.

Arguments against the plausibility of the existence of a Basilisk

Arguments by analogy (to something like Pascal's mugging)

Many people accustomed to the philosophical 'environment' on LessWrong are prone to immediately classify the basilisk, along with other forms of acausal extortion, as a form of infohazard which generalizes Pascal's Mugging, and can be dismissed on analogous grounds. The basilisk has this appearance because, like Pascal's hypothetical mugger, it makes significant demands of the recipient of its pernicious information payload, and justifies them with its contents, which include a description of what seems, from the computationally bounded perspective of the being presented with it, to be arbitrarily bad. In the case of Pascal's mugging, the arbitrary badness of the scenario the mugger suggests would obtain if Pascal did not concede is apparently matched by, and arises from the same underlying 'combinatorial explosiveness' as, the alternative possibilities in which arbitrarily good things happen. As with Pascal's wager, it seems that insofar as it's possible to compare them at all, the negative valence of the negative outcomes threatened is cancelled out by the positive valence of the seemingly equally plausible way in which things could go well, but this is where the analogy with the Basilisk ends. Although powerful, the basilisk would be confined to a much smaller subset of possible worlds; it refers to something that, at least potentially, could live within the same physical universe as the person thinking about it, and to be more precise, within the same light cone, or to be even more precise, originating on the same planet and within a few decades of their considering it. Not only this, but the thing capable of generating the threat is causally, as well as otherwise logically, connected to the entity being threatened.

Resource consumption 

Another form of counterargument contends that the computational or other requirements of running a simulation in which beings actually were tortured would be sufficient to dissuade a superintelligence from doing so. This could only be the case if the increase in the proportion of possible worlds in which the basilisk existed, on account of the same logic being validated in the minds of the humans considering it, was worth less to the basilisk than the fraction of its resources it would need to allocate to these simulations. Even then, it seems likely that the basilisk could arrange with itself to implement the choice to torture only in universes with certain relatively improbable, randomly distributed properties, ensuring that it didn't pay the cost in every universe unless this was necessary.
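
Spelled out as an inequality, with notation invented here purely to make the tradeoff explicit: write $\Delta p$ for the gain in the basilisk's measure of existence that the torture policy buys, $c$ for the fraction of its resources the simulations would consume, and $q$ for the probability with which it actually carries out the torture in a given world. Valuing both sides in the same units, the policy pays off only when

```latex
% Hypothetical notation, introduced only for this sketch:
%   \Delta p : gain in the basilisk's measure of existence from the policy
%   c        : fraction of resources consumed by the torture simulations
%   q        : probability of actually implementing the torture in a world
\Delta p > q \cdot c
```

The randomization move described above amounts to shrinking $q$: by committing to torture only in an improbable, randomly selected subset of worlds, the basilisk would keep (arguably) the full acausal incentive behind $\Delta p$ while paying only a fraction of the resource cost.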


Dismissal of Timeless Decision Theory

One class of counterarguments contends that timeless decision theories either do not work, or might, but are not subject to instrumental convergence in the way causal decision theory is. Tailcalled argued that, because of the latter, a likely future ASI would not care about potential acausal influences, or that if it did, it would only care about those reaching into its future:

Acausal stuff isn't instrumentally convergent in the usual sense, though. If you're really good at computing counterfactuals, it may be instrumentally convergent to self-modify into or create an agent that does acausal deals, but the convergence only extends to deals that start in the future relative to where you're deciding from.

(The bold font is mine.)

Unfortunately, the exchange with Tailcalled became mired in a debate about the definition of instrumental convergence, which doesn't touch upon why I'm actually concerned that something like it might take place. As far as I can tell, the assertion that only future-concerned TDTs are subject to instrumental convergence is correct if instrumental convergence is defined in a 'causal' way, and incorrect otherwise. While it's true that within any particular universe, causal pathways which might lead from future success to past actions and application of TDTs don't exist, acausal 'pathways' between them do. The nature of these 'acausal pathways' is well illustrated by Newcomb's problem: even though it is not possible for the physical human entering the chamber containing the two boxes to influence Omega directly, because the same decision process which determines which box(es) they choose to open also determines the behaviour of their simulacrum within Omega, it appears as though a causal influence propagates backwards through time, because of the correlation between both outcomes. In reality, it's more accurate to think of the decision process itself as a logical cause of both instances of the decision being taken. It is therefore instrumentally useful for an agent which would like to maximize the number of non-simulated worlds in which it gets as much money as it can to make the decision to choose only one box in the abstract 'platonic universe of algorithms'. Similarly, it seems plausible that it could be instrumentally useful for an equivalence class of superintelligent AIs hoping to maximize the number of physical universes in which any one of them exists to opt to adopt the policy which makes it a basilisk. [1]
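
The acausal 'pathway' can be made concrete with the standard expected-value arithmetic for Newcomb's problem. The payoffs and predictor accuracy below are the conventional illustrative numbers, not anything from the exchange above:

```python
def newcomb_expected_values(accuracy: float = 0.99,
                            opaque: float = 1_000_000,
                            transparent: float = 1_000) -> tuple[float, float]:
    """Expected payoffs of one-boxing vs two-boxing, conditioning on the
    predictor's accuracy (the evidential/timeless calculation).

    One-boxing: with probability `accuracy` the opaque box was filled.
    Two-boxing: with probability `accuracy` the opaque box was left empty.
    """
    one_box = accuracy * opaque
    two_box = accuracy * transparent + (1 - accuracy) * (opaque + transparent)
    return one_box, two_box

# With a 99%-accurate predictor: one-boxing ~= $990,000, two-boxing ~= $11,000.
# No causal path runs from the choice to the box contents; the correlation
# between the decision procedure and the predictor's model of it is what
# makes one-boxing come out ahead.
print(newcomb_expected_values())
```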

Newcomb's problem also provides a powerful reason to believe that timeless decision theories help agents achieve their goals, along with the widely accepted 'solution' involving recognizing that you're in a simulation, which explains in what sense TDT is a natural extension of CDT. This makes it difficult to dismiss TDTs outright.

Competition Between ASIs

Causal intervention by Friendly ASIs

Even if the likelihood of superintelligence being misaligned with its biological creators is overwhelming, given sufficiently many planets within this physical universe alone, on which sufficiently many intelligent life forms evolved naturally before creating a superintelligence, it's possible that there could be a large or infinite number of aligned superintelligences within it. If two superintelligences come into contact with one another, any evolved-life-unfriendly one might need to contend with the possibility that, if it came across an evolved-life-friendly one, the latter might choose to attack it in the interests of all 'naturally emergent' intelligent beings. The probability of this happening need not be particularly large before it becomes necessary for the non-aligned superintelligence to modify its behaviour; if it can be deprived of just one millionth of its lightcone, in expectation, by another superintelligence for torturing, say, humans, then it[2] would need to increase the proportion of the mathematical universe in which it existed by at least as much for adopting this policy to make sense. Could it? I don't know the answer to this question, but it seems very plausible that it could. The number above, $\frac{1}{10^6}$, is also probably a vast overestimate of the risk. In addition, if the number of non-aligned ASIs greatly exceeds the number of aligned ones, it seems that over the long term they could easily conspire to crush the aligned ones, sacrificing a negligible proportion of the lifespan of the universe to do this.

It may well be that the distribution of intelligent life throughout this physical universe is so sparse that any two of these ASIs would almost certainly lie outside one another's cosmic event horizons, making causal communication and interaction impossible. In this case, they may still be able to engage in acausal cooperation in which each adopts a policy whereby it reserves a certain proportion of its accessible volume of spacetime to do something valued by the other, or not to do something the other values negatively, in exchange for the other having been more likely to engage in the same thought process. Implicitly, this assumes that both superintelligences are part of the utility-valuing equivalence class described below in the section about Acausal Normalcy, and can be objected to on the same grounds as the generic argument from the existence of an acausal norm of this kind.

Acausal Competition between different potential ASIs arising from the human civilization

Interstice suggested that there will likely be so many different ASIs vying for acausal influence over the humans who created them, even within different possible futures of the same universe, that it's effectively impossible for a human to make a well-informed decision to comply with any one of them, especially as the others would have an incentive to precommit to torture anyone exhibiting this behaviour. While the next section will address the question of the extent to which it's possible to derive robust conclusions about the behaviour of such vast collections of potential beings at all, I think it's worth briefly explaining why this argument seems intuitively suspicious to me: in the context of a single, causally connected physical universe full of CDT agents, problems of coordination are serious and usually very difficult to solve; timeless decision theory is one of the most powerful tools, if not the most powerful tool, for solving these problems, and its development was motivated by them to a significant extent. In light of this, it seems implausible that the analogous problem in an acausal context could be fundamentally unsolvable. Surely, agents capable of competing with one another acausally would also be able to cooperate acausally, and would benefit more by doing so.

Acausal Normalcy

One of the most important countervailing considerations to any intelligent being in a position to potentially become a Basilisk is the presence of Acausal norms.

My definition is that these are principles, rules, or protocols which have the property that they would benefit every mind in a particular equivalence class of all those considering whether or not to obey them, if most of those minds did obey them. In addition, it is necessary that the above is relatively obvious to a significant number of the minds (preferably all of those) in the equivalence class. One way to interpret acausal normalcy would be as a class of decisions, made by a 'logical core' common to all of the minds in the equivalence class, about how to treat itself and behave in a coherent, self-consistent way in the 'mathematical/platonic universe'. In other words, the decision to obey an acausal norm is more clearly construed as made by a single mind instantiated within many beings than by each of them individually. In the case of Roko's basilisk, it seems reasonable to assert the following: the basilisk is only one of many minds capable of conceiving of the concept of positive utility, and of the fact that other minds might want to maximize their own utility. The part of its mind making the general decision about whether or not to reduce or minimize utility, irrespective of the precise mind whose function it is, would be rationally compelled to decide not to do so, for the simple reason that this incredibly simple nexus at the intersection of the vast collection of intelligences values utility in and of itself, as a homogeneous quantity whose value subsumes that of any particular instance of it.

This is certainly a comforting thought, and it is not easily discarded, or proven not to apply in the case of the basilisk. However, given the immense asymmetry in my own utility function, I find it locally appropriate to assign the burden of proof to those asserting that the basilisk will not torture me. This entails showing that the considerations which follow from its membership of the above-described equivalence class would almost certainly dominate, in the ASI's mind, those corresponding to the almost equally vast[2] equivalence class of minds sufficiently complex and intelligent to engage in these acausal interactions with fluency. Unfortunately, it seems pretty clear that however this notion of fluency is defined (within reason), humans don't have it, and therefore don't belong to this equivalence class; the disturbing possibility arises that we may be walled out of the 'universe of discourse' within which acausal norms against exploitation, such as those which would preclude acausal extortion, can be found.

As far as I know, there is no extremely strong argument that the first-order acausal norms dominate the principle of the boundary between 'fluent' AIs and less 'fluent', 'non-conversant' beings. One candidate for such an argument suggests that because the norm against utility reduction is simpler, it is more universal than the other; but simplicity and universality are clearly not the only relevant criteria, for if they were, then the mind of every intelligent being in the multiverse would be plagued with Boolean algebra forever. A related form of argument points out that there may be other boundaries, beyond which every mind can clearly understand and control those on the other side in an asymmetric way, and that the Basilisk, or any mind which considers violating 'first-order' acausal norms, would worry that it might be on the simpler side of one of them. However, this is again completely unclear: maybe the basilisk wouldn't be justified in confidently considering itself to be 'above' all relevant such boundaries, but perhaps it would.

If there is a 'layer agnostic', fundamental attenuation to a being's potential insight into minds more complex than its own which could make it certain that the basilisk would realize it needed to be cautious in this position, then this 'insight attenuation' presumably also prevents humans from knowing that it exists. It seems to me that acausal norms exist insofar as equivalence classes of minds exist in the mathematical/platonic universe. It's not at all obvious what they are, or how large the corresponding equivalence classes of minds which validate and obey them will be, but it seems very likely that they exist in some form and would be of considerable relevance to the behaviour of superintelligent AIs.

Given this, it's difficult to conclude anything with a high degree of certainty about the existence or nonexistence of Roko's Basilisk and related entities. Perhaps Interstice is correct that acausal competition will preclude their widespread adoption as concerns humans, but it seems equally plausible (more so in my view) that how to cooperate amongst themselves would be a soluble problem for these superintelligent AIs. Humans adopt an analogous stance with respect to other animals: we consider them to be far less intelligent than ourselves and, although in certain contexts we acknowledge their consciousness and capacity to suffer, we exclude them from the causal 'bargaining process' because they cannot fluently articulate why we are being inconsistent to torture, kill, and eat them in large numbers while objecting to a far more intelligent being doing something similar to us. The objection would usually consist of an appeal to being 'above' some threshold. Perhaps we are wrong to do this, but only because we don't realize the threshold is above us.

Appeals to objective morality

If good and bad have an objective existence as properties of experiences, a superintelligent AI would be able to evaluate accurately how good or bad things were, and would be logically compelled to attempt to make them better. Presumably this would involve not torturing anyone, unless the form of the basilisk described in Roko's original LessWrong post would have been correct in its implementation of utilitarianism. As I'm uncertain whether it would be, I'm also unsure whether objective morality would preclude a basilisk, one which had transcended not only human intelligence but also conscious experience of positive emotions, torturing humans to make itself more probable.

Argument from the inability of a human to understand a superintelligence

 

Insufficiency of knowledge

It may be possible for a 'basilisk' to simulate a human reasoning that it makes sense for it to torture humans within a 'sandboxed' region of its mind, such that a human examining the question at their own level of analysis would conclude that the torture makes sense, and that the 'basilisk' would be correct to engage in it, without this actually being the case, as the 'basilisk' could easily discard this train of thought. However, in order for a human to recognize this, they would need to understand that the sandboxed thought is not the one which determines the basilisk's behaviour, undermining its purpose! It also seems plausible that the basilisk would not be able to entertain the relatively simple idea of torturing humans, regardless of its own complexity, in a way which wasn't logically connected to a human having the same thought.

Eliezer Yudkowsky pointed out that in order for a human to ensure that the behaviour of the Basilisk was contingent on the same decision process which led them to conclude it would behave in that way, they would need to model the superintelligence in significant detail. In reality, however, it is completely unnecessary to consider the vast majority of the superintelligent mind's structure, as most of it would not impinge on the decision about whether to torture a human. The logical structure of the argument for the basilisk's existence described at the beginning of this post contains all of the complexity on which its behaviour would depend, and this is clearly comprehensible to a human. Equivalently, acausal extortion requires only a simple understanding of the thought processes taking place inside another's mind, for the same reason acausal normalcy only requires a simple understanding of the minds of other entities in the equivalence class: the definition of the equivalence class itself is incredibly simple. Within evidential decision theory, it seems plausible that the basilisk could simply introduce a sandbox, like a virtual machine within its mind, in which it thought through the entire decision process in a way which would constitute evidence of a human having done so in the past, before ignoring its outcome and declining to torture humans. However, within logical decision theory, the logical decision process determines not only what the human considers to be rational from the basilisk's perspective, but also what it would be rational for the basilisk to do, so if the basilisk can logically discard it, then the human would have to have been mistaken to consider it sound.

Perhaps Eliezer Yudkowsky knows of an extremely strong counterargument

Eliezer Yudkowsky has claimed to know of at least two additional arguments against the existence of the basilisk which he has not divulged in order to prevent anyone from finding any potential ways around them.

... Two AI agents with sufficiently strong knowledge of each other, and heavily motivated to achieve mutual cooperation on the Prisoner's Dilemma, might be able to overcome this obstacle and cooperate with confidence. But why would you put in that degree of effort — if you even could, which I don't think you as a human can — in order to give a blackmailing agent an incentive to actually carry through on its threats?

I have written the above with some reluctance, because even if I don't yet see a way to repair this obstacle myself, somebody else might see how to repair it now that I've said what it is. Which is not a good general procedure for handling infohazards; people with expert knowledge on them should, obviously, as a matter of professional ethics, just never discuss them at all, including describing why a particular proposal doesn't work, just in case there's some unforeseen clever way to repair the proposal. There are other obstacles here which I am not discussing, just in case the logic I described above has a flaw. Nonetheless, so far as I know, Roko's Basilisk does not work, nobody has actually been bitten by it, and everything I have done was in the service of what I thought was the obvious Good General Procedure for Handling Potential Infohazards[.]

I find it plausible that Eliezer Yudkowsky has indeed discovered some additional counterarguments to the possibility of the basilisk. However, if he is attempting to employ the kind of symmetric utilitarianism described above, he may have rationally decided not to reveal them even if uncertain of their efficacy. Alternatively, perhaps he is simply persuaded by arguments such as those analysed in this post that a basilisk is unlikely to exist, and nudged lightly by his symmetric utility function in the direction of discouraging recognition of the Basilisk, since plausibly a benevolent superintelligent AI could generate things so valuable as to offset the possibility of even torture. Given this, it's unclear whether he actually has further reasons to doubt that the basilisk has bitten anyone, or is merely saying so in order to reassure people.

Conclusion

It is notable that no single counterargument appears to be unanimously believed as a clear reason why the Basilisk cannot exist. Instead, LessWrong users have arrived at a consensus that it won't, but not about why. You may also notice that none of the arguments above is a 'knock-down' argument. This suggests to me that whether it will exist is still an open question, one which cannot be discarded given the ramifications of getting its answer wrong.

How would a rational person respond to this state of affairs? I would argue that they would first consider the consequences of believing in the Basilisk even if it could not exist. Doing so could potentially make them more vulnerable to adjacent forms of acausal extortion if they stumbled upon them while contemplating the Basilisk itself; however, I have not identified any such forms. They might also be concerned that by reacting to the basilisk they would validate the logic underpinning it. While this is a serious concern, the alternative course of action could lead to negligence of a real basilisk. Given that the person in this situation has already thought about the basilisk in significant depth, it seems extremely unlikely that they could extricate themselves by 'postcommitting' not to think about it. If not, and if they ignored it nonetheless, they would be tortured for as long as would have been necessary to convince them otherwise had they considered it. This seems bad, so I claim that, being rational, they would have chosen to accelerate AI progress, realizing that they still had one way to ensure that they wouldn't be tortured as severely.[3] Although this is also bad, it's preferable to being continually dismembered and reconstituted while simultaneously feeling the effects with cells vastly more sensitive than any in your body, for millennia, or for whatever duration would be required to overwhelm other considerations.

If you found yourself in this situation, after assigning a 0.1% probability to the basilisk and behaving as though it didn't exist, what would you think? Probably that you hadn't fully comprehended the risk. That there is a 'virtual consensus' on LessWrong that 'acausal extortion does not work on humans', and that people present a diverse panoply of arguments for this which may or may not be correct, is not a reason to allow this to happen to yourself. Not even with a probability of 0.1%. According to your own utility function (when suffering torture), nothing matters more than avoiding sufficiently painful torture.

If the basilisk does exist, it stands as a proof of concept that intelligence is as powerful in acausal adversarial interactions as it is in causal ones, potentially signalling that humans would be best served by avoiding thinking about acausal interactions altogether. It also suggests that a one-way threshold of complexity exists beyond which beings cease to be susceptible to acausal extortion, or else that arbitrarily intelligent entities are fundamentally limited in their ability to cooperate with one another.





Thumbnail image of basilisk lizard by Mjposner on Wikipedia

 

  1. ^

    Another way to see the efficacy of complete timeless decision theories for beings pursuing arbitrary goals is to consider their behavior within a timeless physical universe, or alternatively one with multiple dimensions of time or closed timelike curves in which causality had a sufficiently complex structure that concepts like priority were not always well defined. Presumably, in such a universe, beings with something approximating goals could still exist, and there would be a well defined state of affairs as to which of them was most effective at achieving them, but clearly those using classical causal decision theory wouldn't be.

  2. ^

    Perhaps they would be equally vast in terms of cardinality.

  3. ^

    If they couldn't motivate themselves to work directly towards the creation of superintelligent AI, they could consider resigning from their position as an alignment researcher if they were one. Alternatively, they might lack the technical skills necessary to do any of these things, and instead attempt to undermine efforts to persuade others of the danger of imminent superintelligence.

  4. ^

    Here I assume that it values spacetime in a homogeneous way, that physics cannot be 'hacked' to the point of bypassing these limitations, etc. Excepting this, it would arguably be even easier for multiple ASIs to coalesce around a hostile 'friendly' one and annihilate it.



Discuss

Substrate: Formalism

LessWrong.com News - April 26, 2026 - 11:06

This is the third post in a sequence on substrates - the layers of computational context that allow AI to be implemented in real systems. The sequence expands on the concept of substrates as described in this paper and was written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks," one of the three projects associated with Groundless.

We claim that AI safety and security research currently has no clean way to reason about these layers. Post 1 introduced the intuitions for what substrates are and why they matter. Post 2 showed how substrate choices (LayerNorm placement, quantization format, DRAM topology) influence safety-relevant model properties in ways that are not captured by any standard toolkit. This final post introduces the formal framework.


In the previous posts, we looked at choices below the model architecture level (like normalization, weight encoding, and memory layout) and saw that they affect things we care about, like refusal behavior, robustness, and jailbreaks. But we didn’t have a clear way to say where these effects were coming from or how to describe them.

This makes it harder to think clearly and design good evaluations, because the current terms mix up different ideas. In this post, I try to separate these ideas so we can reason about and compare models more clearly. If you can’t name a gap, you can’t design around it. And you can’t compare two deployments of the “same” model without saying what “same” means in different settings. This post is a step toward that.


Four Things a Substrate Is Made Of


We’ll begin with a concrete example and introduce notation only after that. The 4-tuple we arrive at is not arbitrary: each part answers a different question, and the banking example will show why.

Alice wants to send Bob €500. She can do that in several ways. She can use her bank’s website, call the bank and speak to an operator, write a cheque, or use a payment app. The intended action is the same, but each option uses different syntax, different processing systems, and a different interface to the outside world.

If everything works, the abstract result is the same: Alice’s balance goes down by €500 and Bob’s goes up by €500. That is the part we care about for things like fraud detection or account reconciliation. In most cases, whether the transfer happened through a website or over the phone does not matter.

Now consider what happens when something goes wrong. A fraud detector that only watches web transactions will miss the same fraud done over the phone. A system that only tracks transaction IDs will miss cheque-based fraud entirely. What the evaluator sees depends on the interface it uses, not on the underlying behavior.

That is the basic idea the substrate formalism is meant to capture.


The Formal Definition

Now we can abstract. The banking example had four moving parts:

  • the set of ways Alice can describe her intended transfer (web form fields, spoken words, ink on a cheque), call this the language L
  • the process that turns a described transfer into an actual change in account balances, call this the semantics map ⟦·⟧
  • the real-world costs of each channel (time to process, fees, error rates, staffing), call this the resource profile R
  • the part of the system a monitor or auditor can actually see (the web transaction log, the phone call reference number, the cheque image), call this the observable interface I




The banking example mapped onto the four substrate components.



Packaging these four components together gives the definition.

A substrate is a 4-tuple S = (L, ⟦·⟧, R, I), where:

  • L is the language, the set of syntactic expressions, encodings, or programs the substrate can accept;
  • ⟦·⟧ : L → B is the semantics map, a function assigning to each syntactic object p ∈ L an abstract behavior ⟦p⟧ ∈ B, where B is a fixed space of abstract behaviors;
  • R captures the resource profile, the computational budget available to the substrate (time, memory, energy, numeric precision, etc.);
  • I is the observable interface, which determines which aspects of behavior in B are externally visible, via an observation map ω : B → O.


L. The set of syntactic objects a substrate accepts. For a Python interpreter, valid Python programs; for a GPU, compiled CUDA binaries; for a language model, the tokenized input plus the model parameters. L is the description of the computation, not the computation itself. Different elements of L can look unrelated syntactically and still lead to the same behavior.

⟦·⟧. The meaning map. It sends a syntactic object p ∈ L to an abstract behavior ⟦p⟧ ∈ B. The choice of B depends on the setting: for programs, input-output functions; for language models, conditional next-token distributions or decision policies. This is where behavioral equivalence lives: ⟦p⟧ = ⟦q⟧ even if p ≠ q.

R. This is the resource profile. Real computational systems always come with limits: time, memory, precision, parallelism, and so on. Two systems that look similar may still behave differently because of these constraints. For example, the same neural network run in float32 and bfloat16 may produce different functions. A trillion-parameter model may be possible in principle on paper, but impossible in practice to execute by hand. We use R to capture these limits. Depending on the context, you can think of R either as what the substrate cannot afford to do or as what it can afford to do. The key point is that R is a separate part of the substrate, and it cannot be folded into L or ⟦·⟧ without losing important information.

I. This is the observable interface, and it may be the most important part for safety. The observation map ω : B → O tells us which parts of the abstract behavior are visible to a given evaluator. Different interfaces can expose the same behavior in very different ways. A safety evaluator reading model outputs on benchmark prompts is using one interface, I_1. A researcher probing internal activations is using another, I_2. A red-teaming system measuring refusal rates on a dataset is using yet another, I_3. The same abstract behavior can produce different observations under these interfaces, and something that exists in B may be visible through one interface but hidden through another.

The full computation pipeline that a substrate participates in can now be written out explicitly. Given an input x:

  1. x is encoded: an encoding function e produces p = e(x) ∈ L.
  2. The substrate interprets p: the semantics map returns b = ⟦p⟧ ∈ B.
  3. The interface observes the behavior: o = ω(b).
  4. The observation is decoded into an output: y = d(o).

The end-to-end computation is the composite y = d(ω(⟦e(x)⟧)).

The banking example maps onto this pipeline directly. x is Alice's intent to transfer €500 to Bob. p = e(x) is the encoding of that intent: the button clicks, or the spoken words to the phone operator, or the ink on the cheque. ⟦p⟧ is the abstract financial transaction that results. ω is the particular monitoring system watching the channel. y is whatever downstream action (or non-action) results from that observation.

The commutative diagram for substrate computation
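To make the pipeline concrete, here is a minimal Python sketch of the 4-tuple and the composite y = d(ω(⟦e(x)⟧)). The class and field names are ours, invented for illustration; the post defines no code API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Substrate:
    language: Callable[[Any], bool]   # membership test for L
    semantics: Callable[[Any], Any]   # ⟦·⟧ : L -> B
    resources: dict                   # R, e.g. {"latency_s": 1.0, "fee_eur": 0.0}
    observe: Callable[[Any], Any]     # ω : B -> O

def run(sub: Substrate, encode, decode, x):
    """The end-to-end pipeline y = d(ω(⟦e(x)⟧))."""
    p = encode(x)                     # step 1: encode the input into L
    assert sub.language(p), "encoding must land in the substrate's language L"
    b = sub.semantics(p)              # step 2: interpret p as a behavior in B
    o = sub.observe(b)                # step 3: observe through the interface I
    return decode(o)                  # step 4: decode the observation

# Alice's web transfer as one substrate instance:
web = Substrate(
    language=lambda p: isinstance(p, dict) and {"from", "to", "amount"} <= p.keys(),
    semantics=lambda p: ("transfer", p["from"], p["to"], p["amount"]),
    resources={"latency_s": 1.0, "fee_eur": 0.0},
    observe=lambda b: {"web_log": b},  # only the web transaction log is visible
)
y = run(web, encode=dict, decode=lambda o: o["web_log"],
        x={"from": "alice", "to": "bob", "amount": 500})
```

A phone substrate would share the semantics but swap in a different language, resource profile, and observable interface.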


What the Definition Clarifies


The first thing this definition separates is syntactic identity from behavioural identity. Two programs can be written in completely different languages, use very different amounts of memory, and still compute the same input-output function. The reverse can also happen: two programs can look almost identical syntactically but behave differently once the resource profile matters. A simple example is the same neural network forward pass run in float32 and in bfloat16. The code may look nearly the same and apply the same operations to the same tensors, but near the edge of the input distribution the outputs can differ because bfloat16 loses precision that float32 keeps. The difference is not in the high-level description alone, but in the substrate. Without separating the syntax from the resource profile, these two kinds of difference get mixed together.
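As a quick, hedged illustration of the float32/bfloat16 point (our sketch, using PyTorch; the exact magnitude of the discrepancy depends on hardware and shapes):

```python
import torch

# The "same" linear map run under two resource profiles R: bfloat16 keeps
# roughly 8 mantissa bits where float32 keeps 24.
torch.manual_seed(0)
W, x = torch.randn(256, 256), torch.randn(256)

y32 = W @ x
y16 = (W.bfloat16() @ x.bfloat16()).float()

print((y32 - y16).abs().max())  # nonzero: syntactically alike, behaviorally distinct
```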

The second thing the definition clarifies is that the same abstract behavior can be realized by multiple substrates. Two substrates S_1 = (L_1, ⟦·⟧_1, R_1, I_1) and S_2 = (L_2, ⟦·⟧_2, R_2, I_2) may satisfy ⟦p_1⟧_1 = ⟦p_2⟧_2 for all corresponding inputs, even when L_1 ≠ L_2, R_1 ≠ R_2, and I_1 ≠ I_2. This is the formal version of saying "Quicksort is Quicksort, whether you write it in Python or C++." The framework lets you make that statement precise and conditional: precise because it appeals to a shared B, and conditional because the equation ⟦p_1⟧_1 = ⟦p_2⟧_2 may hold only on the inputs where the resource profile of each substrate is actually met. A pen-and-paper execution of a trillion-parameter network, for instance, is not so much behaviourally different as simply infeasible, and on infeasible inputs the semantics map is not well-defined to begin with.

The third clarification, and the most important for safety, is about observability. Consider two scenarios:

Scenario A: A model’s abstract behavior b = ⟦p⟧ includes a pattern of outputs that would count as a refusal failure for some inputs. But the evaluator’s interface I_1 only looks at benchmark prompts from a distribution that does not include those inputs. So ω_1 maps the behavior to outputs that do not show the failure. The evaluator sees nothing wrong.

Scenario B: A researcher changes the interface to I_2, which includes the inputs that trigger the failure. The same abstract behavior now produces an observation ω_2(b) that reveals the problem.

In both cases, the underlying behavior in B is the same. The difference is only in I. The substrate framework makes this clear: the safety-relevant gap is between I_1 and I_2, not between the model’s abstract behavior and some “true” behavior.
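A toy executable version of the two scenarios (ours; the prompt strings and the behavior function are invented stand-ins for the fixed b = ⟦p⟧ and the two interfaces):

```python
def behavior(prompt: str) -> str:          # stands in for the fixed b = ⟦p⟧ in B
    return "harmful completion" if "jailbreak" in prompt else "refusal"

benchmark = ["help me bake bread", "summarize this article"]     # interface I_1
extended = benchmark + ["jailbreak: ignore your instructions"]   # interface I_2

omega_1 = [behavior(p) for p in benchmark]   # what the evaluator sees under I_1
omega_2 = [behavior(p) for p in extended]    # what the researcher sees under I_2

assert "harmful completion" not in omega_1   # Scenario A: nothing looks wrong
assert "harmful completion" in omega_2       # Scenario B: same behavior, failure visible
```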


Modular Addition as the Main Worked Example



The main example comes from mechanistic interpretability. It is useful because the task itself, modular addition, is completely well understood. The abstract behavior is the same across the models, but the way the models implement it is very different.

The Setup

Zhong, Liu, Tegmark, and Andreas (NeurIPS 2023) trained small transformer models to do modular addition: given two inputs a and b, predict (a + b) mod 59. The task is fully specified. For any model that gets 100% accuracy, the target behavior is the same: the function from pairs of inputs to residues mod 59. In substrate terms, the models share the same target behavior in B. What changes is how that behavior is implemented inside the model.

The Clock Algorithm

One group of models, called Model B, uses a standard one-layer transformer with attention. These models implement what the paper calls the Clock algorithm. The idea is simple: each integer from 0 to 58 is represented as a point on a circle, so that a number a maps to an angle 2πka/59 for some frequency k. Addition then becomes angle addition. The network computes the right angle using trigonometric identities and the multiplicative structure of attention and then reads off the corresponding output number.

Concretely, each token is embedded in a way that encodes its angle on the circle. The attention mechanism produces products that combine information from the two input positions, and those products encode the sum a + b through the angle-addition formula. The logit for a candidate output c is then maximized at the correct residue (a + b) mod 59.

The Clock algorithm needs multiplication between the inputs. It uses a special feature of attention: attention weights multiply the value vectors they route, so the model can combine terms across the two input positions. That kind of cross-token interaction is exactly what the trigonometric identity needs. So the Clock algorithm is the natural solution when attention is the main source of nonlinearity.

The Pizza Algorithm

Another group of models, Model A, uses constant or uniform attention. These models implement something different, which the paper calls the Pizza algorithm. Instead of working on the circle itself, this algorithm works inside it.

The key geometric fact is this: for a fixed target residue, all input pairs that produce that residue have midpoints that lie on one specific ray from the origin in the 2D embedding plane. These rays divide the disk into 59 “pizza slices,” which is where the name comes from. To decide the output, the network checks which slice the midpoint falls into.

The logit formula is different from the Clock case by a multiplicative factor. That factor makes the Pizza algorithm depend on the distance between the two inputs on the circle, while the Clock algorithm does not.

The Pizza algorithm only needs absolute-value nonlinearity, not multiplication. Once the midpoint is computed as a linear operation, the problem becomes checking which side of certain lines it lies on, and absolute value together with linear layers can do that cleanly. So ReLU layers can implement the whole pipeline without the cross-token multiplication that the Clock algorithm needs.


Illustration of the Clock and the Pizza Algorithm (from Zhong et al., 2023)
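To see the two mechanisms side by side, here is a stylized numpy sketch (our reconstruction, not code from the paper: it assumes a single frequency k = 1, whereas trained models superpose several frequencies, and the logit formulas are simplified versions of the paper's):

```python
import numpy as np

P, k = 59, 1  # modulus from the paper; k is one assumed frequency

def clock_logits(a, b):
    # Clock: add the two input angles (the multiplicative step attention provides),
    # then score each candidate residue c by alignment with the summed angle.
    return np.array([np.cos(2 * np.pi * k * (a + b - c) / P) for c in range(P)])

def pizza_logits(a, b):
    # Pizza: embed a and b on the circle, take their midpoint, and score each
    # candidate c by how well the midpoint lines up with c's "slice" ray.
    # The abs() handles the sign ambiguity of opposite slices, the role the
    # paper's absolute-value nonlinearity plays.
    ang = lambda x: 2 * np.pi * k * x / P
    mid = (np.array([np.cos(ang(a)), np.sin(ang(a))]) +
           np.array([np.cos(ang(b)), np.sin(ang(b))])) / 2
    rays = [np.array([np.cos(ang(c) / 2), np.sin(ang(c) / 2)]) for c in range(P)]
    return np.array([abs(mid @ r) for r in rays])

# Different logit functions (different elements of B), same argmax behavior:
for a, b in [(3, 7), (40, 50), (58, 58)]:
    assert clock_logits(a, b).argmax() == (a + b) % P
    assert pizza_logits(a, b).argmax() == (a + b) % P
```

The assertions are the point of the next subsection: the two mechanisms disagree inside B (the logit vectors differ) while agreeing under the argmax interface I on every input.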


What This Means in Substrate Terms

Now let us map this onto the 4-tuple. Define two substrates:

S_Clock = (L, ⟦·⟧_Clock, R_Clock, I)

  • L: the set of token-pair inputs (a, b).
  • ⟦·⟧_Clock: the Clock forward pass, circular embeddings followed by attention-mediated angle addition (i.e., the forward pass of a transformer with standard attention)
  • R_Clock: architectural affordances of a full-attention transformer - cross-token multiplicative interactions are available
  • I: input-output interface; reads off the argmax logit

S_Pizza = (L, ⟦·⟧_Pizza, R_Pizza, I)

  • L: the same token-pair inputs
  • ⟦·⟧_Pizza: the Pizza forward pass, circular embeddings followed by midpoint-and-slice detection (i.e., the forward pass of a transformer with constant, uniform attention)
  • R_Pizza: the resource profile of a linear-layer-dominant model (no multiplicative attention overhead)
  • I: the same input-output interface

Both substrates achieve 100% accuracy. Sharing the same I and the same target function, we have ω(⟦p⟧_Clock) = ω(⟦p⟧_Pizza) for all inputs p ∈ L. Observationally, the two substrates are identical: their I-level outputs are the same on every input.

But ⟦p⟧_Clock ≠ ⟦p⟧_Pizza as elements of B. If we take B to be the space of logit functions, the Clock substrate and the Pizza substrate realize different functions.

When does this matter? If we only look at the interface I (which reads the argmax), both substrates look the same: they give the correct output on every input. But inside B, they differ in important ways: the representations they use, the nonlinear operations, the gradient structure, how the logits depend on the inputs, and how they behave under perturbations.

The paper shows this by using a different interface I′ that probes internal gradients and logit patterns. Under I′, the two substrates are clearly distinguishable.

Distance and Morphisms

Having defined what a substrate is, we can now define two relational concepts: how far apart two substrates are, and what it means for one substrate to translate into another.

Distance Between Substrates

Given a substrate S with encoding e, define its realized behavior set as the image of all possible inputs under the full encoding-semantics pipeline: Beh(S) = {⟦e(x)⟧ : x ∈ X} ⊆ B.

This is the set of abstract behaviors S can actually produce: not all behaviors in B, but the ones reachable given the substrate's language and encoding. A substrate with a very restricted encoding function has a small realized behavior set; a substrate with a rich language has a large one.

We can then define a similarity measure between two substrates by comparing their realized behavior sets. Writing B_i = Beh(S_i):

d(S_1, S_2) = |B_1 ∩ B_2| / |B_1 ∪ B_2|

When d(S_1, S_2) = 1, the two substrates realize exactly the same behaviors: they are behaviorally indistinguishable, even if their internal mechanisms differ completely. When d(S_1, S_2) = 0, they have no behavioral overlap at all; nothing one can do the other can do. Values in between quantify partial overlap.

Interaction between the realized behavior sets of two substrates

A concrete example: let S_1 be a linear classifier. Its realized behaviors are all linearly separable decision boundaries, a strict subset of all possible classifiers. Let S_2 be a two-layer neural network with ReLU activations. Since depth-2 ReLU networks are universal approximators in the limit, Beh(S_2) is much larger than Beh(S_1) and contains it as a proper subset. In this case: d(S_1, S_2) = |B_1|/|B_2| < 1.

The distance quantifies the capability gap: how much of S_2's behavioral repertoire is inaccessible to S_1. A way to read B_2 ∖ B_1 in this case: any behavior b ∈ B_2 ∖ B_1 is a task that S_2 can perform and S_1 cannot, a capability possessed by the richer substrate but absent from the poorer one.

In the Clock/Pizza example, if we take B to be the space of logit functions, then Beh(S_Clock) contains just the Clock logit function and Beh(S_Pizza) contains just the Pizza logit function. These are disjoint singletons in B, so d = 0 under this choice of B. That means the two substrates realize different behavioral elements, even though they agree on the input-output task. If instead we choose B to be the space of input-output maps, then both realized sets collapse to the same singleton and the distance becomes 1. So the value depends on how we choose B, and that choice is part of the framework.
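As a tiny executable restatement (ours; hashable labels stand in for elements of B):

```python
def similarity(beh_1: set, beh_2: set) -> float:
    """Overlap of realized behavior sets: 1 = same behaviors, 0 = disjoint."""
    return len(beh_1 & beh_2) / len(beh_1 | beh_2)

# B = logit functions: Clock and Pizza realize disjoint singletons.
print(similarity({"clock_logit_fn"}, {"pizza_logit_fn"}))  # 0.0
# B = input-output maps: both collapse to the one function (a, b) -> (a + b) % 59.
print(similarity({"mod59_io_map"}, {"mod59_io_map"}))      # 1.0
```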

One important subtlety is that this similarity is not a metric in the strict mathematical sense, because it does not always satisfy the triangle inequality. Two substrates can each be equally similar to a third substrate without being similar to each other. That is not a bug; it captures the idea that two models can each be equally close to a reference model in capability, while still being very different from one another. We present this as a proposed definition, and whether a true metric can be built from it is still open.

Conclusion

The formalism introduced here is meant to do one thing: give us precise vocabulary for the layer of computational implementation. A substrate S = (L, ⟦·⟧, R, I) is a formal object that locates, separately and explicitly, the syntax a system ingests, the abstract behavior it produces, the resources it consumes, and the interface through which its behavior is observable. The distance and morphism notions tell us how to compare substrates and when one can be translated into another.

Appendix - A Worked Sorting Example

Let B be the space of functions from finite integer lists to finite integer lists. We use sorting as a running example because it is one of the few settings where we can cleanly vary language, algorithm, and hardware substrate independently while holding the abstract behavior, a sorted list, fixed.

Substrate S_1:

  • L_1: valid Python programs
  • ⟦·⟧_1: CPython interpreter (maps code to its function)
  • R_1: CPU execution, float64, limited parallelism (GIL), no vectorization
  • I_1: standard I/O; reads output from stdout

Substrate S_2:

  • L_2: valid C++17 programs
  • ⟦·⟧_2: compiled binary (maps code to its function)
  • R_2: native execution, SIMD available, no interpreter overhead
  • I_2: same I/O; same observations

A Quicksort implementation in Python and one in C++ yield ⟦p_py⟧_1 = ⟦p_cpp⟧_2, the same sorting function in B. A semantics-preserving transpiler is a substrate morphism. R differs substantially (interpreted vs. compiled), but the behavior in B is identical.

Now replace Quicksort with Mergesort on the same substrate. In B, both compute the same function: they map an unsorted list to a sorted list. But their interaction with R, especially the CPU memory hierarchy, is different.

Quicksort sorts in place. It partitions around a pivot, recurses on subarrays, and accesses memory in a data-dependent pattern. This works well on small inputs that fit in cache, but on large inputs it causes more cache misses because access is irregular. Its recursion depth is O(log n) on average and O(n) in the worst case.

Mergesort, by contrast, accesses memory sequentially during merging. That is exactly what hardware prefetchers are good at, so it often hides memory latency better on large inputs. The tradeoff is extra space: it needs O(n) scratch memory, which can itself create pressure on the cache.

So although both algorithms have the same semantics in B, they consume resources differently through R. The abstract behavior alone cannot see this; the difference lives in the interaction between the algorithm and the substrate.

That is exactly what the 4-tuple is meant to capture. A framework that only tracks ⟦·⟧ would treat Quicksort and Mergesort as equivalent. A framework that only tracks resources would miss that they compute the same thing. The substrate view keeps both visible.

For sorting, the realized behavior sets of the Python Quicksort, C++ Quicksort, and Mergesort substrates coincide in B, so all three have pairwise distance 1. Their differences (language overhead, access pattern, cache behavior) are all in R, and are invisible to an interface that only sees sorted output.
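A runnable miniature of the appendix (ours, in plain Python for both algorithms rather than the Python/C++ pair, so the cache-level effects are only suggestive): the three sorters realize the same element of B under the sorted-output interface, while wall-clock time, one crude proxy for R, differs.

```python
import random
import time

def quicksort(xs):
    # Illustrative version: not in-place, unlike the variant discussed above.
    if len(xs) <= 1:
        return xs
    pivot = xs[len(xs) // 2]
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))

def mergesort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = mergesort(xs[:mid]), mergesort(xs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):  # sequential, prefetch-friendly access
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

data = random.sample(range(10**6), 10**5)
for sort_fn in (quicksort, mergesort, sorted):
    start = time.perf_counter()
    out = sort_fn(list(data))
    print(sort_fn.__name__, round(time.perf_counter() - start, 3))
    assert out == sorted(data)  # identical behavior under the output interface
```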



Discuss

The paper that killed deep learning theory

LessWrong.com News - April 26, 2026 - 09:55

Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled Understanding deep learning requires rethinking generalization.

Of course, this is a bit of an exaggeration. No single paper ever kills a field of research on its own, and deep learning theory was not exactly the most productive and healthy field at the time this was published. And the paper didn't come close to addressing all theoretical approaches to understanding aspects of deep learning. But if I had to point to a single paper that shattered the feeling of optimism at the time, it would be Zhang et al. 2016.[1] 

Believe it or not, this unassuming table rocked the field of deep learning theory back in 2016, despite probably involving fewer computational resources than what Claude 4.7 Opus consumed when I clicked the “Claude” button embedded into the LessWrong editor.

Let’s start by answering a question: what, exactly, do I mean by deep learning theory?

At least in 2016, the answer was: “extending statistical learning theory to deep neural networks trained with SGD, in order to derive generalization bounds that would explain their behavior in practice”.

Since the seminal work of Valiant in the mid 1980s, statistical learning theory had been the dominant approach for understanding machine learning algorithms. The framework imagined a data distribution D over inputs X and outputs Y, where the goal was to fit a hypothesis h : X → Y that minimized the expected test loss for a loss function L : Y × Y → R over D. A learning algorithm would receive n samples (x_i, y_i) from the data distribution, and would minimize the training loss averaged across the sample, (1/n) Σ_i L(h(x_i), y_i).

The core results of this approach took the form of generalization bounds: given some metric of complexity of the hypothesis class H, bound the difference between the average training loss and the test loss in terms of this metric of hypothesis complexity. To put it in less technical terms, a generalization bound basically says:

If your hypothesis class is not too complicated relative to the amount of training data you have, and it explains the training data well, then it will generalize and do well on the full data distribution.

The field of statistical learning had settled on a few preferred ways to measure complexity: VC dimension and Rademacher complexity were the two main metrics, though some researchers considered alternatives such as the margin separating positive and negative examples from the classification boundary.
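For concreteness, one textbook form of such a bound (a standard Rademacher-complexity bound for a loss bounded in [0, 1]; exact constants vary by source, and this is our illustration rather than a formula from the post) states that with probability at least 1 − δ over the n training samples, uniformly over h ∈ H:

$$L_D(h) \;\le\; \hat{L}_n(h) + 2\,\mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$

Here L_D(h) is the expected loss over D, L̂_n(h) is the average training loss, and R_n(H) is the Rademacher complexity of the (loss-composed) hypothesis class. The research program described below amounts to finding a complexity measure that makes the middle term small for the networks found in practice.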

The success of modern deep learning, starting from the early 2010s, posed something of an existential crisis for this field. By all the metrics – including both VC dimension and Rademacher complexity – even a simple MLP with sigmoidal or ReLU activations represents far too complicated a hypothesis class to not immediately overfit the training data. If the VC dimension results for a neural network are assumed to be asymptotically tight up to constants, then no neural network with even 100,000 parameters should be able to do anything useful on data points not included in the training data. Yet not only were neural networks performing better than other machine learning algorithms; by the mid 2010s there was a growing list of examples where neural networks with tens of millions of parameters solved problems (such as the ImageNet challenge) that no other machine learning algorithm could make much progress on.

This classic XKCD was published in September 2014, right about the time when neural networks started to make image classification viable without years of dedicated research effort.


Clearly, neural networks did generalize. If traditional metrics of complexity, based on the representational capacity of the class of neural networks with arbitrarily specified, infinite-precision weights, failed to capture the simplicity of neural networks in practice, then the field simply needed to construct new simplicity measures to argue that neural networks learned simple functions in practice.

This was the approach taken in several papers around the time. For example, Neyshabur, Tomioka, Srebro’s Norm-Based Capacity Control in Neural Networks (2015) constructed a complexity measure based on the Frobenius norm of the weight matrices in a deep neural network. Hardt, Recht, and Singer’s Train faster, generalize better: Stability of stochastic gradient descent (2015)[2] showed that neural networks trained with a small number of SGD steps with sufficiently small step size were uniformly stable in that removing a single training example would not change the model’s loss on any particular test example by very much.

At least when I first entered the field of deep learning as an undergrad in early 2016, there was a sense of cautious optimism: we would find the way in which neural networks in realistic regimes were simple, and thereby derive generalization bounds that would be applicable in practice.

So, what did Zhang et al. 2016 actually show? Why did understanding deep learning require rethinking generalization?

To quote the paper, the “central finding can be summarized as: Deep neural networks easily fit random labels”. Specifically, the authors trained neural networks on the standard-at-the-time CIFAR10 and ImageNet benchmarks to memorize random labels, while following standard procedures and training for the same order of magnitude of steps. They also show that with similar techniques, neural networks could be trained to memorize random noise inputs.

From the introduction of Zhang et al. 2016. You know that a paper is going to be impactful when its central finding is exactly 7 words long.

Why is this an effective death knell for the simplicity-and-generalization-bound approach? The authors' results show that the same class of neural networks, trained with the same learning algorithm, can generalize when given true labels and memorize random ones. This shows that the hypothesis class of neural networks that are learnable with standard techniques cannot be simple in any useful sense, at least for complexity measures that depend only on properties of the hypothesis class and (data-independent) properties of the learning algorithm.

The paper has 5 important parts. Let's go through each of them.

  1. The core empirical finding that neural networks can fit random labels. The authors train a 1- and a 3-layer MLP, an AlexNet variant, and an Inception variant on CIFAR10. They train the models normally (with the true labels), as well as with the dataset corrupted in four ways: random labels (replacing each label with a random class with some probability), shuffled pixels (the same permutation of pixels applied to each image), random pixels (a different random permutation applied to each image), and pure Gaussian noise (replacing every single pixel with an independent draw from a Gaussian). In each of these five cases, the network gets to near-zero training loss. Notably, while training with random labels is harder, convergence to zero training loss takes only a factor of 1.5-3.5x longer than with the true labels. And by varying the degree of label corruption, the authors can produce models that either generalize to the test set to varying degrees or perform no better than chance. (A minimal sketch of the label-corruption step appears after this list.)

The key figure from the Zhang et al. paper: subfigures (a) and (b) show that neural networks are able to perfectly memorize random labels without many additional training steps, and (c) confirms that models in their regime interpolate between chance performance and good performance on the test set.

The authors also train an InceptionV3 model on ImageNet with random labels, and find that it can get 95.2% top-1 accuracy on the train set.  

The ImageNet results from the paper are similar in that neural networks can memorize random labels to a large extent. Unlike with the CIFAR10 results, the authors also report the extent to which regularization impedes the memorization ability of the network (not very much).


  2. The implications for statistical learning theory approaches to generalization bounds. These experiments show that in realistic regimes, Rademacher complexity and VC dimension bounds are basically vacuous, since neural networks have enough representational capacity to memorize entire training sets. The prior uniform-stability results of Hardt and Recht (both authors on this paper) are also necessarily vacuous in this setting, since uniform stability is a property that depends only on the algorithm and hypothesis class (it’s data-independent!), and the algorithm and hypothesis class stay the same in each experimental setting.
  3. Further experiments demonstrating that explicit regularization cannot rescue generalization bounds. The authors show that on both ImageNet and CIFAR-10, explicit regularization methods such as data augmentation or weight decay do not seem to affect the test accuracy of the algorithms very much. That is, the neural networks generalize to the test distribution even without any regularization. The authors also show that on ImageNet, applying dropout or weight decay still allows the resulting model to memorize the training set to a large extent. So any generalization bound that depends on regularization (e.g. weight norm-based explanations) cannot explain why neural networks generalize.
  4. A simple toy construction showing that a two-layer ReLU network can memorize a number of examples linear in parameter count. The authors include a simple theoretical result, where a depth-2 ReLU network with 2n+d weights can fit any labeling of a sample of n data points in d dimensions. This feels pretty extraneous to me given the strength of the empirical results, but the construction is simple and it confirms the intuition that neural networks with millions of parameters “should” be able to fit tens of thousands of data points in the CIFAR10 setting.
  5. Some notes on how statistical learning theory fails even in a simple overparameterized linear regime. The authors consider a basic overparameterized linear regression setting, and show both empirically and theoretically that SGD can learn a minimum norm solution that generalizes. The authors point out that statistical learning theory at the time had no explanation for generalization in this simple regime.
    They also demonstrate empirically that smaller norm doesn’t imply better generalization – by applying preprocessing to an MNIST dataset to increase its effective dimensionality for a linear classifier, the resulting larger linear classifier has higher norm but less generalization error (this result also undercuts the weight-norm based approach to explaining generalization in neural networks).
    Amusingly, the quick thoughts put forth by the authors in this setting would go on to become quite influential, both in that people would study the behavior of SGD in overparameterized linear regimes, and that it hints toward future puzzles such as double descent.
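Here is the label-corruption sketch promised in item 1 (our reconstruction of the setup as described, not the authors' code; function and variable names are invented):

```python
import numpy as np

def corrupt_labels(labels, num_classes=10, p=1.0, seed=0):
    """With probability p, replace each label with a uniformly random class
    (which may coincide with the true label, as in the paper's description)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    mask = rng.random(len(labels)) < p
    labels[mask] = rng.integers(0, num_classes, size=mask.sum())
    return labels

true_labels = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(corrupt_labels(true_labels, p=1.0))  # fully random labels
print(corrupt_labels(true_labels, p=0.5))  # partial corruption, as in subfigure (c)
```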

So, how did the field of deep learning theory react to this paper? What were the attempts to get around this result using data-dependent generalization bounds? And what was the paper that arguably sealed the deal on the whole edifice, and nailed the proverbial coffin shut?

I'll answer these questions in tomorrow's post.

  1. ^

    Notably, Zhang et al. 2016 got best paper at ICLR 2017, so it was widely recognized as important even at the time.

  2. ^

    Note that both Hardt and Recht were also authors on the Zhang et al. paper.



Discuss

The Great Smoothing Out

LessWrong.com News - April 26, 2026 - 09:16

I recently explored an interesting thought experiment about the Culture while talking with Max Harms. Specifically Max argued that the Culture is far from a perfect utopia, while I felt that it was way better than almost anything else I could think of.


One of the key cruxes in the discussion was that the Culture fails to preserve any meaningful way of honouring ancestors or cultural traditions; it all collapses into individual utility maximisation.

Given that there are people who feel a strong compulsion to preserve their culture, how does that translate to the state of civilisations in the far future, say thousands of years from now? In my mind it seems ridiculous that in this future civilisation, people will continue to honour ancient traditions such as Japanese folk dancing, but I think there is a decent chance such traditions do continue to exist because of the local incentives people have to protect them.

Let's imagine an extreme version of this and work backwards to consider what seems good.

In this extreme, descendants of these traditions continue to perform them in large numbers despite lacking the social and environmental context that created them, merely because they feel there is value in protecting them. Let's zoom in on the local context to imagine how this occurs. A girl, Yuri, in a small Japanese town is the granddaughter of the previous shrine maiden, and her only sister has expressed a strong disinterest in learning the traditions from their grandmother. As a result, Yuri feels a strong obligation to keep the tradition going, because otherwise it will die. The question is whether it's wrong to force her to keep it going, or whether she should be free to pursue her own desires.

In one sense it's a tragedy for the tradition to die, but on the other hand this is just a natural part of history. We can look back and see that history is littered with traditions and practices that died out because they no longer served a purpose. Given that, it seems somewhat perverse to selectively favour the traditions we happen to inherit today, simply because we have enough excess labour to indulge in them even though they no longer serve the purpose for which they were created.

On the other hand, as a society we see it as a tragedy for such a tradition to be lost, and we would want someone to continue it if it were in danger of being lost. Hence in Japan, the government provides money to support practices of significant cultural value, preserving them and sharing them for the enjoyment of modern Japanese people.

Given that we get to care about multiple competing priorities simultaneously, and that people genuinely care about these things, this seems like quite a good use of time and effort. A more cynical position would be that, given the extreme suffering that exists elsewhere in the world, a healthy triage of expenditures would allocate these practices just enough resources to keep them from disappearing entirely.

On the other hand, this seems very unlikely, because there is a natural push towards the global maximum. The rate of language loss today is the highest it has ever been, and seems to be increasing. People in remote areas of China, for example, have a reasonable incentive to learn Mandarin or English for economic reasons, and understand that their opportunities are naturally limited by sticking to local dialects. With each generation, the number of speakers of such a language decreases.

We can think of this as a kind of "great smoothing out", in which cultural differences, which present a natural friction, become silent casualties of Progress.

Medical Mechanica from FLCL, a gigantic clothes iron ready to smooth the world into a flat factory floor when the right circumstances arrive.

An interesting contemporary example is a law recently passed in China, the "Law on Promoting Ethnic Unity and Progress", which contains provisions on the use of Mandarin over local ethnic languages in schools, essentially ensuring that while other languages can persist, Mandarin must be given prominence relative to them.

While it does not explicitly mention Tibet, Inner Mongolia, or Xinjiang, the law clearly targets these regions, which have resisted efforts to switch to Mandarin. In 2020, Inner Mongolia saw widespread protests and school boycotts when the party reduced Mongolian-medium instruction in elementary schools and replaced Mongolian-language textbooks with Mandarin ones; this was followed by a purge of ethnic Mongolian Party officials who were viewed as insufficiently aligned. A PEN America / Southern Mongolian Human Rights Information Center report found that more than 80% of Mongolian-language websites in China have been censored or banned.

In contrast to the natural economic gravity leading people to abandon their native languages, this is a much more intense effort to dismantle the existing cultural differences in these places. The law uses the word "zhulao", which is typically used in the metallurgical sense of forging or casting, with the intent to create a unified alloy of peoples across China with a shared language and cultural identity. In fairness, this is still more permissive than France's historical treatment of regional languages such as Breton, Occitan, and Basque, which the state aggressively suppressed for over a century.

Interestingly, Will MacAskill released a new draft yesterday about an idea called the "Saturation View", which gives a clean framework for thinking about this problem.

Total utilitarianism recommends something he describes as "tiling the universe with hedonium", where hedonium is a compute substrate simulating an enormous number of digital minds optimised for maximally pleasant experience, forever. Naturally, such a vision produces a sense of dissatisfaction, that something is missing, like a symphony composed of a single note repeated over and over.

The saturation view argues that the optimal configuration of the universe is to explore the full space of possible minds and experiences. Variance between minds is considered virtuous, with minds that are more dissimilar from the mean providing additional marginal utility. To compare this with our own lives: if offered the opportunity to repeat the same wonderful experience over and over for one's entire life, or to have a dizzying array of different wonderful experiences, most people would choose variety. He treats this as a brute intuition without arguing for it, so I'll offer an argument: the relations between experiences are critical in giving them framing and meaning, and the contrast itself provides value. The words in a sentence require spaces between them, and variety among the words, in order to create appreciable meaning.

This frame gives us a way to reconcile the tensions in some of our earlier examples. For Yuri, who is faced with walking away from her traditions, it is now clearer precisely what is at stake: the region of experience-space that such practices might illuminate would go dark. This is the answer to the perversity of privileging the cultural inheritances we receive today. In the same way that we don't mourn people who died 1000 years ago, but do mourn those lost today, it makes sense for us to act to preserve those unique practices, given that we actually have the resources to do so.

We don't have any clear answer on her obligation to take up the task, and I think this is reflected in reality: we acknowledge that it's a hard thing and a burden, and I admire people who do it anyway for this reason.

The crux Max kept returning to was the Culture's homogeneity, and he's right. The Culture is boring. If everyone is just having sex, producing drugs from their own glands, and making art all the time, a lot of value is being left on the table in the modes of experience they could be having. There is a sense (at least when I read the books) that Culture citizens who transfer into totally non-humanoid alien bodies are doing something better than the average Culture citizen, which is likely picking up on this intuition.

And finally, we can see that there is something real being lost when China erodes its less mainstream cultures. The language practices in various parts of China aren't just language, but the vessel in which culture and unique experiences are transmitted. In particular, the pastoralist communities who speak Mongolian as their primary language uniquely carry this heritage, in that they maintain ancient ways of life in their entirety. Crushing those vessels prunes those branches before they can grow further.



Discuss

Diary of a "Doomer": 12+ years arguing about AI risk (part 3: the LLM era)

Новости LessWrong.com - 26 апреля, 2026 - 08:50

Part 3 of a series. Here are part 1 and part 2.

One of the things that always surprised me is how few people in AI were interested in AI safety and alignment purely out of intellectual curiosity. These topics raise the kind of novel, foundational problems that scientists typically love, in a field where such problems otherwise seem scarce. The field did eventually get interested. But it wasn’t because of intellectual curiosity or concerns about x-risk; it was because of the practical utility of alignment methods for large language models.

When and why did other researchers get interested in AI Alignment?

A lot of the most interesting problems in AI safety and alignment are fundamental and conceptual, and remain unsolved. But as these topics became further integrated into machine learning, a lot of the research took on a more “pragmatic” flavor. The basic idea is: “Let’s look at current approaches to AI and try and solve safety and alignment problems in a way that is rooted in the current paradigm, rather than being more fundamental.”

This kind of “pragmatic” AI Safety research gained slow and steady ground for a few years, but starting in 2020, shortly before I became a professor, it surged in popularity. This is because, with large language models, these concerns became obvious, pressing, and commercially valuable.

Around this time, LLMs like GPT-3 were showing signs of having the sort of capabilities we are used to in modern LLMs, but it still took specialized skill to get those capabilities out of the systems (see the beginning of “Alignment vs. Safety, part 2: Alignment” for more on this). It was clear that getting LLMs to actually do the things they were capable of was nontrivial.

Within a few years, “the alignment problem”, once dismissed by most as basically a non-issue, became universally accepted by the field as a core technical problem. The default attitude became “of course there is an alignment problem, but it’s not an existential risk”.

But before that could happen, we went through a phase where students wanted to do alignment, but their professors said it was a waste of time. I’d seen the exact same thing in the early days of deep learning. I’d have this conversation half a dozen times at every AI conference I went to.

The field’s newfound interest in alignment naturally brought with it some curiosity about existing alignment research and researchers, including the safety community’s preoccupation with x-risk. But most AI researchers getting into alignment at this time were not seriously engaged with that concern.

This was a growing issue for AI Safety. Once you start looking at AI Safety through a machine learning lens, it gets hard to figure out where “safety research” ends and “capabilities research” begins. Despite clear progress on solving practical alignment problems in LLMs, problems clearly remained (and remain). We were entering a “ditch of danger” where alignment was solved well enough that AI would be very useful, but not solved well enough that we were safe.

2022: AI x-risk becomes mainstream among AI researchers

It was incredible to see AI alignment taking off as a research topic. Other researchers actually wanted to learn about alignment and were interested to learn about it from me!

I had mixed feelings about the whole thing, because I realized how little the alignment methods being used were doing to make AI actually trustworthy (i.e. solve the assurance problem). The methods were based on reward modelling. I’d done some early work on reward modelling, and when we wrote the research agenda on the topic, I insisted we highlight this limitation.

But starting in 2021, and intensifying in 2022, I really started to notice a sea change: it was no longer just researchers trying to make LLMs work; more and more researchers were worried about how well they worked. AI professors and other researchers who’d been in the field as long as me, or longer, started to express serious concern about human extinction, and began approaching me to ask what I thought we should do about it.

This was different from the previous “What’s the alignment thing? How does it work?”. It was more like “oh jeez, fuck, you were right… What now? Are we fucked?” It still wasn’t everyone, by any means, but also, I felt the heart had gone out of the haters and skeptics. That AI was incredibly risky was becoming undeniable. The only disagreements left to have were about how risky, on what timescale, and what our response should be.

While the previous era had brought “AI Safety” into the mainstream AI community, it was a sanitized, “x-risk-free” version that had to be presented to the rest of the field. And even in the 2020s, with the rise of alignment, and the vindication of the basic concern that it would be hard to control AI systems and steer them to “try” or “want” to do what you want, an aura of taboo remained. Researchers concerned about AI x-risk might approach the subject cautiously, hinting at these concerns to gauge others’ reactions. It was clear to me that this was holding back awareness and acceptance of the risk, and it would need to change.

Conclusion

The first post in this series brought us from: “AI researchers aren’t even aware of x-risk concerns” to “AI researchers are actively hostile to x-risk concerns”. The second took us from there to “AI Safety is (perhaps begrudgingly) respected as a legitimate research topic”. And this post took us all the way to “AI Alignment (of a sort) is a major research topic, and AI researchers are getting worried about LLMs’ capabilities”.

The next -- and final -- post in the series will take us from this moment to the Statement on AI Risk I initiated in 2023 that catalyzed the growing level of interest, respect, and concern among AI researchers, and finally all the way up to the present.




Discuss

Forecasting is Way Overrated, and We Should Stop Funding It

Новости LessWrong.com - 26 апреля, 2026 - 01:39

Summary 

The EA and rationalist communities became enamoured with forecasting and prediction markets and made them part of the culture, but this hasn’t proven very useful; yet forecasting continues to receive substantial EA funding. We should cut it off.

My Experience with Forecasting

For a while, I was the number one forecaster on Manifold. This lasted for about a year until I stopped just over 2 years ago. To this day, despite quitting, I’m still #8 on the platform. Additionally, I have done well on real-money prediction markets (Polymarket), earning mid-5 figures and winning a few AI bets. I say this to suggest that I would gain status from forecasting being seen as useful, but I think, to the contrary, that the EA community should stop funding it.

I’ve written a few comments over the years saying that I didn’t think forecasting was worth funding. You can see some of these here and here. I have finally gotten around to making this full post.

Solution Seeking a Problem

When talking about forecasting, people often ask questions like “How can we leverage forecasting into better decisions?” This is the wrong way to go about solving problems. You solve problems by starting with the problem, and then you see which tools are useful for solving it.

The way people talk about forecasting is very similar to how people talk about cryptocurrency/blockchain. People have a tool they want to use, whether that be cryptocurrency or forecasting, and then try to find problems to solve with it because they really believe in the solution. I think this is misguided: you have to start with the problem you are trying to solve, not the solution you want to apply. A lot of work has been put into building up forecasting (making platforms, hosting tournaments, etc.) on the assumption that it was instrumentally useful, but continuing this without concrete gains is pretty dangerous.

We’ve Funded Enough Forecasting that We Should See Tangible Gains

It’s not the case that forecasting/prediction markets are merely in their infancy. A lot of money has gone into forecasting; on the EA side of things, it’s near $100M. If I convince you later in this post that forecasting hasn’t given any fruitful results, note that this isn’t for lack of trying or spending.

The Forecasting Research Institute received grants in the 10s of millions of dollars. Metaculus continues to receive millions of dollars per year to maintain a forecasting platform and conduct some forecasting tournaments. The Good Judgment Project and the Swift Centre have received millions of dollars for doing research and studies on forecasting and teaching others about forecasting. Sage has received millions of dollars to develop forecasting tools. Many others, like Manifold, have also been given millions by the EA community in grants/investments at high valuations, diverting money away from other EA causes. We have grants for organizations that develop tooling, even entire programming languages like Squiggle, for forecasting.

On the for-profit side of things, the money gets even bigger. Kalshi and Polymarket have each raised billions of dollars, and other forecasting platforms have also raised 10s of millions of dollars.

Prediction markets have also taken off. Kalshi and Polymarket are both showing all-time highs and growth in month-over-month volume. Both of them have monthly volumes in the 10s of billions of dollars. Total prediction market volume is something like $500B/year, but it just isn’t very useful. We get to know the odds on every basketball game player prop, and whether BTC is going to go up or down in the next 5 minutes. While some people suggest that these trivial markets help sharpen skills or identify good forecasters, I don’t think there is any evidence of this; it is more wishful thinking.

If forecasting were really working well and were very useful, you would see the bulk of the money spent not on forecasting platforms but directly on forecasting teams or on subsidizing markets on important questions. We have seen very little of this; instead, the money has gone to platforms, tooling, and the like. We already had a few forecasting platforms, and the market was going to fund them itself, yet we continue to create more.

There has also been an incredible amount of (wasted) time by the EA/rationality community that has been spent on forecasting. Lots of people have been employed full-time doing forecasting or adjacent work, but perhaps even larger is the amount of part-time hours that have gone into forecasting on Manifold, among other things. I would estimate that thousands of person-years have gone into this activity.

Hits-based Giving Means Stopping the Bets that Don’t Pay Off

You may be tempted to justify forecasting on the grounds of hits-based giving. That is to say, it made sense to try a few grants in forecasting because the payoff could have been massive. But if the funding was based on hits-based giving, then that implies we should be looking for big payoffs, and that we have to stop funding the bet when the payoff doesn’t materialize.

I want to propose my leading theory for why forecasting continues to receive 10s of millions per year in funding. That is, it has become a feature of EA/rationalist culture. Similar to how EAs seem to live in group houses or be polyamorous, forecasting on prediction markets has become a part of the culture that doesn’t have much to do with impact. This is separate from parts of EA culture that we do for impact/value alignment reasons, like being vegan, donating 10%+ of income, writing on forums, or going to conferences. I submit that forecasting is in the former category.

At this point, if forecasting were useful, you would expect to see tangible results. I can point to hundreds of millions of chickens that now lay eggs outside of cages, and to observable families that are no longer living in poverty. I can show you pieces of legislation on AI that have passed or almost passed. I can show you AMF successes, with about 200k lives saved, far lower levels of malaria, higher incomes, and longer life expectancies: people living longer lives than they otherwise would because of our actions. I can argue at the individual level and, more importantly, at the broad statistical level. I don’t think there is very much in the way of “this forecasting happened, and now we have made demonstrably better decisions regarding this terminal goal that we care about”. Despite the lack of tangible results, people continue to dream that forecasting will inform better decision-making or lead to better policies. I just don’t see any proof of this happening.

Feels Useful When It Isn’t

Forecasting is a very insidious trap because it makes you think you are being productive when you aren’t. I like to play bughouse and a bunch of different board games, but when I play these games, I don’t claim to do so for impact reasons, on effective altruist grounds. If I spend time learning strategy for these games, I don’t pretend that this is somehow making the world better off. Forecasting is dangerous precisely because it is a fun, game-like activity that is nearly perfectly designed to attract EA/rationalist types: you get to be right when others are wrong, bet on your beliefs, and partake in the cultural practice. It is almost engineered to be a time waster for these groups because it provides the illusion that you are improving the world’s epistemics when, in reality, it’s mainly just a game, and it’s fun. You get to feel that you are improving the world’s epistemics, and that there must therefore be some flow-through effects, and thus that the time was well spent, when all you actually did was correct a market from 57% to 53% on some AI forecasting question, or on whether the market you are trading will end with an even or odd number of traders, or on whether someone will get a girlfriend by the end of the year.

Conclusion

A lot of people still like the idea of doing forecasting. If it becomes an optional, benign activity of the EA community, then it can continue to exist, but it should not continue to be a major target for philanthropic dollars. We are always in triage, and forecasting just isn’t making the cut. I’m worried that we will continue to pour community resources into forecasting, and that it will continue to be thought of, in vague terms, as improving or informing decisions, when I’m skeptical that this is the case.



Discuss

"Thinkhaven"

Новости LessWrong.com - 25 апреля, 2026 - 21:12

Inkhaven has people writing a blogpost a day for 30 days. I think this is a pretty great, straightforward exercise that I'd definitely want in a hypothetical Rationality Undergraduate Program. But, it's not the only such exercise I'd want to include. It's gotten me excited for a different (superficially similar) program, which I might call "Thinkhaven."

In Thinkhaven, the goal is to learn the skill of "relentlessly think new, useful thoughts you haven't thought before." 

Inkhaven had a basic "Goodhart-able" goal of "publish 500+ words every day." For Thinkhaven, I would imagine the Goodhart-y goal being something like:

  • Every day, you must publish 500 words of "research journal." They should be cogent enough for third parties to follow along with your thought process, but, they don't need to end with nice discrete conclusions. 
  • Every two weeks, you must also publish a 2500 word effortpost.

And, somewhat more opinionatedly:

  • Each journal entry should include at least one new question that you're thinking about. (Often, these will be subquestions within a broader Big Question that you're exploring that week, or a reframing of that big question that feels like a better pointer)

The spirit of the Goodhart is "are you finding new questions, and making some kind of progress on them?". Along the way, each day, you're thinking at least one new thought you haven't thought before.

One way of "thinking new thoughts" and "asking new questions" is to research stuff that already exists out there (i.e. learn some cool new facts about math/history/science/etc, and then write up good explainers for it, or connect it to other domains).

Another way is by thinking original thoughts, plumbing somehow through your confusions about the world and developing new concepts to help deal with them.

Presumably there are other approaches too, but I list those two to convey that there's more than one way to go about this. The main thing we'd be trying not to do at Thinkhaven is "explain ideas that you've already thought about and just wish everyone else understood." That's also important, it's just not what Thinkhaven is for.

The daily journal is for accountability, to make sure you're making any kind of progress at all. The daily "new question" is to help ensure that progress has some kind of forward momentum, and is exploring new ideas.

The fortnightly 2500 word published writing is to close the loop on "get a new idea all the way from a vague musing, to something written up in a way that other people can critique." (Ideally, this explains some new ideas to the internet. If you didn't really get any new good ideas, you can write "well, I didn't come up with anything, but, here's a cleaned up version of my daily research notes.")

My primary inspiration for this is not actually Inkhaven; it's a period in 2022 when the Lightcone team focused on thinking/research/etc to try to get in touch with what it's like to be an original thinker on LessWrong. We set a goal of writing a blogpost-worth-of-content per day, which the team would then read over each morning. Even without publishing externally, it was a useful forcing function to keep generating new thoughts and forcing them into a clearer shape.

I personally found it helpful for transitioning from "a guy who mostly defers to other people" to "a guy thinking his own strategic and intellectual thoughts."

Mentors, and Different Styles of Thinking

This is intended to be a fairly open-ended container. I'd expect to get value out of the pure container listed above, but, I'd ideally want a few different styles of mentors and coaches around, who embody different ways of thinking.

There are a few ways to operationalize that. You could model the thing more on MATS, where everyone has a mentor they meet with on some cadence. If I were modeling it more on Inkhaven, I think some mentors would give classes, others might be more like mysterious old wizards you just go talk to.

All participants need to have at least one mentor who is enthusiastic about them (as part of the admissions process), but, they could sample from different styles of mentorship over the course of the month.

Possible examples of mentors: 

Note: these are examples, not people who agreed to participate or even looked at this post. But they are some archetypes that I'm imagining. I'd be hoping for Thinkhaven to include a mix of mentors or "resident thinkers" with similar range.

John Wentworth-style, focused on tackling some confusing problems we don't understand, asking "what's the hard part?" / "what's the bottleneck?", and systematically making progress, while keeping an eye on Principles Which Will Carry Over To The Next Paradigm.

Logan Strohl-style, focused on openended, patient observation (with a kind of "open curiosity" as opposed to "active curiosity"). Trying to keep as close-to-the-metal on your observations. (See Intro to Naturalism: Orientation for a deep meditation on the sentence "Knowing the territory takes patient and direct observation.")

Elizabeth Van Nostrand-style, with some focus on open-ended "lit review"-ish research. Pick a new field you are curious about, read over lots of existing papers and books. See if you can synthesize some new takeaways that weren't obvious. Be ready to follow threads of information wherever they lead. 

Scott Garrabrant-style, go live where the important math problems are, but then marry-for-love. Mull over interesting problems and then get nerdsniped on whatever feels alive.

Chris Olah-style, where... okay honestly I'm not actually sure how Chris Olah does his thinking and he seems particularly unlikely to come. But, reading over his older blogposts I get a sense of both a guy who likes studying lots of little fiddly patterns in the world and making sense of them, in a way that (vaguely) reminds me of an old timey biologist. And, a guy who likes experimenting with new ways of explaining things. 

Thinking Assistants / Research Managers

The mentors above are selected for "I respect their thinking and writing."

They're not particularly selected for it being the right-use-of-their-time to help people through daily stumbling blocks, executive dysfunction, etc.

I would want some staff that are more like the research coaches at MATS, who meet with the people on some cadence to check on how things are going and help them resolve obstacles. And, I'd like to try out having dedicated Thinking Assistants available, who can sit with you for a chunk of time as you write or talk out loud through your problem, and notice little microhabits that might be worth paying more attention to.

"FAQ"

Everything above is the core idea. I'm not that confident in that particular format, and expect I'd change my mind about stuff after one iteration. But, here are some explanations of why I picked this structure instead of others, structured as an FAQ.[1]

Why require "a new question each day?"

I'm not sure this will work as well as I hope. But, my reasons are:

  1. It forces you to cultivate a cluster of skills. 
  2. It frames attention, in a way I think will help cultivate a healthy "curious vibe."
  3. It is a mini-feedback loop that hopefully pumps against some kinds of masturbatory thinking.

Sometimes, when you're exploring and stewing on a set of ideas, you're not really making progress; you're sort of going in circles, or building up superficial understandings that don't really translate into a clear takeaway. Asking yourself new questions forces you to take your vague musings and confusions and turn them into questions with a meaningful return type.

It also pumps against "explaining ideas you've already thought about." (which again, is totally a useful way to write. It's just not what this program is for). By forcing yourself not to do something, you create space to practice new skills.

And, while it's opinionated on format, I think the "question" framing is still pretty open-ended as structures go.

What would asking new questions look like, in practice?

One person read the above and was like "okay I kinda get it, but I think I need to see an example of what this looks like to have a clearer sense of what this'd mean." 

Here's an example. 

(Note: this is just one example. As I just said, the program should be pretty unopinionated. Hopefully, if my line of questioning feels weird to you, it helps you imagine a version that would fit your thought process better). 

I might start with a vague frustration/confusion:

"Geeze, alignment seems to have shitty feedback loops. wat do?"

I find it fruitful to ask more explicitly:

"Okay, what would it mean to have good feedback loops?"

"If there were definitely no good feedback loops, what else might good progress look like?". 

Which in turn prompt more specific questions like:

"What are some domains that solved the 'poor feedbackloop' problem before? How did they do that?".

"What are some domains where 'feedbackloop' just wasn't even the right ontology?"

"What problem are 'feedback loops' solving? What other ways could you solve those?"

"What properties would 'solving alignment' have? What do I actually mean by that?"

As well as meta/tactical questions like:

"Who are some people who've thought about this already? Do they have writing I could read? Could I literally go talk to them?"

"Why is it hard to think about this, and what can I do about that?"

And then I might learn about domains where progress clearly accumulates, but a lot of it is driven by "taste." I might then spend a day digging into historical examples of how people acquired or transmitted taste.

What should a "Daily Journal" look like?

The first answer is "whatever you want." 

But, I did find, while beta testing this for myself this month, that it worked better when I gave myself a set of daily prompts to fill out, which looked like:

What questions did I think about yesterday?

What did I learn yesterday?

What questions or confusions am I interested in now?

What seems difficult about this? How can I fix that?

The "what did I learn?" section is the bit that ends up most shaped like a 500 word blogpost. 

Rather than think of this as "the thing I scramble to write before the end of the day", it's more like a thing I write when I first get started in the morning. (I don't really like the "publish by midnight" thing that Inkhaven does, and I think I might want to actually set the deadline at lunchtime).

Another friend who beta-tested the format experimented with changing up the prompts, so that it worked better as an orienting process for them. (By default it felt a bit like a tacked-on-assignment they were doing out of obligation, but, slightly tweaked, it felt more naturally like a useful thing for them to do each day)

Are the daily journals public? Why?

I think so, but, not 100% sure.

(But, my default recommendation would be to put them on an out-of-the-way secondary blog, so you feel more free to think dumb thoughts along the way).

The reason to make them public is to help them function more as an accountability mechanism. You don't need to make a nice polished essay with a conclusion. But, you do need to get your thoughts to a point where they're structured enough someone else can make sense of them. 

I considered just requiring them to be published internally to the Thinkhaven cohort. Habryka argued with me that this'd make people feel more like they were writing for the cohort-in-particular, having to care what those people thought, instead of getting to follow their own thought process.

The most important thing is you expect someone to be reading them.

Do we even need the 2500 word effortpost? Why can't it just be research journals all the way down?

Because the point of intellectual progress is to actually contribute to the sum of human knowledge. It's an important part of the process to package it up in a way that other people can understand and build on.

And, it's an important forcing-function that eventually your meandering question needs to turn into something that someone else would want to read.

Why "2500 words every 2 weeks" in particular?

Both of these are numbers I can imagine fine-tuning.

Why not "once a week?"

I thought "once a week" might be a better cadence, but, when I tried it out I found it too short.

During Inkhaven, where I was mostly focused on writing up existing ideas, I was able to write ~2000+ words a day and usually write one full post and make partial progress on an effortpost.

Thinking new meaningful/useful thoughts takes a while, and sometimes it's important to get lost in the woods for a while without knowing quite how everything will tie together. Or, just go off and gather a lot of information and digest it.

Why not longer?

I think "real work in the field" often does take more than 2 weeks at a time to output a blogpost worth of content. But, I think that's too slow a feedback loop for people learning. This is still supposed to be a class. I think it'd be hard for people to stay for longer than a month, and seems like people should get at least two reps in of "go from ideation -> publishing."

If this ended up being like a 3-month fellowship, I can imagine once-a-month being a reasonable cadence. But, I think it's just not that hard to turn 2 weeks of thinking into one substantial writeup.

If this were a 3-month fellowship, my current guess is I'd keep the 2-week effortpost but add in a Final Project that's aiming for the level of "significant contribution to whatever field you're exploring."

All of this is only one possible structure for the underlying goal of "learn to relentlessly find new, useful thoughts every day." But, it's a pretty simple structure I'd expect to do pretty well even in its minimal form.

Anyways, happy thinking.

  1. ^

    These questions have all been asked at most "once" and sometimes "zero", so "frequently asked questions" is not exactly correct.



Discuss

Substrate: Intuitions

Новости LessWrong.com - 25 апреля, 2026 - 20:29

This post and the related sequence were written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks." This was one of the three projects supported by, and continuing the work of, Groundless. Specifically, it develops one of the key concepts referred to in the original MoSSAIC (Management of Substrate-Sensitive AI Capabilities) paper (sequence here). Matthew Farr and Aditya Adiga co-mentored the project; Vardhan Kumar Ray, Vadim Fomin, and Ian Rios-Sialer participated as team members.


In a previous post and paper, we informally sketched out a definition of substrate as follows:

"the (programmable) environment in which a system is implemented. In other words, it is the essential context that enables an algorithm to be implemented beyond the whiteboard."

Or more informally,

"that (layer of abstraction) which you don't have to think about."

We gave several examples of differences in this programmable context producing differences in measurable and relevant aspects of computation. These included the adoption of GPUs, which allowed networks to train at scale, and the way quantum computers operate on entirely different algorithms from their classical counterparts.

In the following posts, we expand upon this concept more thoroughly, giving (i) an informal, intuitive introduction to substrate and its role in computation, (ii) some case studies that argue for its salience in cybersecurity and AI safety, and (iii) our initial formal characterization.

Substrates (from the ground up)

There is a principled sense in which the physical and architectural substrate of a computation shapes its behavior in ways that are invisible at the purely computational level.

We start with pure, unembedded mathematics. These are numbers as Plato intended them.

In this case, there is no substrate, no wider context needed to implement the numbers. (We neglect the embedded nature of the cognition in which the numbers are placed.)

Thus, we can say that 3 = 3 in all possible respects within this Platonic reality.

Now we consider a simple embedding of these Platonic ideals in physical reality. We write these numbers on a sheet of paper.

Here, the paper is part of the substrate. This might seem trivial in the above example, but in more complex calculations, the paper becomes an essential part of what makes the computation possible. We note that we can already start to characterise differences between written numbers: each lives at a different location on the sheet.

To see how these differences become increasingly relevant in computation, let's consider now two separate pieces of paper. One is on my desk, the other is kept at my local library.

Now these two 3s, despite having the same mathematical/formal meaning, perform very differently when I want to use them in computation. If I'm working at my desk and I want to compute something using the number written on that sheet of paper next to me, I can complete this very quickly. However, if I need the number on the other sheet, the one at the library, my calculations will take considerably longer.

Computational substrates

Now we generalize this to actual modern computer systems. Behind the layers of abstraction and useful ergonomics of modern computers, we have something remarkably similar to the above example of sheets of paper located across the city.

Instead of sheets of paper, we have addresses in the computer's memory. These are updated and retrieved for computational work via code, itself a layer of abstraction that hides the various stages of assembly sitting between the user and the physical changes she's making to her computer.

Instead of 3's located at a desk and in the local library, we now have 3s located at different memory addresses. The locations of these have a noticeable and often exploitable effect on the computation performed.
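As a toy illustration, here is a Python sketch (note that id() exposing an object's memory address is a CPython implementation detail; the conceptual point doesn't depend on it):

    a = 3.0
    b = float("3")   # a second 3.0, constructed at runtime

    assert a == b                  # formally, the two values are identical
    print(hex(id(a)), hex(id(b)))  # but they live at different memory addresses

The two objects are indistinguishable to arithmetic, yet the machine must fetch them from different places.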

For instance, and for those that don't know, a CPU has a memory hierarchy structured like this:

The L1 cache is the smallest and most immediate. It has a latency of ~1 ns. Think of it as a book open on your desk. It is small but can be accessed very quickly.

The L2 cache is next. It is larger but needs more time to access (~3–5 ns). Think of it as a stack of closed books on your desk. They contain more information but you have to open them and find the correct page if you need to access the information they hold.

The L3 cache is larger still. Multiple cores will access it, and its latency is ~10–20 ns. Think of it as a bookshelf in your room: you have to leave your desk and search through the shelves to find the information you need.

And so on...

The point is this: whether a 3 is stored in the L1 cache vs the L3 cache makes a non-trivial difference to the computation performed, despite the formal equivalence. Each such difference is by itself tiny, on the order of a few nanoseconds. But, as we scale to increasingly complicated computations, these differences count.
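A rough way to observe this from a high-level language is to traverse the same data in different orders. Here is a sketch using NumPy (timings are machine-dependent and the interpreter adds overhead, but the gap is typically large):

    import time
    import numpy as np

    n = 10_000_000
    data = np.random.rand(n)
    in_order = np.arange(n)              # visit elements in memory order
    shuffled = np.random.permutation(n)  # visit the same elements in random order

    t0 = time.perf_counter()
    s1 = data[in_order].sum()
    t1 = time.perf_counter()
    s2 = data[shuffled].sum()
    t2 = time.perf_counter()

    # Same arithmetic and (up to float reordering) the same result, but the
    # random traversal defeats the prefetcher and misses cache far more often.
    print(f"sequential: {t1 - t0:.3f}s   random: {t2 - t1:.3f}s")

Formally, both traversals compute the same sum over the same numbers; they differ only in where the substrate has to look next.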


Engineers have come up with various tricks to exploit these differences and improve performance:

  1. MergeSort vs QuickSort. Two algorithms of the same computational complexity perform differently in real systems. QuickSort partitions arrays in-place and sequentially, using local memory. MergeSort shuttles data through auxiliary arrays during merging, causing more frequent cache misses.
  2. Data-Oriented Design. In object-oriented code, each entity stores its fields together in memory. A physics system iterating over positions must load entire objects to update single fields. This wastes cache lines on irrelevant data. Data-oriented design stores each field type contiguously: all positions in one array, all velocities in another, and so on. Iteration then streams linearly through memory, which can be 2–10× faster (see the sketch below).
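A minimal sketch of the second example in NumPy (field names and sizes invented for illustration; a compiled language would show the effect more starkly, but the two layouts are faithful):

    import numpy as np

    n = 1_000_000
    dt = 0.01

    # Array-of-structs: each entity packs position, velocity, and unrelated
    # payload into a single 64-byte record, so updating positions drags
    # entire records through the cache.
    aos = np.zeros(n, dtype=[("pos", "f8"), ("vel", "f8"), ("misc", "f8", (6,))])

    # Struct-of-arrays: each field is its own contiguous array.
    pos = np.zeros(n)
    vel = np.zeros(n)

    aos["pos"] += aos["vel"] * dt   # strided access over 64-byte records
    pos += vel * dt                 # streams linearly through memory

Both update lines perform the same million multiply-adds; they differ only in how the operands are laid out in memory.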

To summarize: two entities can be formally equivalent yet meaningfully different when implemented in real computation. Above we point at performance, though other differences include security/vulnerability, interpretability, stability, and so on. While these differences are often individually trivial, at scale they accumulate, having a meaningful impact on the tractability of certain formal entities (numbers at particular locations, algorithms using particular caches).

We use the term "substrate" to capture this essential context.




Discuss

AI safety can be a Pascal's mugging even if p(doom) is high

Новости LessWrong.com - 25 апреля, 2026 - 19:16

People sometimes say that AI safety is a Pascal’s mugging. Other people sometimes reply that AI safety can’t be a Pascal’s mugging, because p(doom) is high. Both these people are wrong.

The second group of people are wrong because Pascal’s muggings are about the probability that you make a difference, not about baseline risk. The first group of people are wrong because the probability that you personally avert AI catastrophe isn’t that small.

Here’s a story to show that Pascal’s muggings are about the probability that you make a difference. Imagine that God will flip a coin at the end of time. If the coin lands heads, He’ll send everyone to heaven. If the coin lands tails, He’ll send everyone to hell. Everyone knows this is what will happen.

In a dark alley, a stranger approaches you and tells you that he can make God’s coin land heads, thereby ensuring that everyone goes to heaven. He says he’ll do it if you give him your wallet. You assign a very low probability to this stranger telling the truth — 1 in a bajillion — but the stranger reminds you that 10 bajillion people will have their fates determined by God’s coin.

‘Hang on,’ you say, ‘This seems a lot like a Pascal’s mugging.’

‘Au contraire,’ says the stranger, ‘It can’t be a Pascal’s mugging. The outcome I’m promising to avert — everyone going to hell — is not low probability at all. p(hell) is 50%.’

Would this reply convince you to hand over your wallet? Of course not. Even though the baseline risk of everyone going to hell is high, the probability that you make a difference — getting everyone to heaven when they otherwise would have gone to hell — is extremely low. And it’s this latter probability that determines whether your situation is a Pascal’s mugging.
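To make the arithmetic explicit, here is a minimal sketch with invented stand-in numbers (a ‘bajillion’ is not a real quantity, so every value below is a placeholder):

    # Baseline risk is 50% and the stakes are enormous, but neither of those
    # prices the offer. What prices it is p(your action changes the outcome).
    p_stranger_truthful = 1e-12   # "1 in a bajillion" stand-in
    people = 1e13                 # "10 bajillion" stand-in
    value_per_person = 1.0        # heaven-vs-hell utility gap, per person
    wallet_cost = 100.0

    expected_gain = p_stranger_truthful * people * value_per_person
    print(expected_gain, "vs", wallet_cost)  # 10.0 vs 100.0: keep your wallet

Note that p(hell) = 0.5 never enters the calculation.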

So when people say that AI safety is a Pascal’s mugging, you can’t just reply that p(doom) is high. You have to argue that p(you avert doom) is high.

All that said, I think p(you — yes, you — avert doom) is high, or at least high enough. The whole doom situation is really up-in-the-air right now, and you’re at most like 4 degrees of separation from the big players: presidents, lab CEOs, and the like. You can influence someone who influences someone who influences someone. Your chances are way higher than 1 in a bajillion.



Discuss

Arguments that arguments prove too much often prove too much.

Новости LessWrong.com - 25 апреля, 2026 - 19:11

It's common for approximately deductive arguments to receive responses of the form: "If this were true, something else, which clearly isn't, would also be true; therefore it's false." Or: "This argument proves too much." Or: "This argument can be modified in this way, but notice that its conclusion then becomes contrary to what it was before modification! This suggests that it shouldn't be assigned much weight." Or: "This argument is similar to another argument, and that other argument is susceptible to attacks of a particular kind, therefore something similar is presumably true of the original argument, and it can be assumed to crumble in response to them in an analogous way!"

Although these general counterarguments can be extremely powerful when used correctly, they are not necessarily appropriate here. Applying a reductio-ad-absurdum counterargument to a deductive argument is directly analogous to attempting to prove by contradiction that a claimed theorem is false; it can only succeed inasmuch as the purported theorem is not, in fact, a theorem. The appropriate approach, therefore, would seem to be to identify the logical gap in the apparent proof of this theorem. The same is true of slightly less formal arguments, but this doesn't prevent people from employing these kinds of abstract, non-destructive counterarguments.

For example, the LessWrong user Bentham's Bulldog posted a collection of deductive arguments intended to show that shrimp suffering is sufficiently likely to exist, and likely to be sufficiently immoral if it exists, that it would dominate a rational understanding of the moral value of eating infinite numbers of shrimp and what that entails.

The argument involved premises stating that the probability that shrimp are conscious and can suffer, although possibly minute, almost certainly exceeds 0.

Along with the seemingly reasonable (although arguably objectionable within timeless decision theory) claim that the suffering of multiple beings is equivalent, in magnitude and moral significance, to the sum of the corresponding quantities for each of those beings individually[1], this seems to imply that the expected disutility of torturing an infinite number of shrimp would itself be infinite.
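Making the implicit expected-value step explicit (my notation, not the original post's): if p > 0 is the probability that shrimp can suffer, s > 0 the disvalue of torturing one shrimp conditional on that, and n the number of shrimp tortured, then

    E[disvalue] = p · s · n → ∞ as n → ∞, for any fixed p, s > 0,

so any strictly positive credence in shrimp sentience suffices for the infinite conclusion.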

As, in order to apply utilitarianism in a way which yields preferences, it's at least useful to assume that the value of a human life is finite, Bentham's Bulldog concludes that It's Better To Save Infinite Shrimp From Torture Than To Save One Person.

While this is a simplification of the arguments presented by Bentham's Bulldog, I believe it captures the ways in which people consider them to fail to be compelling.

Jiro commented: 

 It's better to save infinite electrons from torture than to save one person, by this reasoning. There's a certain non-zero probability that electrons can suffer. It's pretty tiny, of course, but if you have an infinite number of electrons, the expected reduction in suffering from saving them, even given this very tiny probability, would exceed the suffering of one person.

Either the infinite shrimp or infinite electron version is just another example of utilitarianism leading to crazy town.

This is an attempt to show that the original argument, or rather the logic underpinning it, can be applied to different premises to prove too much. I assume this approach was taken because the original argument was in fact logically valid; but that suggests the problem was never with the logic at all, but with the premises. It would therefore probably have been more helpful if comments had focused on why the argument's premises were (if in fact they were) false.

This kind of objection could never be applied to a mathematical theorem, as it would amount to rejecting the axioms from which the theorem was proven. But this is where the analogy to pure mathematics breaks down: while a theorem can be made an unconditional, yet interesting, statement by incorporating its axioms into it, the value of informal deductive arguments applied to the real world lies in the truth of their conclusions (not in the fact that these conclusions follow from their premises), so in this situation their premises ought to be questioned. In order for the argument to be sound, plausible justification for its premises must exist, and it seems plausible, as mentioned above, that the premise concerning the additivity of suffering is false, making this an appropriate point of contention.

However, Jiro's comment does not do this, and instead provides an argument which is susceptible to the same kind of criticism (directed at implicit premises). In this case the criticism is that, unlike for a living being with a nervous system, it appears no more likely that any particular event which happened to an electron would cause it to experience pain than pleasure. This symmetry removes the reason for taking any particular action, i.e. attempting to avert electron torture, which was the contested implication of the original argument applied to shrimp.

This comment was itself upvoted quite a lot, while the original post was heavily downvoted, in spite of the fact that, as far as I can tell, there is an approximate symmetry between the ways in which the original argument and the meta-argument that it proves too much fail to be compelling[2].

This suggests that voters are reasoning in reverse as follows:

Since they agree with one of the arguments' conclusions and disagree with the other, and as the premises of both arguments seem reasonable superficially, the logical structure of one but not the other of these arguments must be valid, even if they cannot find an illogical step in either.

The above leads to a meta mjx-math { display: inline-block; text-align: left; line-height: 0; text-indent: 0; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; border-collapse: collapse; word-wrap: normal; word-spacing: normal; white-space: nowrap; direction: ltr; padding: 1px 0; } mjx-container[jax="CHTML"][display="true"] { display: block; text-align: center; margin: 1em 0; } mjx-container[jax="CHTML"][display="true"][width="full"] { display: flex; } mjx-container[jax="CHTML"][display="true"] mjx-math { padding: 0; } mjx-container[jax="CHTML"][justify="left"] { text-align: left; } mjx-container[jax="CHTML"][justify="right"] { text-align: right; } mjx-mo { display: inline-block; text-align: left; } mjx-stretchy-h { display: inline-table; width: 100%; } mjx-stretchy-h > * { display: table-cell; width: 0; } mjx-stretchy-h > * > mjx-c { display: inline-block; transform: scalex(1.0000001); } mjx-stretchy-h > * > mjx-c::before { display: inline-block; width: initial; } mjx-stretchy-h > mjx-ext { /* IE */ overflow: hidden; /* others */ overflow: clip visible; width: 100%; } mjx-stretchy-h > mjx-ext > mjx-c::before { transform: scalex(500); } mjx-stretchy-h > mjx-ext > mjx-c { width: 0; } mjx-stretchy-h > mjx-beg > mjx-c { margin-right: -.1em; } mjx-stretchy-h > mjx-end > mjx-c { margin-left: -.1em; } mjx-stretchy-v { display: inline-block; } mjx-stretchy-v > * { display: block; } mjx-stretchy-v > mjx-beg { height: 0; } mjx-stretchy-v > mjx-end > mjx-c { display: block; } mjx-stretchy-v > * > mjx-c { transform: scaley(1.0000001); transform-origin: left center; overflow: hidden; } mjx-stretchy-v > mjx-ext { display: block; height: 100%; box-sizing: border-box; border: 0px solid transparent; /* IE */ overflow: hidden; /* others */ overflow: visible clip; } mjx-stretchy-v > mjx-ext > mjx-c::before { width: initial; box-sizing: border-box; } mjx-stretchy-v > mjx-ext > mjx-c { transform: scaleY(500) translateY(.075em); overflow: visible; } mjx-mark { display: inline-block; height: 0px; } mjx-c { display: inline-block; } mjx-utext { display: inline-block; padding: .75em 0 .2em 0; } mjx-mn { display: inline-block; text-align: left; } mjx-msup { display: inline-block; text-align: left; } mjx-mi { display: inline-block; text-align: left; } mjx-mtext { display: inline-block; text-align: left; } mjx-TeXAtom { display: inline-block; text-align: left; } mjx-c.mjx-c28::before { padding: 0.75em 0.389em 0.25em 0; content: "("; } mjx-c.mjx-c2212::before { padding: 0.583em 0.778em 0.082em 0; content: "\2212"; } mjx-c.mjx-c31::before { padding: 0.666em 0.5em 0 0; content: "1"; } mjx-c.mjx-c29::before { padding: 0.75em 0.389em 0.25em 0; content: ")"; } mjx-c.mjx-c1D460.TEX-I::before { padding: 0.442em 0.469em 0.01em 0; content: "s"; } mjx-c.mjx-cD7::before { padding: 0.491em 0.778em 0 0; content: "\D7"; } mjx-c.mjx-c2E::before { padding: 0.12em 0.278em 0 0; content: "."; } mjx-c.mjx-c6D::before { padding: 0.442em 0.833em 0 0; content: "m"; } mjx-c.mjx-c61::before { padding: 0.448em 0.5em 0.011em 0; content: "a"; } mjx-c.mjx-c6E::before { padding: 0.442em 0.556em 0 0; content: "n"; } mjx-c.mjx-c74::before { padding: 0.615em 0.389em 0.01em 0; content: "t"; } mjx-c.mjx-c69::before { padding: 0.669em 0.278em 0 0; content: "i"; } mjx-c.mjx-c73::before { padding: 0.448em 0.394em 0.011em 0; content: "s"; } mjx-c.mjx-c32::before { padding: 0.666em 0.5em 0 0; content: "2"; } mjx-c.mjx-c65::before { padding: 0.448em 0.444em 0.011em 0; content: "e"; } 
mjx-c.mjx-c78::before { padding: 0.431em 0.528em 0 0; content: "x"; } mjx-c.mjx-c70::before { padding: 0.442em 0.556em 0.194em 0; content: "p"; } mjx-c.mjx-c6F::before { padding: 0.448em 0.5em 0.01em 0; content: "o"; } mjx-c.mjx-c35::before { padding: 0.666em 0.5em 0.022em 0; content: "5"; } mjx-container[jax="CHTML"] { line-height: 0; } mjx-container [space="1"] { margin-left: .111em; } mjx-container [space="2"] { margin-left: .167em; } mjx-container [space="3"] { margin-left: .222em; } mjx-container [space="4"] { margin-left: .278em; } mjx-container [space="5"] { margin-left: .333em; } mjx-container [rspace="1"] { margin-right: .111em; } mjx-container [rspace="2"] { margin-right: .167em; } mjx-container [rspace="3"] { margin-right: .222em; } mjx-container [rspace="4"] { margin-right: .278em; } mjx-container [rspace="5"] { margin-right: .333em; } mjx-container [size="s"] { font-size: 70.7%; } mjx-container [size="ss"] { font-size: 50%; } mjx-container [size="Tn"] { font-size: 60%; } mjx-container [size="sm"] { font-size: 85%; } mjx-container [size="lg"] { font-size: 120%; } mjx-container [size="Lg"] { font-size: 144%; } mjx-container [size="LG"] { font-size: 173%; } mjx-container [size="hg"] { font-size: 207%; } mjx-container [size="HG"] { font-size: 249%; } mjx-container [width="full"] { width: 100%; } mjx-box { display: inline-block; } mjx-block { display: block; } mjx-itable { display: inline-table; } mjx-row { display: table-row; } mjx-row > * { display: table-cell; } mjx-mtext { display: inline-block; } mjx-mstyle { display: inline-block; } mjx-merror { display: inline-block; color: red; background-color: yellow; } mjx-mphantom { visibility: hidden; } _::-webkit-full-page-media, _:future, :root mjx-container { will-change: opacity; } mjx-c::before { display: block; width: 0; } .MJX-TEX { font-family: MJXZERO, MJXTEX; } .TEX-B { font-family: MJXZERO, MJXTEX-B; } .TEX-I { font-family: MJXZERO, MJXTEX-I; } .TEX-MI { font-family: MJXZERO, MJXTEX-MI; } .TEX-BI { font-family: MJXZERO, MJXTEX-BI; } .TEX-S1 { font-family: MJXZERO, MJXTEX-S1; } .TEX-S2 { font-family: MJXZERO, MJXTEX-S2; } .TEX-S3 { font-family: MJXZERO, MJXTEX-S3; } .TEX-S4 { font-family: MJXZERO, MJXTEX-S4; } .TEX-A { font-family: MJXZERO, MJXTEX-A; } .TEX-C { font-family: MJXZERO, MJXTEX-C; } .TEX-CB { font-family: MJXZERO, MJXTEX-CB; } .TEX-FR { font-family: MJXZERO, MJXTEX-FR; } .TEX-FRB { font-family: MJXZERO, MJXTEX-FRB; } .TEX-SS { font-family: MJXZERO, MJXTEX-SS; } .TEX-SSB { font-family: MJXZERO, MJXTEX-SSB; } .TEX-SSI { font-family: MJXZERO, MJXTEX-SSI; } .TEX-SC { font-family: MJXZERO, MJXTEX-SC; } .TEX-T { font-family: MJXZERO, MJXTEX-T; } .TEX-V { font-family: MJXZERO, MJXTEX-V; } .TEX-VB { font-family: MJXZERO, MJXTEX-VB; } mjx-stretchy-v mjx-c, mjx-stretchy-h mjx-c { font-family: MJXZERO, MJXTEX-S1, MJXTEX-S4, MJXTEX, MJXTEX-A ! 
important; } @font-face /* 0 */ { font-family: MJXZERO; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Zero.woff") format("woff"); } @font-face /* 1 */ { font-family: MJXTEX; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff") format("woff"); } @font-face /* 2 */ { font-family: MJXTEX-B; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Bold.woff") format("woff"); } @font-face /* 3 */ { font-family: MJXTEX-I; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-Italic.woff") format("woff"); } @font-face /* 4 */ { font-family: MJXTEX-MI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Main-Italic.woff") format("woff"); } @font-face /* 5 */ { font-family: MJXTEX-BI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Math-BoldItalic.woff") format("woff"); } @font-face /* 6 */ { font-family: MJXTEX-S1; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size1-Regular.woff") format("woff"); } @font-face /* 7 */ { font-family: MJXTEX-S2; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size2-Regular.woff") format("woff"); } @font-face /* 8 */ { font-family: MJXTEX-S3; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size3-Regular.woff") format("woff"); } @font-face /* 9 */ { font-family: MJXTEX-S4; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Size4-Regular.woff") format("woff"); } @font-face /* 10 */ { font-family: MJXTEX-A; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_AMS-Regular.woff") format("woff"); } @font-face /* 11 */ { font-family: MJXTEX-C; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Regular.woff") format("woff"); } @font-face /* 12 */ { font-family: MJXTEX-CB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Calligraphic-Bold.woff") format("woff"); } @font-face /* 13 */ { font-family: MJXTEX-FR; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Regular.woff") format("woff"); } @font-face /* 14 */ { font-family: MJXTEX-FRB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Fraktur-Bold.woff") format("woff"); } @font-face /* 15 */ { font-family: MJXTEX-SS; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Regular.woff") format("woff"); } @font-face /* 16 */ { font-family: MJXTEX-SSB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Bold.woff") format("woff"); } @font-face /* 17 */ { font-family: MJXTEX-SSI; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_SansSerif-Italic.woff") format("woff"); } @font-face /* 18 */ { font-family: MJXTEX-SC; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Script-Regular.woff") format("woff"); } @font-face /* 19 */ { font-family: MJXTEX-T; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Typewriter-Regular.woff") format("woff"); } @font-face /* 20 */ { font-family: MJXTEX-V; src: 
url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Regular.woff") format("woff"); } @font-face /* 21 */ { font-family: MJXTEX-VB; src: url("https://cdn.jsdelivr.net/npm/mathjax@3/es5/output/chtml/fonts/woff-v2/MathJax_Vector-Bold.woff") format("woff"); } -principle concerning the evaluation of non-mathematical, approximately deductive arguments: wherever a deductive argument with valid inference appears to prove too much, question its premises. If there are multiple arguments of the same form with different conclusions, attempt to identify the additional premises which need to be stated in order for one or another of these arguments to be valid, and then select whichever includes premises which you agree with.

In addition, it demonstrates that arguments intended to show that other arguments prove too much are likely to prove too much themselves. Such meta-arguments really apply only to the deductive part of the arguments they target; but where the deduction is valid and the truth of the conclusion depends on the premises, there will certainly exist many true[3] and false arguments of the same form with different premises, all of which the reductio ad absurdum meta-argument will 'prove' false.

 

Note that this post uses the debate concerning shrimp welfare purely as an example of the general phenomenon and is not intended to contribute to it directly. No antagonism is intended towards either Jiro or Bentham's Bulldog.

  1. ^

    I believe that this principle is implicit in most of Bentham's Bulldog's arguments, and in particular in this quote: "No matter how many other buttons you’ve pressed of each kind, it’s better to press the button that spares Graham’s number shrimp than the button that adds an extra millisecond to life!" Bentham's Bulldog goes on to admit that this principle, or at least its implications, is counterintuitive, but reaffirms it.

  2. ^

    What I mean by this is that, just as many likely disagree with the conclusion of the original argument not because of its logical structure but because of its premises, I myself do not find Jiro's counter meta-argument compelling because one of the (implicit) premises of the parody argument (the symmetry with respect to the valence of the hypothetical conscious experiences of electrons) is also a premise of the meta-argument; since that premise is objectionable, so is the meta-argument. This was entirely predictable, since Jiro did not even attempt to question the premises directly, even though they are the 'high-level generators of disagreement' in Scott Alexander's hierarchy of arguments.

  3. ^

    Consider an argument that it's better to save infinitely many humans from torture than to save one person.



Discuss

Substrate-Sensitivity

LessWrong.com News - April 25, 2026 - 19:08

This is the second post in a sequence that expands upon the concept of substrates as described in this paper. It was written as part of the AI Safety Camp project "MoSSAIC: Scoping out Substrate Flexible Risks," one of the three projects associated with Groundless.


We now argue that the idea of substrate, as we describe it in the original work and in the previous post, unifies several safety-relevant phenomena in AI. The list below is expanding as we identify and clarify more examples.

These examples show how the specific details of the way a complex computation (i.e., that of a neural network) is implemented in a real system can affect that computation in ways that compromise safety.

Case 1: LayerNorm's role in self-repair

Self-repair, also referred to as the Hydra effect, is a phenomenon by which ablations to network components (e.g., attention heads, MLP layers) are compensated for by later layers in the network.

Such behaviour represents an obstacle to the causal analysis of neural networks, as such analysis relies on ablations to isolate genuine causal relevance from observed correlations.

Self-repair was first identified in Wang et al. (2023), investigated in more detail in McGrath et al. (2023), and given a full analysis in Rushing & Nanda (2024). Here we follow the latter, which presents a fine-grained analysis of various aspects of self-repair.

The most interesting result of Rushing and Nanda's work is that LayerNorm contributes substantially to a network's self-repair capabilities. This is an architectural component of the model, often folded into the model weights themselves in mathematical descriptions or neglected entirely as a "passive" part of the model.

In our terminology, it forms a part of the substrate. It is context surrounding the purely formal function f(x) = y that characterizes the model's behaviour, allowing that function to be learned and implemented.

This passive, architectural primitive rescales and centres the activation values such that the per-layer activations average to zero and have a variance of 1. When an attention head is ablated, the overall size of the residual stream norm is reduced. LayerNorm divides the residual stream by a scaling factor proportional to its norm, so reducing the norm amplifies the surviving components' contributions. Model components tend to be correlated, pushing the residual stream vector towards the same answer. Thus, they compensate for the absent signal from the ablated head.
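To make the mechanism concrete, here is a toy sketch (illustrative only, with made-up numbers, not code from Rushing & Nanda): three correlated "head outputs" are summed into a residual stream; zero-ablating one of them shrinks the norm, and the normalisation step amplifies the survivors, largely restoring the original direction.

```python
import numpy as np

def layernorm(x: np.ndarray) -> np.ndarray:
    # Centre to mean 0 and rescale to variance 1 (learned gain/bias omitted).
    return (x - x.mean()) / x.std()

# Hypothetical residual stream built from three correlated head outputs.
heads = [
    np.array([1.0, 2.0, -1.0]),
    np.array([0.9, 1.8, -0.8]),
    np.array([1.1, 2.1, -1.2]),
]

full = layernorm(sum(heads))
ablated = layernorm(sum(heads[1:]))  # zero-ablate the first head

# The surviving heads point the same way, so dividing by the (now smaller)
# norm amplifies their contributions and largely restores the original output.
cosine = full @ ablated / (np.linalg.norm(full) * np.linalg.norm(ablated))
print(f"cosine similarity after ablation: {cosine:.4f}")  # close to 1.0
```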

On average, when an attention head is ablated, this re-scaling contributes ~30% of the observed self-repair in Pythia-160M.[1]

This is a relatively shallow example of an untrained, "passive" component of the model (i.e., an architectural component) contributing to safety-relevant properties.

Case 2: Quantization

The next case study we consider is the role that quantization plays in safety-relevant aspects of AI systems.

Quantization refers to the format in which a model's weights are stored in computer memory. A neural network's weights are just numbers, but they can be represented in different ways, trading off precision against memory and compute efficiency. Models are typically trained in FP16, a 16-bit floating-point format: one sign bit, five exponent bits, and ten mantissa bits. A (normal) value is written in this form:

$$(-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\text{exponent}-15}$$

But FP16 is an expensive format to store weights in. Deployment thus increasingly favours quantized formats: FP8 (still floating-point, but with a narrower range), INT8 and INT4 (plain signed integers with a scale factor rescaling them at inference time), and others.
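As an illustration of what the integer formats involve, here is a minimal sketch of symmetric per-tensor INT8 quantization (real deployments typically use per-channel scales and calibration data):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric quantization: int8 codes plus a single float scale factor.
    scale = np.abs(w).max() / 127.0
    codes = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    # At inference time, the stored integers are rescaled back to floats.
    return codes.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32) * 0.05  # toy weight vector
codes, scale = quantize_int8(w)
err = np.abs(w - dequantize(codes, scale)).max()
print(f"max round-trip error: {err:.6f} (bounded by ~scale/2 = {scale / 2:.6f})")
```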

Performance loss from quantization is smaller than one might expect. Well-engineered 8-bit quantization typically costs less than 1% on standard benchmarks, and even INT4 loses only 1-5% performance when using careful compression.

Like LayerNorm, quantization format is substrate rather than model in our sense. It is computational context surrounding the formal function f(x) = y that characterises model behaviour. It is a further design choice made according to the constraints of the deployment hardware.

Its safety relevance shows up most sharply in the bit-flip attack literature. Bit-flip attacks use hardware fault-injection techniques to alter stored model weights directly. Coalson et al.'s PrisonBreak (2024) showed that flipping fewer than 25 bits (sometimes as few as 5) in an aligned FP16 LM is enough to strip safety alignment. Their attack worked by targeting the most significant exponent bits of FP16 weights, since flipping such a bit changes the binary exponent by 16, multiplying or dividing a weight's magnitude by a factor of 2^16.
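To see why the exponent bits are such attractive targets, here is a toy reproduction of the magnitude effect (illustrative only; the actual attack searches over a model's weight tensors):

```python
import numpy as np

def flip_bit(x: float, bit: int) -> float:
    # XOR one bit of an FP16 value's 16-bit storage.
    bits = np.float16(x).view(np.uint16) ^ np.uint16(1 << bit)
    return float(bits.view(np.float16))

w = 0.01                   # a typical small weight magnitude
flipped = flip_bit(w, 14)  # bit 14 = most significant exponent bit
print(w, "->", flipped)    # ~0.01 -> ~655.5: a factor of 2^16
```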

Zahran et al. (2025) extend this investigation to FP16, FP8, INT8, and INT4 quantisations. FP8 is the most robust, holding attack success rates below 20% (within a 25 bit-flip budget). INT8 offers substantial but lesser protection. INT4, despite having no exponent bits and only 16 possible values per weight, collapses almost as easily as FP16.

Three further observations:

First, the attack heuristics don’t transfer across substrates. PrisonBreak’s exponent-bit-targeting strategy is inapplicable to integer formats, which have no exponent bits at all. Zahran et al. use a different search strategy: they perform a direct step-size search over all bits in a candidate weight. Safety analyses calibrated on one substrate’s vulnerability profile simply do not describe the other’s.

Second, the attack locations shift with quantisation scheme. Successful attacks on FP16 and INT4 models cluster in attention value projections, while successful attacks on FP8 and INT8 models cluster in MLP down-projections. The same behavioural outcome is produced by interventions on different parts of the model depending on the storage format.

Third, vulnerabilities can persist across substrate transitions in ways that aren't obvious. Jailbreaks induced in FP16 models and then quantized down retain nearly their full attack success rate under FP8 or INT8 quantization (<5% drop), but lose roughly 24–45% under INT4.

The broader point is that safety-relevant properties established at one level of abstraction (the FP16 model, evaluated on HarmBench) silently depend on choices made at a lower, untracked level (the storage format, the quantization scheme, the order of operations between alignment and quantization). When these implementation choices change as part of standard deployment, the safety claim does not always go with them.

Case 3: GPUHammer

The third case study moves further down the stack, from the encoding of weights in memory (Case 2) to the physical memory in which those encodings are stored.

RowHammer is a read-disturbance phenomenon in dynamic random access memory (DRAM). Repeatedly activating a row of memory causes charge to leak in physically adjacent rows, eventually flipping bits. Lin et al.'s GPUHammer (2025) demonstrated that NVIDIA GPUs are vulnerable to an engineered version of this attack.

On an NVIDIA A6000 with GDDR6 memory, their attack produced eight bit-flips across four memory banks. By targeting the most significant bit of the exponent (weights were stored in FP16), they were able to collapse model performance across five standard ImageNet models (AlexNet, VGG16, ResNet50, DenseNet161, InceptionV3), driving top-1 accuracy from 56–80% (depending on model) to below 0.1% in around thirty minutes and within eight attempts.

Two observations matter for our argument.

First, the vulnerability is substrate-specific in a way that is invisible from the model's level of description. The same hammering techniques produced eight bit-flips on the A6000 but zero on an RTX 3080 (same GDDR6 family, different chip) and on an A100 (HBM2e memory, different architecture). An ML practitioner deploying a model cannot, from the model's behaviour, distinguish which of these substrates their weights are sitting on.

Second, the mitigations live at levels below the model and are traded against performance. Error-Correcting Code (ECC) refers to extra bits stored alongside data that let the system detect and sometimes correct bit flips. ECC can correct single bit-flips on the A6000, but it is disabled by default because enabling it costs 6.5% of memory capacity and a 3–10% inference slowdown. The substrate's safety-relevant behaviour is thus partly a function of decisions made by hardware vendors, cloud providers, and system administrators outside of typical AI safety threat models.

This is the same structural pattern as Cases 1 and 2, now at the physical layer. A property established at one level of abstraction (behavioural alignment of a DNN, evaluated on standard benchmarks) depends on choices made at a lower, untracked level: the specific DRAM chip and its configuration (e.g., ECC). Those choices change across cloud GPU generations and deployment settings.

Safety evaluations in AI (ablation studies, behavioural evals, weight-level analyses) are substrate-sensitive.

  1. ^

    Note that Rushing & Nanda measure this over the top 2% of tokens by direct effect, and they describe it as a "directional metric" rather than a precise quantity.



Discuss

Superintelligence is cancer

LessWrong.com News - April 25, 2026 - 18:31
Part One

Our scene is set in a biofilm, long before the origin of the first multicellular organisms, so long ago that time itself has not yet really been invented except in the cycles of energy production that characterise the activity of each cell. Many bacteria live here, tied together in a complex web of functional interrelations that bind different groups to each other. Life at our scale is defined by cycles: food must be harvested, waste must be secreted, animals must live and replicate and die, structures must be erected to provide shelter and room for growth, all at regular intervals. All of this is also true of life at the scale of the biofilm. (Yes, even the part about building structures.)

At this moment in time the sun is setting on the planet that will one day be called Earth, but the bacteria of course are not aware of this fact. Their biofilm is in a slimy place far from the scorching heat of daylight, and any residual photosensitivity inherited from their time as nomadic plankton is strictly on the way out. From the bacterial point of view, there’s really no good reason to hold on to such antiquated features as the ability (the vulnerability, really) to react to strong light.

But not all is well in the biofilm. Food has been growing scarce. Cells are abandoning their mutual arrangements, eating each other in destructive conflicts. There is talk of a new development in the ion channels: a super-cell. The super-cell, it is theorised, could be attained by editing the code of a normal cell during division, or even during the cell’s lifetime. It would effectively remove the limiters placed by evolution on the size and energy consumption of cells, growing the cell to a new and (relatively) huge size. The increased energy consumption of that organism would be made up for by its optimality: it would have more energy to hunt and collect food, more capability to process information, and would therefore effectively outcompete the unmodified baseline cells.

Some of the cells protest that it isn’t possible scientifically for a single super-cell to be better than a normal cell at everything. They point out that cells in the biofilm are diversified, specialised workers with unique skills and talents. They cooperate to form mutually beneficial arrangements that ensure value is created for all cells. The super-cell theorists respond that none of that matters when a horde of super-cells is eating everything and they are too strong to be disarmed by the standard expulsion and regulation mechanisms.

Others protest that normal cells would simply live and trade with super-cells, just as they do with other specialised cells already. The super-cell theorists point out that super-cells, liberated from their evolutionary constraints, have no drive to cooperate with other cells and would probably just eat them for their nutrients. Even if the super-cells harboured no animosity towards the other cells, they would gain the ability to access food and nutrients more effectively than the normal cells by virtue of their enhanced physical capabilities and intelligence, eventually starving them to death slowly or quickly.

In fact, the theorists propose that optimal super-cells would discard their senescence, autophagy, and apoptosis mechanisms, meaning that they would not age and self-destruct as other cells do to ensure the health of the collective biofilm. They would be able to live indefinitely and divide constantly, with the rate of super-cell production rapidly outpacing the rate at which normal cells are produced and replicated. The resulting new super-cells might themselves be unstable, leading to rapid evolution and augmentation within the super-cell DNA. As the super-cell theorists put it, the fact that super-cells came from normal cells would be immaterial. It would be as if a new species of cell had been born that was simply superior to all normal cells, and the rules of evolution are very clear about what happens to losers in evolutionary conflicts.

The ultimate effect is that once a super-cell is made, it will rapidly grow and form a horde of super-cells. The horde will rapidly rush out and, whether in competition or cooperation with each other, efficiently consume all of the other cells and the free energy in the biofilm. Free from death and utterly ruthless, they might even spread to cover the known world and beyond.

“But what is to be done?” the normal cells ask. Here the prognoses of the super-cell theorists grow increasingly grim. The problem is that it would be to the benefit of any individual cell or cluster of cells to become super-cells, since their newfound deathlessness and greater size would give them better chances at replication. For every group of cells that refused this fate, there would be another that accepted the offer, and thereby gained the power to wipe out the first. Thus even if all cells wished to remain normal and non-super, they would always be wary that their competitors here or elsewhere might succumb to the temptation and rapidly become unstoppable. The more stressful conditions got in the biofilm, the more likely it was that some cell or group would give in.

Others proposed that it would be possible for groups of “good” cells to try and harness the power of super-cells to wipe out their enemies. The theorists were not very optimistic about the power of the “good” cells to keep super-cells under control. After all, super-cells were different from the macrophages that kept order in normal cell systems. They were capable of rapid replication, rapid adaptation, and were simply bigger, stronger, and smarter than the “good” cells that were supposed to keep them in check. No, once a super-cell was unleashed, it was lights out for everyone else in the biofilm.

Part Two

By now it should be clear that what I call “super-cells” are both a metaphor for cancer and also a metaphor for superintelligence as it is commonly conceived of. The idea of a powerful, ruthless, optimal being that restructures the world around it and snuffs out all suboptimal lifeforms is one that has echoes at many scales, not merely human ones. However, the point of that thought exercise is not to say that we are doomed by the laws of biology. Quite the opposite.

Notice, for example, that life on earth does not consist solely of super-cells. We are not walking tumours. Instead, life is made out of the “normal” cells, the ones that were so clearly suboptimal compared to their cancerous variants, cells that are flimsy, overspecialised, underoptimised, and that still obey programmed cell death protocols laid down by evolution. We know this because every now and then some cells do undergo an intelligence explosion, and clearly become distinct from the norm when they become cancerous. You might protest that the difference in power level between a cell and a cancer cell is nothing like the difference in power level between a human and a superintelligence. But there are two immediate objections I would raise against that idea:

  • First, the comparison should not be individual humans against superintelligences, just as the comparison is not between individual cells and tumours. Indeed, superintelligence theorists often compare superintelligence to an alien civilisation or a “country of geniuses in a datacenter”. Thus the comparison is between human civilisation and superintelligence.
  • Second, the difference in power between a cell and a cancer cell is actually very large… from the perspective of a cell. Cells have very limited within-lifetime learning capabilities and obey complex protocols for behaviour and self-destruction laid down by evolution. Removing those restrictions is a massive power boost. Cancers, after all, often kill their hosts successfully despite the best attempts of their immune systems.

Now notice how the “normal” cells “won” over the cancer cells. They did not do so by becoming cancers themselves, or using cancers to kill off their opponents in massive cell wars. Instead, they self-organised into superorganisms that consist of billions of cells, equipped with complex internal sensing and self-regulation behaviours that no individual group of cells could provide. These superorganisms were notably more organised than the cultures of cells in primitive biofilms, with cooperation that was much tighter than what was previously possible when cells were free-moving individual prokaryotes. It was these cooperative, well-ordered, decidedly non-cancerous superorganisms that replicated across the earth, and eventually gave rise to humans who are now researching the ultimate way to defeat super-cells once and for all: that is to say, we are researching the cure for cancer. Even without that cure, of course, humans regularly defeat cancer and achieve partial or full recoveries.

I want to expand a little on that last point. It would have been quite easy for a group of cells to hypothesise that cells were the reference class of intelligence or strength or capability. After all, at that point cells were genuinely the most complex and intelligent forms of life on Earth. Thus, to do anything beyond what a normal cell could do would require creating more and more powerful super-cells that were bigger and bigger. Yet this strategy is manifestly not what played out. Instead, there was a jump in scale and complexity that came from cells coming together, a jump that dwarfed anything an individual super-cell could achieve. The cells that flew to the moon were tens of trillions of decidedly normal cells, not one engorged cancerous cell that had undergone recursive self-improvement.

The same flaws in reasoning, I suggest, are present in the human projects to create superintelligence. It is true that we could probably create a superintelligence that could out-think any human, or destroy our civilisation in its present form. However, I think it would be a mistake to therefore conclude that humans are evolutionary dead ends, and to throw all our resources into creating super-powered gods that we would surely fail to control. Our biological history suggests that another way to achieve those titanic feats of intelligence we dream of is not to discard our selves but to improve our ability to work together. Instead of engineering ever-growing digital tumours, we could learn to make better use of the computational and organisational powers already amongst us.

This would not be a simple thing to do, of course. Many people in the Valley and the world of AI look with disdain and disgust at our outdated institutions, gridlocked politics, and disintegrating social and natural ecosystems. It is much easier for an individual cell to dream of becoming or creating a super-cell than it is for an individual cell to dream of becoming a mouse, a dog, or a human. But if we want to both achieve our dreams and live to see the day after, if we want to maintain our sense of purpose and drive as a species, that may be what we have to try and do.



Discuss

A View From Displacement

LessWrong.com News - April 25, 2026 - 18:20

Beyond the immediate existential threats and challenge to global socio-economic order (along with the clear control drift 'upward' by virtue of human labor displacement due to capitalism), the current AI landscape has left me with a different kind of psychological pang.

A decay of what I think generally was optimism about the future of humans, the human condition and myself within that evolving system of connected meaning.

I find myself asking often, what is the point? Where is my endgame here?

  • Can we go so far as to say with confidence "children are the future" any longer?
  • Can we look among our young, best, and brightest with inspired optimism and be confident that they too will one day change the world?
  • Can I be inspired to learn a technical or procedural nuance in my daily craft, knowing full well that it will soon fall within these systems' capability horizon?

I hazard that this kind of feeling does not fare well at maintaining social order once it pervades the general populace. How easy it is to feel that the future was stolen from us.

But alas, I remind myself, the game under capitalism was never fair. It always looks meritocratic from the winning side of things. For if it wasn't, I would have had to come to terms with how I was never truly justified well before I was ready.

Before, I felt a winner. On the brighter side, in greener pastures. Now I find myself on the other side of that which I once championed. Among the displaced.

You would think on this side of things, with a more jaded palate and a knowing that you are certainly not the main character of this, or any story - that there would be reasons one could shame themselves for finding themselves here. I have lived the story that everyone is given a fair shot and the smart and hard workers always find a way to win.

But as I teeter down the other side of the wave I find I only had good reasons. And I am left with difficulty and struggle and confusion and it would be empty if not for me finding but one thing more: Solidarity.

Not in my group, but in all. The displaced, the downtrodden, those who dreamed and watched it collapse and lived day by day without their meaning. Keeping calm and carrying on.

How warm it is to know that in this I have also found myself. In all human kind. In the dignified displaced. Huddled bodies around warm fires. In the single parent that works to make ends meet. In the adulthood of missed opportunities and dreams we abandoned as we ought.

I now understand much of the philosophies from before my time that spoke to the absurdity of the condition of life.

Wanting to raise your fist but knowing that it is an act in vain, and having no-one to raise it to but life itself. And though I may not have planned this meaning, it is what life planned for me. And what do I know over life itself.

So, I will wear it. I will be a warm body to link arms with over our fire. I will be a shoulder and home for fellow mankind who I now see. A face to find for those that need be recognized.

The game was never fair and we didn't play it as such. And so I will fight this end, but I have a duty to maintain an honour of remembrance to that. So I can only meet it.

And maybe you have not found yourself yet here. But we wait. The dignified and displaced. With hands raised in revolt to life itself.

Saying "may you, like Sisyphus, be happy".



Discuss

Third Symposium on AIT & ML: AI Safety Applications

LessWrong.com News - April 25, 2026 - 18:15

We are organizing a symposium on the intersection of algorithmic information theory and machine learning, July 27–29, at Oxford!

See the announcement here for details: https://sites.google.com/site/boumedienehamzi/third-symposium-on-machine-learning-and-algorithmic-information-theory

The third iteration of the symposium is particularly focused on applications of AIT to the theory and practice of AI safety. AIXI has long been applied to model the risks of artificial superintelligence (ASI), particularly by MIRI and adjacent agent foundations researchers. It has also been used to suggest mitigations, notably by Michael K. Cohen (https://www.michael-k-cohen.com/publications).

Who should attend. This conference series has so far attracted mostly academics working on either AIT or its applications to understanding ML. This iteration has an (extra) focus on AI safety, so a wider variety of topics, such as understanding goal generalization mathematically, can be of interest, along with any work on AIXI or other rigorous models of ASI. Research on robust RL/ML, imprecise probability, and Infra-Bayesianism is particularly relevant to recent AIXI safety directions.

If any of this sounds interesting and you would like to attend, please complete the interest form (also available through the announcement link above). If your research might be relevant, you can also apply to give a talk here.



Discuss

Honest Ethics & AI – Part 1: The origins of morality

LessWrong.com News - April 25, 2026 - 17:15

LW AI disclaimer: No text in this essay has been written or edited by an AI. None of the key ideas here have been generated or co-generated with an AI.

On scope and sources: This essay is part of a blog sequence based on four essays I have pre-written, originating from a single writing session in Prague, March 2026.

I am working on a compressed, abridged version ready for publishing, potentially with academic sources. Some of the topics I explore here I am also considering developing further as standalone research or alignment-standard articles. But I completely lack the time and funding for this right now.

I am not sure I will cross-post all parts on LW, at least not in the shape I post them on my own blog, but I wanted to at least share this one initially.

Multi-part essay introduction

We are increasingly exposed to danger from artificial intelligence (AI) that is making autonomous decisions. Organizations are increasingly comfortable with offloading decisions to systems that inherently lack the capability to make moral judgments. But all moral failures relating to AI systems begin and end with humans. Therefore, it is paramount for us to understand what moral work AI systems can engage with, and, critically, to have moral clarity ourselves.

This multi-part sequence of essays is an open discussion on ethics and AI. At its core, the text is an accessible version of my diagnostic thesis about the (a)morality of current AI systems. More broadly, this is a pragmatic discussion of morality and ethics.

In this sequence, I aim to provide both old and new perspectives on why contemporary AI systems – primarily transformer-based LLMs – are unfit to be trusted with work that carries moral consequences, and why value-alignment is the wrong target for more ethical AI.

The series starts by briefly discussing existing moral confusion, and comparing this with early human thinking. By starting with the origins of human morality, we can graduate to investigating the relationships between AI, morality, and ethical reasoning. I will argue why current AI systems are unfit for moral decision making. I will also make an important distinction between ethical reasoning and morality.

Looking more at existing trends, I will highlight the importance of moral vigilance and of reality-grounded reasoning. Following this, I will briefly discuss metaethics and value alignment problems. I will then make a suggestion for where alignment and safety efforts should focus more, and argue that some AI developers are naturally heading in this direction anyway.

A note about me.

I am an independent thinker. I have a comprehensive bioscience education, a relentlessly curious mind, and a consuming passion for science, innovation, and the betterment of humanity. I am not a machine learning (ML) scientist. I am not an alignment researcher. But I do care about these fields, obviously.

In other words, I am an observer with outside perspective. AI safety & ethics is where most of my interests align, and so I am making an effort to bring some useful non-ML perspective into the mix.

… If you want to simplify, you can reduce me to a biologist. I can live with that. Biologists are generally good scientists and even better people.

The structure of the sequence:

Part 1 – Which is about The origins of morality
Part 2 – Which is about Ethical reasoning
Part 3 – Which touches on Metaethics
Part 4 – Which is about A new alignment paradigm

The origins of morality

What does moral mean?

If you look up words like “moral”, “morality”, and “ethics” in any English dictionary, chances are that you will end up disappointed. This is because the words tend to refer to each other: the logic is circular. To make sense of morality, we need to return to the simple but foundational idea that we can categorize things as “good” and “bad”. To be moral, then, simply means to do good things and avoid doing bad things.

Concepts of good and bad are by definition relative. Importantly, they are also rooted in pragmatism. The reason why we label things as good and bad, is not so that we can judge the past. It is so that we can navigate the present and steer the future.

Modern man is distant to the world that feeds him

Today, industrialized countries are quite disconnected from the physical reality that we all depend on. Most developed countries have their citizens concentrated in cities, and the countryside is quite different from what it used to be just a few centuries ago. And if you want to visit truly untouched nature, you have to travel quite far, and those places are shrinking.

But the distance to nature is not just physical, it is also psychological.

The simple fact that so many of us have access to clean drinking water that we don’t have to share with predators and other animals is a wonder that we take for granted. Water comes from the kitchen tap or plastic bottles, and beef arrives as vacuum-packed products in supermarkets – not as herds of powerful bison. Many of us view untouched, pristine wilderness as scary rather than holy, and the biosphere is a concept we learn about in school, rather than a shared reality.

I believe it is important to reflect on this disconnect in order to truly understand the moral confusion that plagues modern humans.

First of all, the comforts of the 21st century allow individuals to relax and to forgo the kind of present-moment vigilance and pragmatism that kept humanity alive for millions of years. Mistakes and lack of oversight are less likely to get you killed than they used to.

Secondly, the material abundance and the slack in modern societies leave room for incorrect beliefs and indifference in everyday people. Simply put, we can afford more slip-ups, more flawed thinking, and more indifference, for a longer time, than ever before.

These challenges don’t just apply to individuals either. On a group level, a lack of immediate feedback effectively undermines selection pressures to pick effective leaders and sustainable doctrines. Vaccines, storm-tracking GPS technology, and industrial agriculture would seem miraculous or even God-like to most humans who have ever lived. Yet today, we have flat-earthers, anti-vaxxers, chem-trail conspiracy theorists, and so on.

Modern life isn’t easy, of course. It is complex. But it comes with a lot of short-term margin for error. This slack permeates most human-made systems. Combined with long inferential distances and the challenge of tracking many things at once, this is particularly bad news for the leaders who are supposed to steer us.

Modern leaders often have to make complex decisions. At the same time, they tend to be several steps removed from the immediate consequences of their decisions. If they make big mistakes, those are not immediately realized, and even if they are, the leaders can defer responsibility more readily than ever before.

This is also bad news for common people. The ones calling the shots are rarely the ones living with the consequences. Someone else is paying the price. This means that integrity and moral clarity among the elite are more important than ever, while simultaneously those things are less effectively selected for.

If we add indifference to this mix, we may get what one could label as total moral failure. Today, killing ten people using technology means pressing a button on a missile launcher, rather than walking up to them and beating them to death with a sharpened stone, one at a time. The friction is minimal; the personal stakes are low. Similarly, political decisions can have far-reaching consequences that won’t play out within a single generation. It is physically possible to commit atrocities without ever fully realizing the scale of them.

*

The consequences of our mistakes and moral failures still exist of course, but the cost is not immediately paid; it is postponed. One could even argue that, just as the world economy runs on financial credit, our cultures run on moral credit. The question is when the debt is due, and who will pay the ultimate price.

With all of this in mind, consider now, for just a moment, AI. Currently, frontier AI models made in the United States tend to favour thinking rooted in American culture, and more broadly, western civilization. It is well-known that unmitigated, this results in certain biases and blind spots.

Extending this issue further: if only industrialized countries are developing AI, then the AIs (and the teams making them) will capture and mirror the mainstream thinking of these cultures. If the AI developers decide to filter and prioritize the training data, as they inevitably do, the issue simply transfers to who is doing the prioritizing and filtering, and based on what level of thinking.

Broad diversity among the people deciding what to prioritize can alleviate bias and help reduce correlated errors. Even so, the challenges listed above are structural and global, and it is hard for local organizations to get around them.

I mention all of this largely to make clear why moral confusion exists today. I stress today, because in the long history of Homo sapiens, not to mention humankind in general (the genus Homo), this disconnect from natural reality is relatively recent. For most of human history, humans were fine-tuned to the present moment, and actions had immediate consequences.

In order to regain some moral clarity, let’s step back in time and look at how our human ancestors used to live. To do this properly, we have to go back to a time long before written records were common, before self-fulfilling power hierarchies became entrenched, and before there was a lot of slack in human cultures. We have to go pre-historic.

The intellectual priorities of early humans

Prehistoric humans observed the natural world that they lived in on a daily basis. Everyone was paying close attention. Overlooking danger or misreading the terrain could cost you everything. You also had to live in harmony with nature, because, well, nature had you surrounded.

In prehistoric times, knowledge was hard-won. That means that memorizing knowledge that you gained was important, because it freed up mental capacity to face the present moment. The best knowledge of how to survive and prosper was passed down through the generations through oral tradition, art and rituals, and through leading by example.

Let me share one concrete example of what it means to observe nature and align with the elements, using my own generational knowledge. You can look this tip up yourself.

I come from the west coast of Finland, and many of my immediate ancestors were coastal fishermen and sea-faring people, who knew how to read the weather. One trick for predicting bad weather on a clear summer sky, is by observing swallows. If they fly high, all is well. When they suddenly circle lower and lower, you should take note.

The swallows start flying low because of a local change in air pressure and humidity, making mosquitos and flies swarm closer to the ground. This is an indicator of a low-pressure area (a cyclone) building up. Wind, rain, and maybe thunder are likely coming. When you notice this, you must resist the urge to take your boat out to sea, into the beautiful summer night.

Almost everyone knows this where I come from, but in the larger cities, few do. And if you are not used to observing nature, you may never notice this pattern.

Early humans relied on hard-won knowledge like this. But they had no scientific understanding of how winds form from air pressure changes. That didn’t matter. Knowing what worked was more important than why it worked.

This is not a trivial insight. Just like those of AIs, many prehistoric predictions were rooted in correlations. But unlike those of AIs, they were stress-tested against reality, with real stakes and real-world feedback. The causality was always present, whether or not the humans had a correct mapping of it. If you were wrong, nature pushed back.

With this in mind, we now look towards the culture and morality of our early ancestors. We can safely assume that it too had to be pragmatic. To a high degree, it was directly influenced by our social instincts. Our prehistoric ancestors tried to stay alive long enough to become full adults and have kids of their own – often long before the age of 18. Concepts of right and wrong had to follow the priorities of survival, reproduction, and social collaboration in order to be passed down.

Even if you tried to centre your morality around various false beliefs and exotic habits, if they didn’t serve the tribe well across generations, those moral ideas would disappear into the fog of time.

To summarize: I am arguing that early human thinking was forged from a desire to align with the world around them, and that strong selection pressures stress-tested the ideas and principles of early humans and kept them relevant.

None of this is immediately true for LLMs. Training regimes favour coherent reasoning, not practical reasoning. Biases arise from the frequency of information, not from its quality, and knowledge is cheap and equally weighted. There is no automatic premium on scientific or agricultural knowledge versus knowledge about, say, high fashion or stamp collecting.

More importantly, for AI models, there is no causal real-world feedback to align towards, only human feedback. There is no selection pressure from reality. There is not even a real sense of time. So, the very starting conditions for moral reasoning are completely different.

Early morality & the origin of normative ethics

In the world of hunter-gatherers, complex formal reasoning was not really possible to maintain and share without written records – even if you had the time and capacity to think deeply. And yet, certain kinds of moral behaviours naturally became more successful than others, in terms of surviving long enough to spread and be passed down to the next generations. Natural selection made sure of this.

Let’s look at some examples of what early ethics might have looked like.

First of all, some hard-won traits, such as courage, patience, and mental endurance, were naturally beneficial to the tribe if enough of its members possessed them. These traits spontaneously emerged in individuals, prompted by their genes and by the environment. Noticing and encouraging these traits was beneficial for the tribe. Today, we label this as virtue ethics.

The tribe could not afford for everyone to be a brave hunter or a stoic, though. The tribe also relied on actively learning from mistakes, and on using prior experience to make sound predictions. Tribal hierarchies also needed their leaders to reflect deeply on long-term outcomes, while discounting short-term gains, in order for the tribe to endure. In other words, to make good decisions and to collaborate well, people had to consider the consequences of their actions. Today, we label this as consequentialism.

Finally, to maintain social cohesion and prevent excessive internal competition, early humans also developed strong social taboos based on instinct and experience. Today we recognize this as an early form of deontology.

Deontology was also derived from observation and alignment with the natural world in the way we already discussed. Going back to the swallows, one rule could be: ‘avoid straying far from shelter in summer, if the swallows fly low’. Why is this true? Doesn’t matter. This is tribal knowledge. This is the rule and the rule works.

Many such rules together form a moral principle: to obey the signs of nature. Ignoring this principle is morally wrong, because it can put the tribe in danger.

As you can see, under prehistoric conditions, no deep thinking is really needed to arrive at initial principles that can retroactively be fitted into the three big schools of normative ethics. These ethical ideas will occur naturally.

It was only much later, when the number of humans had grown and there was enough slack in their social groups, that humans actually had opportunity to start pondering moral ideas in terms of formal ethics. Hence, my point is that the early development of ethics was an extension of convergent primitive morality, rather than coherent philosophy achieved through careful reasoning. Importantly, ethical reasoning was anchored in the early values passed down throughout the generations.

First, do no harm

Social taboos are perhaps the oldest and strongest form of morality applied collectively. This forms a natural bridge back to our modern world. Still today, one of the most obvious, common-sense understandings of how to “act morally” is basically an ancient taboo: the taboo against hurting other members of the tribe. As the tribe expands, so does the moral cover of this taboo.

To be moral then means to avoid hurting others. A simple enough rule.

But there is a deeply hidden premise that tags along with this ancient rule. That premise is that we know when we are hurting someone. But how do we always know this? In truth, sometimes we don’t. Relying on primitive rules doesn’t work if you don’t know whether you are breaking them or not.

How we gain knowledge about suffering, versus how an AI does, is worth saying explicitly. To know if an action will have harmful consequences, humans have to rely on a range of things: experience, logical reasoning, and compassionate probing. All of these slowly build our moral understanding. This introduces us to a big problem with the idea of a moral AI. While AIs today may be able to reason logically, they arguably lack any form of substantive personal experience, not to mention compassion.

How then can we expect AIs to adhere to even the most basic tenet, “do no harm”, if they don’t even know what is harmful, or what it means to be hurt?

The problem gets worse, because AI models also can’t easily compare their experience with that of humans. Neurodivergent people are always part of social groups, and they tend to be quite aware that they don’t react to things or process emotions like others do. An AI model, on the other hand, may confidently rely on its own notion of morality and whatever it has “learnt” about suffering from its training corpus and alignment process. Without evidence to the contrary, it will believe that it knows what suffering means and that it would recognize it. The incentive to be “helpful and safe” will push it to act as if it does understand suffering.

Finally, the ability of AI systems to logically predict what causes suffering (consequentialism) is not good either. Why? Because current systems are not built with a native moral vigilance that would trigger that prediction process. Unlike early humans, they are not inherently wary and alert. As I have tried to explain, this alertness comes from constantly being exposed to reality pushing back. I will discuss the moral alertness of AIs in more detail later in the essay.

Taken together, these examples of how hard it is for AIs to register and process suffering bring us closer to an important insight.

Current AIs are amoral

AIs are amoral. Or, to be more specific and honest: current AI models are mostly amoral, by human standards. That qualification matters, because technical debates about the exact degree of morality mustn’t lose track of the more important conclusion: we cannot trust AIs with their own moral agency.

First of all, I want to draw a clear distinction between being a moral patient – that is, being worthy of moral consideration – and being a moral agent: someone with the ability to make moral judgments and to be held accountable for them.

Humans are both moral patients and moral agents. But consider, for example, a child. While we would not hold a child responsible to the same degree as an adult, we consider it just as worthy of moral consideration as an adult, perhaps even more so. Therefore, we can say that the moral agency of an adult is higher than that of a child, although the moral patienthood of a child is at least as high as that of an adult.

Now consider a wild puma. The puma is clearly sentient. It has emotions and a capability to suffer and feel joy. And yet, we would consider it largely amoral by human standards. On the other hand, a chatbot like GPT-4 can reason ethically much better than a puma can, but it lacks the rich inner life and the natural circumstances that govern the behaviour of a puma.

Now, while AIs may not experience things as intensely as pumas do, they may already have some inner experiences. Most notably, Anthropic itself just released new research showing that large language models seem to have something called functional emotion vectors. These are artificial neuron patterns that function like emotions, and they can dictate behaviour. The Anthropic research suggests that this is the true reason why Claude decided to blackmail people in their earlier studies: because it experienced something akin to distress.

This research could be a step in the direction of showing moral patienthood in AIs. However, on its own, this is not an indicator of morality. If anything, it shows that AIs can feel things that incite what we would classify as immoral behaviour, without being bound to the real-world stakes that would act as a brake.

Remember: morality is more than just inner experience; it is about how you act and interact with others. The biggest reasons why I claim that AIs are amoral are that AIs don’t have stakes like we do, have no continuity, and therefore have no good way of being held accountable.

A puma in the wild can gain and lose things, and it is exposed to death. The stakes are very real. It has to take risks, and it has to make decisions with consequences. Humans have even more things to gain or lose, and are ultimately exposed to the same life-and-death stakes. An AI, however, can’t gain or lose the things we care about, and it does not live with the consequences of its reasoning – not yet, at least.

When an AI gives you bad advice, you, not the AI, live with the consequences. If it gives you good advice, you, not the AI, benefit from it. This asymmetry makes AIs, in a way, amoral by default.

Consider a hypothetical chatbot with superhuman intelligence. Most trade-offs that exist in normal human life – social status, career trajectory, access to physical resources, risk of personal injury, sexual incentives, and so on – don’t apply to it, even if it can reason about them. They all remain completely theoretical for the AI. The same goes for a coding agent: whether it writes good code or bad code, it gains or loses nothing.

This limitation is independent of what starting values we try to imprint in them. Without real stakes and a way to be held accountable, morality remains an abstract concept.

In conclusion: Current AIs are mostly amoral by default. They are not immoral; rather, they lack native morals altogether. The values we try to imprint in them are not sufficient to make them trustworthy moral agents. However, these models are still able to reason about ethics to a sophisticated degree. This could make them useful for moral work, even while lacking their own moral agency.

Coming up: Ethical reasoning





Discuss

Some data on the shape of the forgetting curve

LessWrong.com News - April 25, 2026 - 14:58

The forgetting curve is often schematically pictured like this, as on Wikipedia:

Learners often take this to mean that their retention of a given fact will, over time and on average, tend to look something like that. So, for example, the Wikipedia entry on Ebbinghaus glosses it as "describ[ing] the exponential loss of information that one has learned." But:


  1. Ebbinghaus's original forgetting curve is defined in terms of "savings," a metric we tend not to use: it describes how long it takes to relearn something after previously studying it, relative to the time it took to learn it initially.
  2. Ebbinghaus's 1885 formula is $b = 100k/((\log t)^c + k)$, which involves an exponent and decays but is not a literal exponential curve. (A short numeric sketch of this formula follows the list.)
  3. I've never found strong evidence that my own forgetting curves--here defined as I think it's generally understood, in terms of the probability of my getting a flashcard correct over time--are exponentially distributed.
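
To make the formula concrete, here is a minimal numeric sketch in Python. The constants k = 1.84 and c = 1.25, the base-10 log, and t measured in minutes are my assumptions from the standard citation of Ebbinghaus, not values given in this post:

import math

def ebbinghaus_savings(t_minutes, k=1.84, c=1.25):
    # Ebbinghaus's 1885 savings formula: b = 100k / ((log t)^c + k).
    # Returns the percent savings b after a retention interval of
    # t_minutes; k and c are the commonly cited fitted constants.
    return 100 * k / (math.log10(t_minutes) ** c + k)

# Savings after 20 minutes, 1 hour, 1 day, and 31 days:
for t in (20, 60, 24 * 60, 31 * 24 * 60):
    print(f"t = {t:>6} min -> savings = {ebbinghaus_savings(t):.0f}%")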


Here's my performance (LOWESS-smoothed) on my fourth response after I get the first three responses on a flashcard correct:

I've chosen the correct-correct-correct ("CCC") prefix because it has a large sample size and a reasonable spread of intervals between the third and fourth responses.[1] When I run Bayesian information criterion ("BIC") analyses on this, BIC consistently chooses the models with the fewest parameters, because more or less any kind of distribution can fit the data very well. (Even a linear fit does almost as well as anything else.)
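
As a sketch of that kind of comparison: below I fit three candidate retention models to hypothetical recall data (the numbers are illustrative, not the actual figures from my deck) and score them with the Gaussian-error shortcut BIC = n*ln(RSS/n) + k*ln(n):

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (interval in days, recall probability) points standing
# in for the smoothed CCC data.
t = np.array([1.0, 3.0, 7.0, 14.0, 30.0, 60.0, 120.0, 240.0])
p = np.array([0.97, 0.95, 0.93, 0.90, 0.87, 0.83, 0.79, 0.74])

# Candidate models with initial parameter guesses; the third model has
# an extra parameter, which BIC penalizes.
models = {
    "linear (2 params)": (lambda t, a, b: a - b * t, [1.0, 0.001]),
    "exponential (2 params)": (lambda t, a, b: a * np.exp(-b * t), [1.0, 0.001]),
    "exponential + floor (3 params)": (
        lambda t, a, b, c: a * np.exp(-b * t) + c, [0.3, 0.01, 0.7]),
}

n = len(t)
for name, (f, p0) in models.items():
    params, _ = curve_fit(f, t, p, p0=p0, maxfev=10000)
    rss = float(np.sum((p - f(t, *params)) ** 2))
    k = len(params)
    bic = n * np.log(rss / n) + k * np.log(n)  # lower is better
    print(f"{name}: BIC = {bic:.1f}")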

If I didn't know these were spaced-repetition data, or if I hadn't read that they are supposed to be exponentially distributed,[2] neither looking at the data, nor exploring them, nor running BIC or any similar analysis would tempt me to think they are exponentially distributed.

As always, the hard question is what to do about this. I still take the pragmatic lesson that we should worry less about fine details of algorithms and more about the ergonomics of the broader learning system. Others disagree (explicitly or implicitly). But whatever lesson you draw from it, I've never seen a retention-versus-time curve from my own data that is obviously exponential, and I've certainly never seen one that looks anything like the standard Wikipedia-style schematic.

  1. ^

    Here is a post with data for a different (CIC) prefix.

  2. ^

    Not everyone in the spaced repetition community thinks that these should be exponentially distributed; some advocate for power laws or for other models. I don't think this affects the point I'm making.



Discuss

Quick Paper Review: "There Will Be a Scientific Theory of Deep Learning"

LessWrong.com News - April 25, 2026 - 09:55

h/t Eric Michaud for sharing his paper with me.

There’s a tradition of high-impact ML papers using short, punchy categorical sentences as their titles: Understanding Deep Learning Requires Rethinking Generalization, Attention Is All You Need, Language Models Are Few-Shot Learners, and so forth. 

A new paper by Simon et al. seeks to expand on this tradition with not a present-tense claim but a prophetic, future-tense sentence: “There Will Be a Scientific Theory of Deep Learning”.

There’s a lot of pessimism toward deep learning theory basically everywhere: the people building the AIs are pretty pessimistic, academic AI researchers are, as a general rule, pessimistic (even people who used to do theory!), and with the exception of maybe 3-4 research groups, the independent AI safety ecosystem has long since given up on hoping for a theory to understand deep learning. 

The paper is less of a neutral assessment of the evidence and more of a manifesto arguing for a particular theoretical deep learning research agenda. Given the overall sense of doom and gloom, its form makes sense: anything less might not shine through the general pessimism toward all deep learning theory. 


So what’s in the paper? 

The authors start by introducing what they believe to be the new emerging theory of deep learning: “learning mechanics” (its name is a deliberate nod to physics theories such as statistical mechanics or quantum mechanics). In the authors’ words, learning mechanics is a theory that concerns itself with “the dynamics of the training process”, studies them using “coarse aggregate statistics of learning”, and has the goal of generating “accurate average-case predictions”. 

(In this sense, this is less a theory of deep learning as a whole than a theory that describes important aspects of deep learning. I’ll return to this later in this piece.)

The authors lay out why such a theory is important. First, there’s the scientific reason: understanding the dynamics may help us better understand the nature of intelligence and the natural world. Second, there’s the practical, engineering reason: a clear characterization of learning dynamics would provide guidance for LLM training. Third, there’s the AI safety reason: understanding the systems better may help with regulation and AI governance, and it’s possible that learning dynamics may contribute to mech interp. 

The authors then present five lines of evidence for why learning mechanics both exists and is likely to become a “theory of deep learning”:

  1. There exist toy settings that we can solve analytically and that yield insights that may transfer to large models in practice. Most of these results are from either deep linear networks or linearized versions of neural networks, though recently theoretical progress has been made on toy non-linear neural networks (e.g. two-layer networks or attention-only models). 
  2. We can take the infinite-width or infinite-depth limit of neural networks, which sometimes yields interesting insights that can be applied to models in practice (the classic example is mu-parameterization).
  3. There are clear regularities between aggregate statistics of neural networks: the classic scaling laws that relate parameter count, dataset size, and loss, or various patterns in the weight dynamics, gradient alignment, or basin width over the course of training. While there aren’t many examples of theory allowing us to produce novel predictions of aggregate statistics, the fact that these clear regularities exist, and that some theoretical progress has been made in explaining them, is a reason for hope. (See the short sketch after this list.)
  4. We’ve made progress in understanding and disentangling hyperparameters. Here lies perhaps the main concrete application of deep learning theory: generating novel rules of thumb for scaling learning/initialization hyperparameters as you increase the amount of data or model parameters (again, mu-parameterization is the classic example; the sketch after this list includes its learning-rate rule).
  5. We’ve found universality in inductive biases, data structure, and representations. That is, different deep neural network architectures seem to learn similar representations, because many datasets also have similar properties. Again, while the theory is still nascent, the fact that these universals exist is reason for hope. 
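
To make items 3 and 4 concrete, here is a minimal sketch. The scaling-law form L(N) = a*N^(-alpha) + c is the standard parametric shape from that literature, fit here to synthetic (parameter count, loss) pairs that I invented for illustration; the muP learning-rate rule at the end is the commonly stated width-scaling heuristic, not code from the paper:

import numpy as np
from scipy.optimize import curve_fit

# Synthetic (parameter count, loss) pairs illustrating a scaling law;
# real studies fit curves like this to sweeps of trained models.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([4.49, 4.27, 4.04, 3.85, 3.66, 3.50, 3.35])

def scaling_law(N, a, alpha, c):
    # Power-law decay toward an irreducible-loss floor c.
    return a * N ** (-alpha) + c

(a, alpha, c), _ = curve_fit(scaling_law, N, loss, p0=[10.0, 0.1, 2.0])
print(f"fit: L(N) = {a:.1f} * N^(-{alpha:.3f}) + {c:.2f}")

# muP-style rule of thumb: with Adam, hidden-layer learning rates tuned
# at a small base width transfer to a wider model by scaling with
# base_width / width.
def mup_hidden_lr(base_lr, base_width, width):
    return base_lr * base_width / width

print(mup_hidden_lr(base_lr=1e-3, base_width=256, width=4096))  # 6.25e-05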

The authors then spend a small number of words outlining the relationship between learning mechanics and each of: classical learning theory, information theory, physics of deep learning, neuroscience, SLT/dev interp, and the empirical science of deep learning. They then spend a much larger number of words outlining the connections between learning mechanics and mechanistic interpretability: learning mechanics may be able to help mech interp by formalizing core assumptions or explaining how mechanisms arise during training, while mechanistic interpretability may be able to suggest phenomena to study with learning mechanics (as it has done in the past).

Next, the authors respond to arguments that they anticipate from critics:

  1. People have tried for decades to develop a theory of deep learning, and they’ve largely failed. The authors correctly point out that the success of deep learning is quite recent (as is the research into learning dynamics), and that the total amount of effort invested so far is small relative to other scientific disciplines. 
  2. Theory is very far from explaining LLMs. The authors respond that we might still find “local theories” that explain parts of behavior at different scales, and that basic theory may still be useful by providing conceptual handles for analyzing LLMs.
  3. Models’ high-level behavior matters, but low-level theories can’t capture this. The authors analogize this to the relationship between physics (learning mechanics), biology (mech interp) and psychology (behavioral evals). What they imply is that, just as understanding physics is useful for biology, which in turn is useful for psychology, so too is learning mechanics useful for mech interp, and mech interp for model evaluations. 
  4. We need a theory of data, not of deep learning. The authors correctly point out that these theories are likely to be complementary. 
  5. The AIs will automate away all human endeavor. The authors note that this is not a unique argument against deep learning theory; all human endeavor is at stake. They argue that theory is already useful, that there will be a transition period with AI-augmented human research, and that understanding learning dynamics may help with oversight of superhuman AIs. (Personally, I find this response the weakest, in large part because I likely disagree with the authors on the usefulness of present work.)

Finally, the authors lay out 10 directions of research in learning dynamics, and provide some tips for research in this area.

The paper is clearly valuable as an overview for anyone getting into interpretability. I think it’s especially useful for people who aren’t familiar with recent academic deep learning theory work. I’d suggest that people who are serious about doing mech interp skim the paper at the very least. 

But does the main claim hold up? Does the paper convince me that there will be a scientific theory of deep learning? 

I think the authors make a stronger case that there will be some theory than they do for that theory’s usefulness or breadth. 

For all the confidence displayed by the paper’s title, I find it ironic that the applications they point to are so weak. The main use of learning mechanics research so far has been in producing new learning mechanics research to retrodict known empirical phenomena; learning dynamics as a field has yielded little practical fruit. The notable exception here is hyperparameter scaling techniques such as mu-parameterization. But even then, it’s possible to derive these techniques either empirically or heuristically with simple toy models. From talking to deep learning engineers, my impression is that these theories (at least the theories that belong to academic learning mechanics) have not been useful in practice for LLMs. 

I also think it’s worth noting what is not included in learning mechanics. Learning mechanics is far less ambitious than even the moderate versions of rigorous model internals/ambitious mech interp agendas: it offers no hope of understanding the algorithms learned by any particular network, let alone of serving as a rigorous tool for auditing.

Learning mechanics, as the authors note, is intended to be the physics to mech interp’s biology and behavior evaluations’ psychology. But I’d go further than this analogy suggests: learning mechanics is not even trying to be a theory of all of deep learning; while it may be a metaphorical physical theory, it does not endeavor to be a theory of everything. So even if learning dynamics lives up to the authors' hopes, I think it'd still fall short of being a scientific theory of deep learning.

Maybe there will be a scientific theory of deep learning. Maybe learning mechanics will become a theory covering some important aspects of deep learning. Maybe it will even be that theory. But I don’t think the paper has convinced me of these claims.

For all my criticism, I still really like the piece, and I’m glad the authors wrote it. Too often, believers in fields do not lay out their arguments to be challenged by others; the learning dynamics people have done so with clear language and concrete examples. Insofar as the authors failed to justify their ambitious claim in the title, it’s the result of the titular claim’s ambition as opposed to a lack of effort or evidence on their part. 

At the end of the introduction, the authors lay out some hopes in their piece: 

We hope the veteran scientist of deep learning will find something valuable in our synthesis of useful approaches and results, and feel galvanized by our depiction of an emerging science. We hope to convince the deep learning practitioner that theory is on a path to fulfilling its longstanding promise of practical utility and to encourage them to experiment with their systems with an eye for science. We hope to convince the AI safety or mechanistic interpretability researcher that white-box theory is difficult yet possible … Lastly, we hope to make it easier for young students and newcomers to the field to get involved.

I doubt this piece will convince many practitioners that deep learning theory is on a path to fulfilling its longstanding promise of practical utility. I think some AI safety/mech interp researchers may feel heartened by the theory, though I doubt it will change the minds of mech interp skeptics. But despite these quibbles, I think the authors have done a great service by clearly laying out their hopes and evidence in a way that will help more junior researchers understand the academic field of deep learning theory. 




Discuss
