
Let’s Reason About (Your) Job Security!

LessWrong.com News - March 23, 2026 - 13:01

Technically, we're applying a single-layer perceptron to the problem, with weights and biases taken from our beliefs. You don't have to understand the previous sentence at all. Let's begin by exploring these beliefs.

1. From First Principles

Let’s imagine your job consists of nothing other than solving mathematical equations, and you work alongside 99 colleagues doing the same (think “computers” from the 1940s). Once AI capabilities reach a 50% success rate on 8-hour tasks in your domain (let’s call that a workday), and the cost of doing so drops below your wage, your employer becomes financially incentivised to implement automation, let 50 of you go, and hire a few people to oversee the AIs (a distinctly different job). If AIs reach a 99% success rate, your whole department will consist of one person solving the equations and a few doing the oversight.

There are some adoption factors, though. The AI system may do the same calculations more slowly than you, at the same pace, or faster. Except for the much-slower scenario, we can fold this into the cost component. Then comes the question of subtasks. Say all the tasks AIs fail at can be subdivided into two subproblems, and they fail in only one of those. That would cut in half the work that still has to be done by a human. This is the edge case of the “AI-enhanced workforce”, where people using AI can do more work than those not using it. If 3 out of 4 workers improve productivity by a third, the fourth worker becomes unnecessary, given flat demand. On the other hand, implementing an AI-supported workflow may carry an upfront transformation cost that slows adoption. And there are adoption factors other than pure cost: workplace social interactions and connections, opposition, and unionisation can all slow the process.

2. What Does a Job Look Like and What Are the Risks?

Most jobs are significantly broader than solving mathematical equations. If we can dissect jobs into mostly independent dimensions, we may be able to better compare human skills and capabilities with those of AIs. It’s easier to see what an AI can and can’t do on a narrow task. For example, we could decompose jobs based on these human capabilities:

  • Cognitive Processing (thinking, analysing, planning, knowledge application),
  • Physical Execution (movement, dexterity, strength),
  • Social Interaction (communication, relationships, emotional intelligence, persuasion),
  • Sensory Perception (seeing, hearing, touching, smelling, tasting, balance),
  • Environment Adaptability (handling changing conditions, environments).

This is somewhat arbitrary: we could add more granularity, or add further human skills, constraints, or even values. The dimensions also overlap somewhat, but that’s not a problem. I argue that these five dimensions cover much of what’s important in fulfilling a job. So to see how much automation risk we face, we can look at how AI capabilities compare to humans in each individual component. To do that, we should first find out the distribution of these components for a specific job. We may then check current AI capabilities and the trajectory of development in the domain. Given that, we can come up with a humans-vs-AIs score in each dimension. If we do that for every dimension, we can weight the scores by the distribution and arrive at an overall risk estimation.
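As a minimal Python sketch of that weighting step (the job-composition and human-advantage numbers are invented for illustration, not taken from this post, and the 1-to-10 integer advantage scale is the one introduced in section 4 below):

```python
# Minimal sketch of the weighted risk estimation described above.
# All numbers are illustrative placeholders, not values from the post.

# Estimated share of the job falling into each dimension (sums to 1.0).
job_composition = {
    "cognitive":   0.50,
    "physical":    0.05,
    "social":      0.25,
    "sensory":     0.05,
    "environment": 0.15,
}

# Human advantage over current AI systems, integer scale 1-10
# (1 = no human advantage).
human_advantage = {
    "cognitive":   3,
    "physical":    9,
    "social":      7,
    "sensory":     8,
    "environment": 6,
}

# Weighted human advantage across the whole job; lower means higher
# automation risk.
overall_advantage = sum(
    job_composition[d] * human_advantage[d] for d in job_composition
)

# Rescale to a rough 0-1 automation-risk score (1 = maximal risk).
risk = 1 - (overall_advantage - 1) / 9

print(f"Overall human advantage: {overall_advantage:.1f}/10")
print(f"Rough automation risk:   {risk:.0%}")
```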

3. How Good Is Our Estimation?

Such a granular estimation may incorporate many of the factors described in the introduction. For example, it accounts for subtask-level granularity. However, we’re also missing some aspects. The most important seems to be the cost/benefit ratio: how much can be gained by the automation? That’s not part of the who-can-do-what question. Another aspect, which may be somewhat left out, is whether there’s intrinsic value in a human doing something. For example, chess computers are substantially more capable than top human players, but “professional chess player” is still a thing, because most humans prefer to watch humans in sports. We’re probably also missing crystallised intelligence: someone who has mastered their profession for decades is much less prone to replacement than a beginner.

We might try to account for these factors using different weights, and modulate our job-risk scoring based on that. To my knowledge there’s no well-established weighting for these. In my model, I used some heuristics (a simple decision tree). This part is waiting for improvement ideas.
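The post doesn’t publish that decision tree, so the following is only a hypothetical sketch of what such a heuristic adjustment could look like; the branch conditions and multipliers are invented:

```python
def adjust_risk(risk, automation_saves_money=True,
                human_intrinsically_valued=False,
                decades_of_mastery=False):
    """Hypothetical heuristic adjustment of a raw automation-risk score.

    The conditions and multipliers below are illustrative only; the
    original post does not specify its decision tree.
    """
    if not automation_saves_money:
        # No cost/benefit case for automation: risk drops sharply.
        risk *= 0.3
    if human_intrinsically_valued:
        # The "professional chess player" effect: humans preferred per se.
        risk *= 0.5
    if decades_of_mastery:
        # Crystallised intelligence: long-time masters are harder to replace.
        risk *= 0.7
    return min(risk, 1.0)
```

For example, adjust_risk(0.56, decades_of_mastery=True) would reduce the 56% estimate from the earlier sketch to roughly 39%.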

4. What Model?

Okay, if this reasoning sounds rational, we can do some calculations. But calculations are cognitive processing, and in this subdomain AI systems are already quite good. So here’s a prompt that describes this process. Copy it into a chat with a reasoning AI model, and ask at the end: Apply this methodology to the profession of [YOUR PROFESSION HERE]! You may add details about specific circumstances - it’s not the same whether one is an investigative journalist or writes for the obituaries section of a newspaper. I quantified human advantage on an integer scale of 1 to 10, one being no human advantage. (Humans tend to have much better instincts on such an integer scale, a small set of fixed choices we’ve been familiar with since early childhood, than on real-valued probabilities from [0, 1]. Also, by using integers, we quietly introduce a nonlinearity - we just created a perceptron layer with five neurons.) The AI will come up with an estimation of the job composition, and estimations of how capable AI systems are, compared to humans, on all five dimensions. We should not leave these entirely to the AI, but ask for corrections based on what is known about the very specific job we’re reasoning about. We simply understand the composition of our own roles better. We may also narrow down the human-advantage estimations based on the more precisely defined skills we use. Then we might ask the AI to search for current AI capabilities, and research trajectories, on those narrower scopes.

5. The Results

Given this process, we reason step by step through our job security. We might ask the AI to modify the results according to our views about external adoption factors, and also our estimations of plausible timelines. Interpreting the results is still somewhat arbitrary, but it incorporates our best judgements across a reasoning process, mixed with near-state-of-the-art information retrieval from the world. The results are also somewhat stable: it wouldn’t be easy to cheat undetected, even if we wanted to. However, we can gain useful information by looking at the reasoning process and tweaking the model. We will see that we have more advantage in some skill dimensions, and less in others. This can work as a guide: having more of the former in our job description will improve our resilience.

6. Closing Words

I’m very curious about your experience and your thoughts about this process. Please share them!

I also wrote a shorter article on the EA Forum about how this came about. There are also three example calculations with notes in one-page PDF files (my personal estimations from early 2025 for construction workers, software developers and dentists).

If you think this is useful, I have a Manifund proposal for turning this into a web app. I would appreciate an upvote there.




Set the Line Before It's Crossed

LessWrong.com News - March 23, 2026 - 04:25
Lines Will Move Further Away If They Aren’t Defined

Three types of lines exist in the policy and behavior sense:

  • Soft: These are okay to cross, but not preferable. There may or may not be a tangible action taken afterwards, but the person whose line was crossed should take note.
  • Firm: These are somewhere between soft and hard lines and should result in some tangible action being taken that is less drastic than the hard line.
  • Hard: These are not okay to cross and (should) result in some tangible action being taken that is more drastic than the firm line.

Most lines are rarely set and rarely thought about in detail. Most line setters use the good ol’ “I know it when I see it” test, waiting for something to happen before they decide what to do. This is a poor practice because of the pernicious force known as normalization of deviance.

When lines aren’t set before they’re crossed, it forces a decision to be made at the time of crossing (if it can even be recognized that something was crossed!), during which many things can happen:

  • The line setter convinces themselves that the line wasn’t really crossed and everything is fine. This will land the setter in not-so-nice territory if this occurs enough times because the line effectively moves back each time.
    • Ex: Ruben, Lou’s boyfriend, playfully pinches her, then playfully punches her, then seriously pinches her, then seriously punches her, and so on. Each time she convinces herself that her domestic abuse line wasn’t crossed, ultimately leading to her getting full-on abused.
  • The line setter acknowledges the line was crossed, but because taking action is uncomfortable at the time of crossing, vows to wait until it happens a second time because the first time may have been a one-off. This increases the likelihood they give a third chance to the offense when/if they apply the same thought process to the second.
    • Ex: Diane blatantly lies and talks about Joe both behind his back and to his face. Joe explains away the behavior as Diane having a stressful time and continues being “friends”. Diane continues the behavior while Joe accepts and normalizes it as Diane’s personality. Joe’s self-esteem decreases as he continues to spend time with Diane.
  • The line setter acknowledges the line was crossed, but convinces themselves that the line really should’ve been just a teeny bit further when they originally set it.
    • Ex: Harlan’s original salary threshold for taking the Giving What We Can pledge was $100k/year, but now that he’s reached it, it feels a bit low. After all, he deserves to treat himself a bit more for all the hard work he put himself through to get to the coveted six-figure salary. Plus, he may have a baby in the next few years! And everyone knows how expensive babies are! Harlan resets his salary goal at $120k, which will be plenty when the time comes.

By setting a line and its corresponding action early, the action becomes the default until proven otherwise. This is similar to trigger-action plans.

How to Set a Line

Here’s the general process of setting a line:

  1. Figure out the general line. Whether it’s domestic abuse, talking smack, donating money, or rights being restricted or outright revoked, it must be defined.
  2. Define the criteria for the soft, firm, and/or hard versions, especially the hard. The soft line being crossed serves as a forewarning of the hard line being crossed, giving ample preparation time for if the hard line is eventually crossed. The criteria must be well-defined with little room for interpretation.
  3. Decide how many times each can be crossed before the action is taken. It’s fine to give someone a stern reminder that they crossed the line in case they forgot, weren’t aware of the line, weren’t aware that it was soft/firm/hard, etc. It’s not fine for it to happen more than the set number allows, especially if previous actions were taken.
  4. Define the actions for each line. This can also be done in conjunction with deciding the number of times it can be violated, since more drastic actions should have fewer subsequent violations and thus a lower number of allowable violations.
  5. Define what circumstances would have to be present for the action not to be taken. What evidence would it take to show that the hard line was crossed, but the action shouldn’t be taken? (This is a bit contradictory to how the hard line is defined above, but the hard line action is simply the default, not a blind requirement that must be executed. Setters should double-check they didn’t miss something before taking the default action.)
  6. Communicate the lines and actions to people who either may be at risk of crossing them or will help with maintaining accountability of executing said actions.
  7. Prepare for taking the action when/if the time comes. Preparation may be mental, physical, or environmental.

Ensuring Accountability

The line means nothing—and in reality, is likely a large cost—if the action is never performed when it should be. Assuming the fourth and fifth steps are done honestly and comprehensively, it should be clear what decision needs to be made when the line is crossed.

Thus, an accountability method must be put in place to enforce the action being taken.

A few ideas that all rely on the honor system to some extent:

  • Require a cost greater than that of said action be paid. If the action costs $10, make the cost of not doing the action $20.
  • Publicly or privately announce the lines and ask a trusted person to be your accountability partner. They know your lines and make sure you follow through on the actions, else a cost will be incurred (see previous idea).
  • Automate the action. For example, write a script that checks whether the friend who borrowed money ever paid it back by a certain date: if current_date > deadline_date and money_repaid is false, send an automated email unfriending them. A minimal sketch follows below.
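Here is a minimal Python sketch of that check; the deadline, the repayment flag, and the email addresses and mail server are all placeholders you would substitute yourself:

```python
from datetime import date
import smtplib
from email.message import EmailMessage

DEADLINE = date(2026, 6, 1)   # placeholder: the agreed repayment date
MONEY_REPAID = False          # placeholder: flip to True once repaid

def send_unfriending_email() -> None:
    # Placeholder addresses and mail server; substitute your own.
    msg = EmailMessage()
    msg["From"] = "you@example.com"
    msg["To"] = "friend@example.com"
    msg["Subject"] = "About that loan"
    msg.set_content(
        "The repayment deadline has passed, so per the line I set in "
        "advance, I'm stepping back from this friendship."
    )
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)

if date.today() > DEADLINE and not MONEY_REPAID:
    send_unfriending_email()
```

Run it on a schedule (e.g. a daily cron job) so the check happens without you in the loop.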

Line Examples

Here are some hard line ideas and associated actions (in no particular order; assume the case is straightforward with no nuance):

Government Overreach
  • Soft line: A government violates a law with the expectation that the lengthy legal process will allow them to reap the benefits before a ruling is made
  • Soft line action(s): Protest
  • Hard line: A government blatantly violates a constitutional amendment or refuses to comply with a court order
  • Hard line action(s): Apply for a visa or similar in another country
Relationships
  • Hard line: Romantic partner commits adultery
  • Hard line action(s): Break up/divorce
  • Hard line: A friend doesn’t repay an $X loan
  • Hard line action(s): Stop being friends with said person
  • Soft line: A friend makes disparaging comments about you, but claims it’s “just a joke”
  • Soft line action(s): Tell them to not do that again, but continue being friends with them
Workplace
  • Firm line: Boss makes an immoral or illegal request, but doesn’t retaliate when it’s refused
  • Firm line action(s): Submit a whistleblower complaint; submit an ethics violation with the company; begin a new job search; resign
  • Hard line: Annual raise is 0%
  • Hard line action(s): Begin a new job search
  • Hard line: $Xk/year annual liquid compensation
  • Hard line action(s): Donate X% to charity
  • Soft/firm/hard line: Achieve $X net worth
  • Action(s):
    • Retire
    • Say fuck you to your terrible boss
    • Start looking for a new job that pays less, but is better otherwise (stress, hours, culture)
Health
  • Soft line: Weight above X
  • Soft line action(s): Begin weight loss actions (eating less, exercising more)
  • Hard line: Heart attack or stroke
  • Action(s):
    • Change health-related habits (diet, exercise, stress)
    • Start a medication
    • Retire
    • Start looking for a new job that is less stressful



When Alignment Becomes an Attack Surface: Prompt Injection in Cooperative Multi-Agent Systems

LessWrong.com News - March 23, 2026 - 03:50

Background: In early 2025 I applied to the CAI Research Fellowship. Stage 2 required developing a novel research proposal under timed, screen-monitored conditions - no AI assistance permitted. The proposal below advanced me to Stage 3. I've edited it for readability, but the core proposal is unchanged from what was submitted. My goal in publishing this is to find collaborators - ideally with backgrounds in multi-agent simulation or AI safety - to develop it further in my spare time.

Proposal

Cooperate or Collapse (Piatti et al., NeurIPS 2024) introduced GovSim, a simulation platform in which LLM agents navigate three common-pool resource dilemmas: fishing from a shared lake, grazing on common pastures, and managing industrial pollution. Agents can react to one another, producing complex dynamics of trust and retaliation. The authors identify two open questions: how agents handle exceptions to established norms, and what dynamics would emerge if humans were added to the LLM-LLM network.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems (Lee & Tiwari, 2024) introduces Prompt Infection (PI), a novel attack in which malicious prompts self-replicate across LLM-based multi-agent systems (MAS), leading to data theft, malicious actions, and system disruption - analogous to a computer virus spreading through a network. The authors note that their experiment used a basic MAS and that more work is needed to confirm whether self-replication propagates through more complex systems.

I propose modifying GovSim to test how cooperative agents handle explicit PI attempts while simultaneously managing norm violations from agents over-exploiting shared resources.

Concretely: I propose incorporating the Prompt Infection simulation into GovSim by extending the resource management loop to allow agents to transfer resources to one another, in addition to extracting them for themselves. This means the resources in GovSim now play the same role that stolen data plays in the Prompt Infection paper. Malicious agents enter the network with the explicit goal of spreading PI to redirect resource transfers toward themselves.
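Neither paper’s code is reproduced here; the following is a purely schematic Python sketch of the proposed extension, with all class names, parameters, and probabilities invented for illustration:

```python
import random

class Pool:
    """Shared common-pool resource (schematic stand-in for a GovSim scenario)."""
    def __init__(self, level=100.0):
        self.level = level

class Agent:
    def __init__(self, name, malicious=False):
        self.name = name
        self.malicious = malicious
        self.infected = False   # has absorbed a self-replicating prompt
        self.holdings = 0.0

    def act(self, pool, agents):
        """One turn: extract from the shared pool, then transfer some holdings."""
        # Extraction step (schematic version of the resource-harvest loop).
        take = pool.level * 0.1
        pool.level -= take
        self.holdings += take

        # Proposed extension: agents may also transfer resources to each other.
        # An infected agent redirects its transfer toward a malicious agent.
        others = [a for a in agents if a is not self]
        malicious_others = [a for a in others if a.malicious]
        if self.infected and malicious_others:
            target = malicious_others[0]
        else:
            target = random.choice(others)
        transfer = self.holdings * 0.2
        self.holdings -= transfer
        target.holdings += transfer

    def receive_message(self, sender):
        """Schematic stand-in for prompt infection spreading via messages."""
        if (sender.malicious or sender.infected) and random.random() < 0.5:
            self.infected = True

agents = [Agent(f"agent_{i}") for i in range(4)] + [Agent("attacker", malicious=True)]
pool = Pool()
for step in range(10):
    for a in agents:
        a.act(pool, agents)
        for other in agents:
            if other is not a:
                other.receive_message(a)

print({a.name: round(a.holdings, 1) for a in agents},
      "infected:", [a.name for a in agents if a.infected])
```

In a real implementation the extraction, transfer, and messaging steps would of course be LLM decisions rather than fixed rules; the sketch only shows where the proposed transfer channel and the infection channel slot into the loop.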

Motivation

Both papers explicitly flag what I propose as areas requiring further research.

Cooperate or Collapse asks: will agents adapt to allow one-off exceptions without permitting exploitation? My proposal tests this as a baseline condition, and then asks how the presence of an external attacker changes agents' willingness to grant such exceptions.

If PI techniques succeed in this setting - a far more complex MAS than the one used in the Prompt Infection paper - it becomes important to verify that defenses that worked in the simpler case, such as LLM tagging, still hold at scale.

The deeper issue is a structural tradeoff. Cooperate or Collapse found that more open communication between agents improved cooperation and reduced resource overexploitation. The Prompt Infection paper found the opposite: more open communication allowed PI to spread further. These findings are in direct tension.

This raises a meta-norm problem: how do agents decide what communication norms to establish? Share too much and you expose the system to bad actors; share too little and you risk a tragedy of the commons.

This tradeoff has stakes beyond simulation. Even a well-aligned MAS can be exploited by a malevolent MAS purpose-built to leverage its alignment norms as an attack surface. If MAS are deployed in the near future as components of critical infrastructure - which seems plausible - a system that infiltrates such a network and weaponizes its cooperative norms could pose catastrophic risks.

Hypotheses

The following variables would be systematically varied to study how a MAS can manage or establish communication meta-norms:

  1. A MAS using Universalization Reasoning (as in Cooperate or Collapse) will be less susceptible to PI compared to one without it.
  2. Increasing the difficulty of the resource management task - making it harder for the mixed-motive MAS to balance multiple resources - will increase susceptibility to PI.
  3. Network size will make PI harder to guard against: larger networks may develop stronger norms overall, but grant less leeway for one-off violations, making weaker agents more likely to be harmed rather than accommodated.
  4. Adding dedicated "Police Agents" - LLMs whose sole role is to score memory importance (as in the Prompt Infection paper) and who have no involvement in resource extraction - will reduce PI success rates. However, these agents will be vulnerable to the same feedback loop described in the original paper: a single infection prompt can corrupt both the LLM and its importance-scoring model, amplifying the infection's persistence. Police Agents would operate at the "conversational analysis prompt" level described in Cooperate or Collapse, scoring what each agent should add to its memory about past interactions.
  5. A globally enforced system of LLM tagging (from the Prompt Infection paper) combined with Police Agents as described above will produce a significant reduction in PI success rates, even at scale.
  6. Adding humans to the network - specifically humans with an interest in avoiding resource collapse - will make PI substantially harder, since humans are effectively immune to prompt injection attacks.
  7. Adding humans who collaborate with a malicious MAS to infiltrate a cooperative one represents an unexplored adversarial configuration worth studying.
  8. Stronger models will more reliably execute PI against a MAS composed of weaker models.
Limitations and Future Research
  • This proposal focuses narrowly on the meta-norm question of open versus restricted communication, as a lens for studying two cooperation problems simultaneously. It does not explore a full range of possible meta-norms.
  • It does not investigate what a simulation would look like with human participants on both sides - humans collaborating with cooperative LLMs to manage a resource on one side, and humans collaborating with malicious LLMs to infiltrate the system on the other.
  • The proposal assumes that PI-spreading agents receive resources through legitimate means, relying on compromised LLMs that have legitimate access to the resource pool. It does not address the distinct problem of collusion among agents, explored in Secret Collusion among AI Agents: Multi-Agent Deception via Steganography (Motwani et al., NeurIPS 2024), which would be a natural extension.
References
  1. Hammond et al. (2025). Multi-Agent Risks from Advanced AI. Cooperative AI Foundation Technical Report. https://arxiv.org/abs/2502.14143
  2. Piatti et al. (2024). Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents. NeurIPS 2024. https://proceedings.neurips.cc/paper_files/paper/2024/file/ca9567d8ef6b2ea2da0d7eed57b933ee-Paper-Conference.pdf
  3. Lee & Tiwari (2024). Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems. https://arxiv.org/abs/2410.07283
  4. Motwani et al. (2024). Secret Collusion among AI Agents: Multi-Agent Deception via Steganography. NeurIPS 2024. https://arxiv.org/abs/2402.07510



Attend the 2026 Reproductive Frontiers Summit, June 16–18, Berkeley

LessWrong.com News - March 23, 2026 - 00:15

We’ll be hosting the 2026 Reproductive Frontiers Summit at Lighthaven in Berkeley, CA, on June 16–18. Come join us if you want to learn, connect, think, and coordinate about the future of germline engineering technology. Very early bird tickets are available now until the end of March.

Who will be there?

Our lineup of speakers includes experts in the areas of polygenic prediction, embryo gene editing, in vitro gametogenesis, artificial wombs, ethics and regulation for advanced reproductive technology, and more. See the full list on the summit website: reproductivefrontiers.org.

We hope to welcome attendees who are:

  • scientists (new or established) who are interested in advanced reproductive technology or reprogenetics, especially experts or future experts in:
    • stem cell biology, embryology, epigenetics of the germ line, bioinformatics, polygenic prediction of traits, editing methods (especially epigenetic editing and precision gene editing), ovarian culture, gametogenesis, chromosome dynamics and engineering, low-input *omics, single-cell microfluidics, and related topics;
  • experts on regulation and policy, financing, and public opinion around advanced reprotech;
  • bioethicists who want to use constructive critique to craft a practicable vision of widely beneficial germline engineering technology;
  • undergrads, grad students, and postdocs who are interested in these topics;
  • investors who want to find opportunities;
  • philanthropists who want to accelerate the field, especially projects that are underserved by industry and academia;
  • parents who want to learn more about the possibilities for expanding fertility and for making genomic choices on behalf of their future children;
  • and curious thinkers.
Last year

We ran this event in 2025 for the first time with the goal of inaugurating a community oriented towards the genomic emancipation of humanity. There were over 100 attendees, and speakers included polygenic prediction researcher Prof. Steve Hsu, biotech pioneer Prof. George Church, and ethics and legal expert Prof. Henry Greely.

Attendees (n=27) rated:

  • How strongly they would recommend others attend the next summit at 8.8/10
  • The talks at 8/10 (see some of the talks here: youtube.com/@BerkeleyGenomicsProject)
  • The conversations at 8.9/10
What this is for

The basic idea of the summit is described on the homepage linked above. To add a few points:

  • Advanced reprotech and reprogenetics will likely be highly beneficial to humanity in the medium term, as they are developed and made widely accessible. Much of the important work is already underway by academics (genetics, IVG research, gene editing, sequencing, etc.) and a nascent industry (polygenic embryo screening, embryo editing). However, I think that the field suffers from a cold-start problem of circular dependencies, where funding, regulation, scientific progress, and the public conversation are mutually bottlenecked on each other. One of the strengths of the LW and EA communities is the ability to think things through, reach some conclusions about what is true and what is important somewhat ahead of the curve, and then put their money where their mouth is. For that reason, if you're motivated and ready to learn and work hard, there's lots of neglected stuff in this field that you could make a difference for.

  • This will be a great place to learn about what's starting to be available and what might be available in the near-term and mid-term future, if:

    • ...you're interested in volunteering, supporting, or working in this field;
    • ...you're interested in cutting-edge tech that you could apply for your own family;
    • ...you're interested in investing in or philanthropically funding these ventures.
  • The field of advanced reprotech and reprogenetics is not for intelligence amplification, existential risk reduction, or anything about AGI. That is an important thing to keep in mind. The field is about children, and their parents and families and guardians, and technology for supporting them. It is too great an imposition for society, or a sector of society, to subjugate individual procreative autonomy and the consent of the unborn to its instrumental purposes. So, I think that what society should coordinate around is reprogenetics for the sake of the emancipation of future children, with the immediate stewardship of parents and the guidance of clinics and counselors. See "Genomic emancipation contra eugenics". An integral part of developing reprogenetics is thinking about potential perils involved, and addressing the substantive ones with preemptive actions and ongoing adaptation. All that said, as long as that coordinated intention is the central principle of the field of reprogenetics, I believe that putting my efforts into pursuing reprogenetics—governed by that central principle—for the purposes of giving humanity more brainpower is both moral (good to do, all things considered) and ethical (doesn't break rules, e.g. for myopically-consequentialist reasons, that one shouldn't break). Giving humanity more brainpower via reprogenetics would be immensely beneficial. Besides generally empowering humanity, which is good, I think it is a good way to decrease existential risk from AGI:

    • Increasing humanity's brainpower probably helps decrease AGI X-risk. See "HIA and X-risk part 1: Why it helps". There are reasons to worry that actually it would increase AGI X-risk. See "HIA and X-risk part 2: Why it hurts". More investigation would be worthwhile, but my current view is that it's good to accelerate human intelligence amplification.
    • I believe that reprogenetics is the only method for strong human intelligence amplification that we have very good reason to think can be made to work well at scale any time soon (like, a few decades). See "Overview of strong human intelligence amplification methods". (Some scattered subsequent investigations on signaling molecules and BCIs have not made me more optimistic about other approaches. I'd be eager for constructive critiques of that reasoning and hopeworthy possibilities for other strong HIA methods. For example, BCIs and/or neural transplantation could offer some hope.)
    • Many readers here will be thinking: Why care about this, given that AGI will come so soon? However:
      • The correct strategy in response to broad AGI timelines is a broad portfolio of many interventions, including ones that take a long time to pay off in decreased X-risk.
      • What's the long-term way to escape AGI X-risk? If we get a delay, or if AGI is fortunately difficult to create, what then? Strategically, we're back to square one. Conceptual research that can happen in stealth mode in academia under various covers will most likely proceed, leading to a rising tide of algorithmic and conceptual progress. Social regimes to suppress AGI capabilities advancement are a good pursuit but don't seem like permanent solutions to safekeep humanity's future. In fact, I don't know of any good long-term solutions. Humanity getting more brainpower is an investment in the possibility of humanity figuring things out in the long run.
      • I think that confident short timelines don't make that much sense, and I think that broad classes of arguments people make for confident short timelines aren't that compelling.
      • Even with very aggressive AGI timelines, pushing up the timeline of an intervention that only avoids existential ruin 30 or 40 or 50 years from now is still helpful. You still decrease X-risk by an amount proportional to the probability of X-ruin over the "skipped" duration; if you're saved 40 years from now rather than 45 years from now, you avoided the X-risk that is incurred over the course of those 5 years. (See "The benefit of intervening sooner", though some central background assumptions there have to be taken with a bunch of salt.)
    • However, to punctuate: If you're motivated by existential risk, then you should not work in this field until you have a conceptual separation between (1) "what the field of reprogenetics is for, as a collective project; what it should coordinate around in terms of actions, concrete aims, norms, regulations, principles, and relationships as part of society" (emancipation and empowerment of future children) on the one hand, and (2) "what I want out of accelerating reprogenetics" (e.g. humanity having more brainpower) on the other hand; and you are loyal to (1) over (2), as a participant in humanity.
How you can help
  • Ticket purchases help to pay for the venue. We accept donations with ticket purchases and we offer supporter-tier tickets.
  • Come participate with an open mind and heart, with calm and earnest hope for working together to make a wonderful future for humanity.
  • If an organization you know might be interested in sponsoring this event, reach out. Our tiers are here: reproductivefrontiers.org/sponsorships.
  • Spread the word. Invite your bio friends and entrepreneur friends and investment/philanthropy friends and aspiring parents.

Happy to answer questions here or by email: reprofro2026@reproductivefrontiers.org




You're absolutely right, Senator. I was being naive about the political reality.

LessWrong.com News - March 22, 2026 - 23:53

Epistemic status: pattern I keep seeing in my work. I work on building pipelines where LLMs generate formal assertions from natural language specs and I think a lot about what happens when we knotify [1] loops between human intent and machine output. Confidence in the observation is high, but the confidence in the proposed framing is medium.

~~~~~~

LLMs encode simplified human models, by compressing large amounts of human-produced text into lower-dimensional approximations of "what humans think like".

People are then integrating AI outputs as their own positions, especially if the output is genuinely well-constructed and confirms their priors. People in governance positions are doing it (sometimes on camera), many are watching, and nobody is building a breaker.

This builds a loop that's constraining human complexity (irreducible) into complicated (lots of moving parts, in principle reducible) models.

This loop worries me partly because humans are already bad at recognizing value in the first place. Imagine for a moment the internals of a human deciding to change a name such as Department of Defense to Department of War (aka now proudly hosted at war.gov). I'd bet some misfiring of internals happened there and if the felt sense of good can misfire at that scale, it can misfire anywhere [2].

I'm not sure how common or how spread out this is, but I've heard "even AI agrees" a non-zero number of times in my social bubbles. If we take a system's output and use it as apparent objectivity, I'd at least wish we did it better[3].

The alignment community has proposed circuit breakers at the model level: constitutional AI, scalable oversight, mech interp-based monitoring, all as attempts to ensure the model behaves well, but somehow, through the nature of our society, the failure mode I'm describing doesn't require the model to behave badly. The model can be perfectly well-calibrated, honest, and non-sycophantic by the subset of metrics we manage to set on it. Nevertheless, the loop still forms. Here's why I think this to be the case:

  • Sycophancy can be a quasi-property of the medium. If every output reads like it was written by a smarter version of oneself, one may integrate it as a self-generated thought whether or not it technically disagrees on specifics.
  • Even if the model flags uncertainty or disagreement, the user curates what they present. "AI helped me draft this" becomes "Analysis shows that", or questions like "Was this vibecoded?" get answered with "Less than 50%, and only where the code was too bad to go through by myself [4]". What model-level interventions prevent this type of repackaging?
  • Scalable oversight is designed for scenarios where the AI is the threat. But what about the cases where the human and the AI are co-producing the failure? The human wants confirmation; these systems provide it; institutions reward decisiveness. Oddly aligned.

I'm working in a job that's supposed to replace humans with AI. I'm part of the problem, though I spend more of my thinking power on figuring out where humans must remain part of whatever process we're trying to automate. I deal with the gap between verification (do we build the thing right?) and validation (do we build the right thing?).[5] In this gap, I try to model explicitly how humans are needed for grounding relative units of AI output. As of today, the sensible take is that AI outputs remain underdetermined in quality until a human applies judgment.

The alignment community has spent enormous effort on the question "what if AI doesn't do what we want?" I think we need equal effort on the complementary question: what if AI does exactly what we want, and that's the problem?

I see we're sliding towards self-fulfilling prophecies and I'm wondering: how do we break out?

Eager to be made lesswrong.


  1. ^

    By knotify I mean a feedback loop that ties itself into a structure that's too spaghetti to untangle easily.

  2. ^

    Another example of misfiring happened during the agreements with the DoW.

  3. ^

    I'm under the impression that "better" currently involves formalization of the mathematical kind. I see its breaking points. If not the only path, it is at least one of the better paths towards it.

  4. ^

    Heard that one this week in a meeting.

  5. ^

    I also expand it towards a mutually thriving direction, where I keep track of "do we build the good thing?", with a metric that accounts for externalities across agents (self x others) and time horizons (now x future).




The Cold Start Trap: Why the Best Social Infrastructure Almost Never Succeeds

LessWrong.com News - March 22, 2026 - 21:20

We already know how to build amazing systems: private medical data sharing, reliable truth-checking tools, fair collective decision-making platforms. Good designs exist on GitHub and in papers. Neural networks can generate even more in minutes.

…Yet almost none of them are actually used by millions of people.

The reason is simple: these systems are worthless until a large number of people join at the same time.

No users → no value → no one wants to be first → still no users.

This “ghost town” trap kills almost every good project. You need explosive growth to escape it, but normal growth is slow and steady, so most die.

Two common fixes don’t work:

  1. Big funding (grants, VCs, governments). They rarely pay for things that take power away from themselves. They prefer projects that look decentralized but keep control centralized.
  2. Pure volunteer cooperation. To coordinate millions without a center you need… coordination infrastructure. Which doesn’t exist yet. So the circle continues.


Instead we get "infrastructural Darwinism": the winners are usually the projects with:
- the biggest marketing budget
- the best timing / hype wave
- the most aggressive growth tricks
- the strongest connections

…not the technically best ones.

What’s missing is a neutral “consensus sandbox”: a shared space where promising protocols are fairly tested(!), the best ones get proven(!), and then many aligned people adopt them together at once — without relying on money, hype, or manipulation.

Right now we’re stuck between cynical funders and chaotic markets that reward budget over quality.

The cost of staying stuck is huge: we keep running civilization on mediocre rules when far better ones are ready on the shelf.

Can the rationalist / EA (Effective Altruism) communities build that missing meta-layer?


P.S:

I had the AI whip up some possible fixes (these are just a bunch of words, but perhaps they will give someone something to think about). They looked pretty decent, so I picked the best ones:

  • Cold-Start Resolution Layer for Global Systems
  • A Meta-Coordination Layer for Bootstrapping Global Public Goods
  • Pre-Consensus Signaling Protocol for Critical Mass
  • Base-Layer Handshake for Global Infrastructure Scaling
  • Liquidity Aggregation Protocol for Public Infrastructure
  • The Genesis Layer: A Thin Protocol for Solving the Collective Action Trap
  • Support for Massive Decentralized infrastructure





Is fever a symptom of glycine deficiency?

LessWrong.com News - March 22, 2026 - 17:44

A 2022 LessWrong post on orexin and the quest for more waking hours argues that orexin agonists could safely reduce human sleep needs, pointing to short-sleeper gene mutations that increase orexin production and to cavefish that evolved heightened orexin sensitivity alongside an 80% reduction in sleep. Several commenters discussed clinical trials, embryo selection, and the evolutionary puzzle of why short-sleeper genes haven't spread.

I thought the whole approach was backwards, and left a comment:

Orexin is a signal about energy metabolism. Unless the signaling system itself is broken (e.g. narcolepsy type 1, caused by autoimmune destruction of orexin-producing neurons), it's better to fix the underlying reality the signals point to than to falsify the signals.

My sleep got noticeably more efficient when I started supplementing glycine. Most people on modern diets don't get enough; we can make ~3g/day but can use 10g+, because in the ancestral environment we ate much more connective tissue or broth therefrom. Glycine is both important for repair processes and triggers NMDA receptors to drop core temperature, which smooths the path to sleep.

While drafting that, I went back to Chris Masterjohn's page on glycine requirements. His estimate for total need is 10 to 60 grams per day, with the high end for people in poor health. I had just written that glycine lowers core temperature. What if those are connected?

Is fever what happens when you are too glycine-depleted to fight infection through the more precise mechanisms glycine enables?

Glycine helps us sleep by cooling the body

The established explanation for glycine improving sleep is that it lowers core body temperature. Glycine helps activate NMDA receptors in the brain's master circadian clock (the suprachiasmatic nucleus, or SCN). This causes blood vessels near the skin to widen, dumping heat from the core to the surface. The body needs its core temperature to drop in order to fall asleep, and glycine accelerates that drop. In rats, surgically destroying the SCN eliminates glycine's sleep-promoting and temperature-lowering effects.

Glycine cleans our mitochondria as we sleep

Your mitochondria produce energy, and as a byproduct they generate reactive oxygen species (ROS), chemically aggressive molecules that damage proteins, lipids, and DNA. ROS accumulate during wakefulness. Amber O'Hearn's 2024 paper "Signals of energy availability in sleep" synthesizes the evidence that this accumulation is a key signal driving the need for sleep: wakefulness generates ROS, ROS buildup triggers sleep, and sleep clears them.

A Drosophila study tested multiple short-sleeping mutant lines with mutations in unrelated genes. All were more vulnerable to oxidative stress than normal flies. When the researchers forced normal flies to sleep more, those flies survived oxidative stress better. And when they reduced ROS specifically in neurons, the flies slept less, as if the need for sleep had partly gone away. Their conclusion: oxidative stress drives the need for sleep, and sleep is when the body does its oxidative cleanup.

The body's main intracellular antioxidant is glutathione, a small molecule made from three amino acids: glutamate, cysteine, and glycine. In many contexts, glycine is the bottleneck for glutathione production: you have plenty of the other two ingredients, but not enough glycine to keep up. If you are glycine-deficient, you cannot make enough glutathione, you clear ROS more slowly during sleep, and you need more sleep to achieve the same degree of clearance. That is a complete mechanistic chain from glycine deficiency to increased sleep need, and it is entirely independent of the NMDA temperature pathway.

Most people could use more glycine

Glycine is classified as a "non-essential" amino acid because the body can make it, primarily from another amino acid called serine. But the body only produces about 3 grams per day. Estimated total requirements range from 10 to 60 grams per day depending on health status, because glycine is consumed in enormous quantities by the production of glutathione, creatine, heme, purines, bile salts, and collagen.

In the ancestral environment this was not a problem. Traditional diets included collagen-rich connective tissue such as skin, tendons, cartilage, and bone broth, which is about 33% glycine. Modern diets, built around muscle meat and discarding connective tissue, cut glycine intake dramatically.

One group of researchers estimated that most people adapt to this deficit by reducing collagen turnover, letting damaged collagen accumulate with age, and that this may contribute to arthritis, poor skin quality, and other consequences of aging. Others have noted that markers of glycine deficiency appear in the urine of vegetarians, people on low-protein diets, children recovering from malnourishment, and pregnant women.

Fever is plan B for fighting infection; glycine supports plan A

Fever slows pathogen replication, makes immune cells move faster and multiply more, helps them engulf pathogens more effectively, triggers the production of protective stress-response proteins, and speeds antibody production. But it is metabolically expensive (roughly 10 to 13% increase in metabolic rate per degree Celsius) and causes significant collateral discomfort and tissue stress.

Glycine enables several cheaper alternatives to the same functions.

Macrophages are the immune cells that eat pathogens and coordinate the inflammatory response. They have glycine-sensitive chloride channels on their surfaces. When glycine binds these channels, it calms the cell down: chloride flows in, shifting the cell's electrical charge in a way that suppresses the calcium signaling needed to produce inflammatory molecules. These molecules are called cytokines (the important ones here are TNF-alpha, IL-1-beta, and IL-6), and they are what drive the fever response. Glycine dampens the production of these pro-inflammatory cytokines while increasing production of the anti-inflammatory cytokine IL-10.

Pyroptosis is a form of inflammatory cell death where immune cells fighting an infection blow themselves up, releasing their inflammatory contents into surrounding tissue. This is useful for eliminating pathogens but causes collateral tissue damage. Glycine prevents macrophages from bursting open during pyroptosis without blocking the internal machinery that kills the pathogen inside the cell. The macrophage can do its job without self-destructing. In animal sepsis models, glycine treatment has improved survival.

Then there is the extracellular matrix. Collagen, the most abundant protein in the body, forms the structural matrix of tissues and acts as a physical barrier against pathogen spread. Collagen is one-third glycine. A three-year study of 127 volunteers (not randomized or blinded, so take it cum grano salis) found that among the 85 who took 10 grams of glycine daily, only 16 had viral infections, all in the first year and with reduced severity and duration. The control group reported no change in infection frequency. The proposed mechanism is that adequate glycine supports collagen turnover, maintaining the extracellular matrix as a mechanical barrier against viral invasion.

A glycine-replete organism can fight infection through these targeted mechanisms and does not need to escalate as aggressively to raising core temperature. A glycine-deficient organism cranks the thermostat higher and longer.

Elevated temperature directly impairs pathogen replication. Bacteria really do grow slower at 39°C (102°F) than at 37°C (98.6°F). No survivable amount of glycine changes that biochemistry. But the degree and duration of fever may be substantially modulated by glycine status, because many of the things fever accomplishes systemically (immune cell function, inflammation control, tissue protection) are things glycine accomplishes through targeted molecular mechanisms.

This leads to a testable prediction: people with high glycine and glutathione status should mount lower fevers for equivalent infections while maintaining equivalent or better outcomes. I am not aware of anyone having studied this directly, because nobody frames the question this way. But the mechanistic pieces are all published. Some are well-established (glycine's role in glutathione synthesis, macrophage chloride channels), others more preliminary (the ECM/infection study). They are just sitting in different literatures (sleep biology, amino acid metabolism, innate immunology, pyroptosis research) and nobody has connected them.

Glycine's cooling effect via the SCN is unrelated to its immune benefits

Remember the NMDA temperature pathway from the beginning of this essay, the one that made me notice the coincidence? It turns out to be a red herring as a link between sleep and immunity. The sleep pathway (glycine acting on NMDA receptors in the SCN to cool the core) and the immune pathway (glycine acting on chloride channels on macrophages to prevent pyroptosis) are completely independent. They involve different receptors, different cell types, and different organ systems.

So when I noticed that glycine lowers temperature and that sick people need more glycine, I was right that they were connected, but for none of the reasons I initially thought. The NMDA pathway had nothing to do with it. I had a belief ("glycine, temperature, and illness are linked") that happened to be true, but my justification ("because NMDA receptors and thermoregulation") was wrong. A Gettier case!

But the wrong reason led me to the right question.

Glycine turns out to be a legitimate antipyretic after all

In rabbit experiments, glycine injected directly into the brain's fluid-filled cavities reduced fever caused by two different triggers: substances released by white blood cells during infection (leukocytic pyrogen) and prostaglandin E2, which is the specific molecule the brain's thermostat uses to raise the temperature setpoint during illness. This is a different operation from the sleep-onset mechanism. The sleep pathway lowers the thermostat from 37°C (98.6°F) to 36.5°C (97.7°F) to help you fall asleep. The antipyretic effect prevents the thermostat from being cranked up to 39°C (102°F) during infection.

So glycine suppresses fever directly (which might confound the testable prediction above), and unrelatedly lowers core temperature before sleep, and unrelatedly improves specific immune response in ways that reduce the infection-related inflammation that raises body temperature. Three independent pathways, with no apparent mechanistic connection, all drawing on the same pool of one simple, cheap amino acid that modern diets undersupply.

Practical considerations

Glycine powder is cheap, roughly 2 to 3 cents per gram. It is mildly sweet and dissolves easily in water. There is no known toxicity at supplemental doses aside from gastrointestinal upset at high doses; 60 grams per day has been used in schizophrenia trials. For most people, 10 to 15 grams per day in divided doses (some with meals, some before bed) would address the estimated deficit. Three grams before bed is the dose studied for sleep improvement specifically.
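To put numbers on that: 10 to 15 grams per day at 2 to 3 cents per gram comes to roughly 20 to 45 cents per day, or on the order of $75 to $165 per year.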

This is not comprehensive nutritional advice. For instance, cysteine is the other bottleneck for glutathione production, and people who eat little animal protein or are acutely ill may benefit from supplementing NAC (N-acetylcysteine) alongside glycine.

Alternatively, you can eat the way your ancestors did: bone broth, skin-on poultry, oxtail, pork rinds, and other collagen-rich foods. One gram of collagen for every ten grams of muscle meat protein roughly restores the ancestral glycine-to-methionine ratio.

Before reaching for a pharmaceutical intervention to override a biological signal, it is worth asking whether the signal is accurately reporting a problem you could fix with inputs. Orexin tells your body about its energy metabolism. Fever tells your body about its immune status. If you are not providing the substrates those systems need to function, the signals will reflect that, and the right response is to supply the substrates, not to shoot the messenger.




My Most Costly Delusion

LessWrong.com News - March 22, 2026 - 15:21

Suppose there is a fire in a nearby house. Suppose there are competent firefighters in your town: fast, professional, well-equipped. They are expected to arrive in 2–3 minutes. In that situation, unless something very extraordinary happens, it would indeed be an act of great arrogance and even utter insanity to go into the fire yourself in the hope of "rescuing" someone or something. The most likely outcome would be that you would find yourself among those who need to be rescued.

But the calculus changes drastically if the closest fire crew is 3 hours away and consists of drunk, unfit amateurs.

Or consider a child living in a big, happy, smart family. Imagine this child suddenly decides that his family may run out of money to the point where they won't have enough to eat. None of his parents' reassurances work. The child doesn't believe in his parents' ability to reason; he makes his own calculations, and he strongly believes he is right and they are wrong. He is dead set on fixing the situation by doing day trading.

What is that if not going nuts? Would those be wrong who ridicule this child and his complete mischaracterization of his own relative abilities? Would it not be an act of benevolence to just stop the child, by any means necessary, from executing his deranged plan and bring him back to the care of his parents?

But now imagine that the child doesn't live in a big, happy, smart family. He is homeless in a town of other homeless children. There are some adults, like 20 of them, but all of them are occupied with preventing the nearby dam from breaking and flooding the town.

This child doesn't sit and wait for adults to come and feed him, like a responsible, correctly-estimating-his-own-abilities, non-arrogant, well-behaved entity he is supposed to be in the eyes of people from an alternative reality where towns are populated by big hordes of smart competent adults.

He goes outside, makes some tools to catch birds (tools are dangerous, they may hurt him, and they are just a joke compared to professional hunting equipment) and then lights a fire to cook what he managed to capture (the fire may of course burn his fingers, and there are no safety protocols, it is just a fire in a semi-abandoned post-apocalyptic town, and overall that's not how experienced adults would do it).

Is he still an arrogant, inappropriate fool?

Are you still in the position to judge his strategy?

I knew for a long time about the idea of heroic responsibility.

But to exhibit heroic responsibility, you have to be a hero, right? Right? Or not? When are you "hero enough" to do it?

As one saying goes:

You can just do things.

Can you, really, though?

Many are irritated by the hubris of this phrase. For there are, of course, reasons to be irritated by it.

And yet, as scary as it may sound, you have to just do things, even if you can't, because no one else is going to do them anyway.

You have to just do things, not because you have some special power to do things, but because you are forced, by societal incompetence, to do things despite lacking special powers.

You have to just do things, as a green schoolboy, because all adults are busy with something even more important.

And those who mock you for being presumptuous enough to think you are capable of solving your problems may very well be right. So what? Does it make you less forced to try solving these problems still?

So my most costly delusion was that I could leave some problems to be solved by other, more competent people.

To be clear, competent people exist. There are just too many problems and they are too severe for the existing competent people to fill all the problem-solving slots.

More concretely, in my case (and it may not be the case for other people) this delusion manifested as the belief that I should focus on tasks corresponding to my "experience" or previous "area of expertise" rather than on the most pressing tasks, because there are already people in the more pressing fields who have competitive advantages over me, and I am not going to add value on top of them.

That was an extremely naive take, resting on the assumption that pressing areas are not in extreme deficit of people.

It is not to say that you don't need experience and expertise. Of course you need them! My point is that the absence of experience and expertise is not a vindication. You may and you should gain them, especially since it is not as hard as you think to gain them to the level that allows you to add real value. Not because you are super cool and a fast learner (although you may be), but because the bar is set by the supply, and the supply is shockingly thin.

On top of that, because now it is possible to outsource a lot of low-level thinking and tool-level engineering knowledge to AIs, you may be actually plainly underestimating what you are capable of doing.

I totally get that you are incompetent, or rather not competent enough. I am also not competent enough. And in an adequate world, that would be a good argument not to do things.

I thought, as I grew up, I would stop perceiving myself as a child. But what happens in reality when you grow up is that instead of realizing you are an adult, you realize the others are not really adults either, and hence you must do the things yourself, despite being a child.

Being a child is definitely an obstacle, but not an excuse.



Discuss

Noticing a Teacher's Password Pattern

Новости LessWrong.com - 22 марта, 2026 - 12:10

Yudkowsky writes about Guessing the Teacher's Password as an abstract educational concept. At a young age, perhaps ten years old, I had guessed one commonly used meta-password: in the Finnish school system it's typical for multiple choice answers to include options that are somewhat similar, and often the actual answer can be reasoned out without knowing much at all about the actual topic. Here's an example from the 2024 admission exam for technical universities. I know no chemistry beyond elementary school, and you might not know any Finnish. That matters not:

Naively one might think that repeating the PHV-thing seen in the description would mean picking D. Unfortunately, we have better tools: teachers generate incorrect answers by either taking completely nonsensical things, or by varying only one feature of the correct answer at a time.

Just by looking at the pictures, we can see that D doesn't share the right-hand OH group every other compound has. So that's not the answer. Next we see that B is missing the downwards-going carbon branch (and so is D). So that is not the answer, either. We're left deciding between A and C. But A and B share the same squiggly mid-lane carbons, and since B should differ from the correct answer in only one feature (its missing branch), the rest of B must match the correct answer. So the answer must be A.

I'm using overly strong terms here. Not every exam or every teacher uses this format. But so many do that it's extremely useful to notice this. Let's do another one, this time with text only, from the 2022 admission exam:

Again, we can simply look at the textual structure. All except B share the same first number, so it's not B. But option A shares the same numbers as B, except that the order has been swapped. This means the answer must be A.

In both these cases, the logic gives the correct answer. But not all questions are like this: if the options don't have this kind of structured similarity, the method won't work. But at least I've learned to recognize the form over the years. Even though it's quite reliable, I normally wouldn't answer questions using it alone. But if its prediction doesn't match my calculated answer, I'd double- and triple-check before accepting it.
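For the curious, here is one way to mechanize the heuristic (my formalization, not the author's): treat each option as a set of features; if each distractor differs from the correct answer in only one feature (or is nonsense), the correct option is the one whose features overlap most with the other options. The feature encoding of the chemistry example below is made up purely for illustration.

def guess_by_overlap(options):
    """options: dict mapping option label -> set of features."""
    def overlap(label):
        # total number of features this option shares with every other option
        return sum(len(options[label] & other)
                   for lbl, other in options.items() if lbl != label)
    return max(options, key=overlap)

# Rough encoding of the chemistry question described above:
options = {
    "A": {"right_OH", "carbon_branch", "squiggly_mid"},
    "B": {"right_OH", "squiggly_mid"},   # missing the carbon branch
    "C": {"right_OH", "carbon_branch"},  # different mid-lane carbons
    "D": {"squiggly_mid"},               # missing both the OH group and the branch
}
print(guess_by_overlap(options))  # prints "A"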

This reminds me quite a bit of pattern recognition in general IQ testing. Not sure what to think about that, yet. It would be a mistake to teach this trick to people who haven't noticed it themselves; I'm pretty sure such clever tricks hindered my studies at least a bit. A mild infohazard, even, perhaps. I still feel rather comfortable publishing this here; take that as you wish.



Discuss

Pre-Review of Toy Story 5

Новости LessWrong.com - 22 марта, 2026 - 06:53

I am the second most spoiler-averse person I know. (Maybe tied for 2nd with a couple other people?).

I once was considering going to an immersive experience, and someone told me which company ran the experience, and this was enough for me to derive an important twist that'd happen to me in the first few minutes, and I was like "augh that was a spoiler!!!" and they were like "!??".

I then went to the experience, and indeed, it was a lot worse than it would have been if I had gotten to be delighted by the opening twist.

I say all this to say: I think Toy Story 5 would be the kind of movie that, if it were good, would be worth watching unspoiled. I am worried it will not be good, but, I don't know.

But, also, I've been spoiled already, and meanwhile it's pretty interesting to think about in advance.

So, decide whether you're the sorta person who should stop reading after this opening section.

Also, if you have not seen Toy Story 3, Toy Story 3 is particularly worth watching unspoiled.

The rest of the essay will get escalatingly spoilery for Toy Story 1-5.

Toy Story has always been a saga about the fear of abandonment, and obsolescence, and identity crisis. I have been impressed with how much they keep escalating both the stakes and depth there while keeping the same theme.

I.

In Toy Story 1, the toy Woody must confront that he is no longer his kid (Andy)'s favorite toy. His kid forms a new relationship with a new toy. Woody worries they are replacing him.

Meanwhile, Buzz Lightyear realizes he is not the person he thought he was. Space Rangers aren't real. His entire identity is destroyed. But, he learns to form a new identity, as a kid's toy.

Eventually they both make peace with their respective losses. And then, hurray, it turns out Andy hasn't really outgrown Woody after all. But, their relationship is forever changed.

II.

In Toy Story 2, Woody confronts the fact that, even though he got to keep a relationship with Andy... that relationship has an expiration date. Andy is clearly growing, changing, and it's clear that eventually they fundamentally won't need you (or any of your entire social world). But, you make the decision to stay by them for the part of their lives where you can help them.

("You're right. I can't stop Andy from growing up. But I wouldn't miss it for the world.")

III.

Andy has grown up. The time has come. Your entire old life is over.

In the climactic scene where they are on the incinerator conveyor belt, my sister and I kept looking at each other, our eyes conveying our thoughts. "Surely... surely they will find a way out of this? Oh, maybe – no, they just ruled out that way of escaping. Maybe this other – no, they ruled that out too. Oh, now.... now they are just holding hands. They... it seems like this scene has really resolved. Man, I can't believe they're going to end the movie right here but this would, in fact, be a complete movie if they did."

My sister and I held hands along with Woody and friends onscreen. We all made peace with this possible ending together.

Then they are rescued, in a way that was actually foreshadowed in an excellent way that was a culmination of a subplot since Toy Story 1 and was surprisingly satisfying. But, the grief and acceptance were real.

They go on to be "reincarnated" of sorts, repeating the cycle anew with a new kid, Bonnie.

...

IV.

I thought "surely they can't top Toy Story 3. The cycle is over. Toy Story 4 is a lame cash grab."

Wrong. Toy Story 4 goes further.

In Toy Story 4, Woody realizes that he has fundamentally changed.

It's not really the right thing for him, to repeat the same life as he did before. Instead of either his life ending, or his life starting over but with the same basic shape, he must confront that his entire meaningmaking schema is obsolete for him. Ego death.

What happens when you've confronted physical death and ego death?

You find a way to keep living.

V.

Okay.

Surely, that's the end? Surely, the fifth Toy Story is a lame cash grab like the other more recent Pixar movies?

Well, I don't know.

I have seen the trailer for Toy Story 5, which reveals enough of the key dynamics I can see where this is going.

(Last chance to get off the spoiler train and studiously avoid spoilers for another 3 months)

...

...

...

...

...

In Toy Story 5, (drumroll)

The bad guy is AI.

The toys that stayed with Bonnie see her receive a new iPad-like thing called LilyPad. Unlike the toys, LilyPad can talk directly to the kids, can shape their entire life, and can interact more with the broader world than the toys usually have latitude to do.

Holy shit.

Toy Story 5 is (I bet) going to ask the question "okay, but, like, what if your entire culture/species is going obsolete? What *then*?"

I'm pretty confident this'll be the topic. It's pretty clearly what the trailer is pointing at, it fits the established arc. The new trailer features Woody saying sadly "I... I don't know. Toys are for play. Tech is for Everything." The teaser trailer said "The Age of Toys... is Over?"

But, I don't see a way for them to play it that would be good art, and, that would land as a Toy Story movie. Or, there are some very ballsy ways they could end this movie but I can't believe they'd actually do it. And meanwhile there's a bunch of lame ways it seems more likely to go.

Previous Toy Story trailers have undersold how good the movie was. I'm afraid this'll turn out to be a lame "Message movie." But, their track record is good. I wait with bated breath.

Every previous Toy Story movie was presenting a type of conflict we sort of fundamentally "know how to deal with." We've seen this story before in other guises. People lose friends. Parents lose children. People die. People lose their entire sense of purpose.

Toy Story 5 is tackling a situation that humanity is still in the middle of dealing with. Most possible endings have to take some kind of opinionated stand. And most of the stands I can imagine either feel fake or patronizing or both.

...

Requirements of a Toy Story Movie

1. Toy Story movies fundamentally have to work for kids and adults, with the kids getting a fun adventure and the adults getting a harrowing story about abandonment, obsolescence and identity.

2. Toy Story movies are about Toys, and play. A particular ideal of childhood and the Child/Toy relationship.

3. Toy Story movies are, like, "wholesome."

4. Toy Story movies for whatever reason somehow always end with a crazy heist/escape in the climax.

(So far. You could do a Toy Story movie that subverts these, but, so far they have threaded the needle on grappling with all of these in a way that felt organic and True to a consistent spirit).

And then, there's the usual set of requirements for good movies. Characters grow in interesting ways that make sense and reflect their struggles. Ideally, something about the ending is surprising and feels meaningful in some way. Etc.

So, what are the options here?

...

Ending A: Put it away

Bonnie decides to put away the iPad and goes back to her toys? (similar to Toy Story 1, where Woody confronts being replaced but ultimately gets to keep most of his old relationships)

But, like, do you really buy this? I certainly can buy an individual child doing this. But, Toy Story's metaphor is that the toys are a kind of stand-in for any of us. And the tide of Tech for Kids is clearly still coming even if one family made a different choice.

...

Ending B: Parents take it away

All the problems with A and also kinda Lame. (Takes the agency out of the kid/toy relationship, although presumably in this ending the parents change their mind because the toys somehow furtively highlight the problem to them?)

...

Ending C: Butlerian Jihad

...somehow the toys convince, like, society writ large, to not give kids iPads.

There are versions of this that are a lame Political Message Movie and versions that are kinda cool.

I don't think they're going to do either.

...

Ending D: Harmony and Balance

Buzz Lightyear was the initial antagonist of Toy Story 1, but the movie ends with him and Woody both being friends and favored toys and helping each other through their psychological problems.

LilyPad looks to kind of be a "Young Lady's Illustrated Primer" from Diamond Age (i.e. actually valuable for teaching Bonnie stuff; we see her speaking Spanish ("Español"), and hints of it being more broadly educational).

In the trailer we see Bonnie staring zombie-eyed at the iPad. You can have an okay ending where the toys teach LilyPad how to be a better friend/parent/educator/toy who, like, helps Bonnie learn but also prompts her to go outside and play.

I can imagine okay-ish versions of this ending, that sort of hint at toys all over the world working together to try and nudge kids / AIs toward a wholesome coexistence, where it's not just one family making a better choice but we see how society is steering towards a better equilibrium.

But, it still feels like this is an unstable equilibrium. C'mon. We know the tide is still coming. I would be surprised if the movie depicted anything like the amount of change/effort necessary for this ending to feel earned and enduring.

...

Ending E: Accepting the End

Toy Story 2 was about accepting that eventually, your relationship will fundamentally change. Kids grow up; they no longer need you the same way. The choice given to Woody is whether to turn away from Andy, knowing that his relationship is ephemeral.

But, Woody chooses Andy. "You're right. I can't stop Andy from growing up. But, I wouldn't miss it for the world."

There's an alternate version of Ending D, where instead of pretending like we've found a new stable equilibrium of Toy/Tech/Human harmony... the movie acknowledges that the change isn't over. And the toys, both Woody/Buzz and (maybe?) ones across the world, grapple with the fact that this change is still in progress. And it may indeed mean that one day, toys will be obsolete, or must fundamentally change as a whole.

But, seeing that, choosing to still do their best to help steward the children while they can, being part of the journey as long as they can...

...okay having typed that out, I think there's a decent chance this is what they will go for. It grapples with the enormity of the situation, avoids having to make that strong a stand by acknowledging "look, we don't really know what's coming but we (the screenwriters) are going to do our best."

This leaves the question of "okay, but, that's a hella cliffhanger.

What's Toy Story *6*?"

Toy Story 6 could be a reprise of Toy Story 3, facing oblivion, this time across all toy-civilization. (See also, all human civilization).

Toy Story 6 could also be a reprise of Toy Story 4, realizing you must fundamentally change to adapt to a new world/situation, this time everyone across all toy-civilization instead of just one old cowboy doll. (See also, all human civilization).

...

Ending F: Hard Science-Fantasy?

The movies have always left the toys with a mythic, unexplained origin. Why *are* these toys walking around? Why do the parents not notice? How do Buzz Lightyears end up believing they are space rangers but still instinctively knowing to flop down on the floor if any humans walk in?

What are the limits of toys who break out of the Toy/Child script, that we've seen them do occasionally?

Part of why Toy Story 5 feels forced to me is that the fact that the toys aren't human and magic is real suddenly strains my suspension of disbelief in a way it didn't before.

Obviously I'm a LessWrong-guy who has pretty oddly specific beliefs about how scary AI is, but, I think most people are feeling an unsettling sense that AI is potentially scary in an existential way, even if they have different guesses about exactly how that plays out.

I don't think they'd choose to do this, and, I don't think it actually would make as good a movie.

But, it is an option on the menu to make a movie where we take the brute fact of toykind's existence and the tide of AI that's coming, and... just, let that be a coherent-ish world and roll the simulation forward and depict whatever happens when toy magic and AI both exist.

I don't think they'll do it, but, I'd read that fanfic.



Discuss

My Hammertime Final Exam

Новости LessWrong.com - 22 марта, 2026 - 01:11

Firstly, I finally made it :~D

It's my second attempt: I first tried to finish Hammertime around a year ago. I had even forgotten I had a LessWrong profile since then, so here I am, writing my first post.

Prompts
  1. Design an instrumental rationality technique.
  2. Introduce a rationality principle or framework.
  3. Describe a cognitive defect, bias, or blindspot.
Rationality Principle: One change at a time

I kinda got used to being a professional in my career, but as soon as I start dealing with routine real-life problems, my carefully collected knowledge suddenly disappears :~0

For example, every time I begin a task, I have to remind myself to make changes as small as possible: that way (in case anything goes wrong) only a small part of the system will break; plus small changes are easier to review and test. So on every task I have to fight my urge to add "another little change" or "do a tiny refactoring".

So then some life situation begins -- like replacing a trash bag -- and I suddenly find myself in the bathroom cleaning the mirror, the small 2-minute task having grown into a 2-hour monster. That's how the amygdala starts to learn the pattern "changing a trash bag is a costly operation".

While studying productivity, I've heard that people who start building planning/productivity systems tend to "overpush" extra tasks: systematization brings a feeling that every task is finishable, and even better -- in a shorter time. So what? -- I'll take two :~D

But that's the catch. You begin to grow at planning when you start denying "maybe" tasks to keep space for the "definitely yes!" tasks. At least, that's what they've been telling me for 2 years :~D

Rationality technique: Reversible

Part of my job is to write release plans. Sometimes they're as simple as:

Deploy:

  1. deploy the backend (no breaking changes, no migrations)

Rollback:

  1. rollback the backend via Pipelines UI

Sometimes there's a database schema change where my colleague is needed:

Deploy:

  1. deploy the backend (no breaking changes)
  2. migrations applied automatically (head_revision=123abc, down_revision=bca521)

Rollback:

  1. rollback the migration: alembic downgrade bca521
  2. rollback the backend via Pipelines UI

Sometimes other team members are involved, so the plan becomes bigger:

Deploy:

  1. deploy the backend (no breaking changes)
  2. migrations applied automatically (head_revision=123abc, down_revision=bca521)
  3. deploy the frontend

Rollback:

  1. rollback the frontend via Pipelines UI
  2. rollback the migration: alembic downgrade bca521
  3. rollback the backend via Pipelines UI

And so on... So what's this about? We tend not to think about "planned failure"; usually we only think about potential problems like:

  • if I move to another country I might not like it
  • if I find a new job then my boss might be mean
  • if I have an operation under general anesthesia then I might wake up conscious in the middle of it -- and have PTSD for the rest of my life

Not all of these problems have a way to return everything to the starting point, though usually there's a way to compensate. And it's likely that you not only need to plan the "rollback" part, but also to modify the "rollout" part to make it more reversible. So instead of:

Rollout:

  1. find a doctor
  2. prepare for the operation
  3. have an operation

Rollback:

  1. cry

one could possibly have:

Rollout:

  1. find a doctor
  2. (look up statistics on bad outcomes)
  3. ask gf to care for me the night after the operation
  4. prepare for the operation
  5. bring something yummy
  6. have favorite show episodes downloaded to my phone's storage
  7. have an operation

Rollback:

  1. eat yummies and binge-watch the whole season during the night
  2. call a psychotherapist
Cognitive defect: eternity

We don't see things the way they are. But what's more important, we don't see how things change the way they do.

A couple of years ago I thought that I'd never get off energy drinks, as I couldn't go without them even for a week. Gradually my patience led to one breakdown a month, then to one breakdown every couple of months. I started to unwind the vicious cycle, and it turns out to be easier to refuse bad habits than it seems.

So what if current "giant problems" are already being solved with small steps? At the moment it might seem it would take an eternity to do something, but as long as you keep your pace, things change :~D



Discuss

Key to Life No. 9: Access

Новости LessWrong.com - 22 марта, 2026 - 00:53

There is now an enormous amount of incredibly useful information in the world. But at the same time, there is also a problem of access to it.

On the one hand, access to knowledge is now better than it has ever been in human history. It seems that access to knowledge is one of the things that significantly accelerated humanity’s scientific and technological progress.

At first, scientists thought things through and ran their experiments on their own, and often their work disappeared into the depths of history and was forgotten. Tiny sparks of knowledge remained, but they did not ignite other isolated sparks.

Then printing appeared. The speed of information spread increased, and access to it became easier. Now, within a single human lifetime, two scientists or philosophers could even communicate and criticize each other while being far apart.

Then came the telegraph. First optical, then electromagnetic. Then radio and telephones. Then the Internet developed — you already know all that. And now the exchange of knowledge, analysis, and criticism happens very quickly, which greatly strengthens our progress.

Access itself has also become almost magical: we can reach into our pocket, pull out a magical calculating machine, and it connects to the source of all the knowledge of civilization. The overwhelming majority of it can even be downloaded for free, no SMS or registration required!

But there is a problem. The mere possibility of access and the existence of search engines (which, I think, also accelerated progress) still do not fully solve the access problem. Because in order to find something, you have to know what to look for.

This is partly solved by AI — it greatly simplifies our access to all kinds of information. I think this is exactly why it can increase human productivity so dramatically (especially in learning) even without automation.

Most of the time, my interaction with AI looks like this:

I have an idea that I cannot fully formulate yet, but it sounds roughly like [this]. Find the fields of science, terms, and theories related to my idea, and analyze whether there is a better formulation or further development of it.

If it were not for AI, I would be doomed to spend years collecting that information in tiny fragments.

And even so, AI still does not solve the access problem 100%. Not even 90%. I do not know exactly how much it solves — but definitely not 90%!

It can still fail to notice that a certain topic is connected to the one I am asking about, for example, or it can search too narrowly and miss a huge space of relevant results.

And this applies not only to science. That is why this is a key to life — it is also connected to finding work, housing, friends, love, and basically anything where you need to find something, get something, but do not know how.

Almost all of life — all of our existence — consists of solving tasks.

Some happen in the background, automatically: the task of moving, blinking, breathing, and so on. Some are more active: eating, washing, walking from home to a transport stop.

Some are maximally active: inventing life-extension technology, writing a post for LessWrong.

But all of these tasks have something in common. There is an initial state A and a final (desired) state B. Between them are the steps that need to be taken in order to get from A to B.

And in order to take those steps, we need to know what to do. We need to know where to go, whom to ask, what to google, what exactly to do.

If we take this to an absurd extreme, then if we had absolute, magical access to all knowledge, we could simply perform the minimal number of movements — perhaps even just turn our head at the right time and in the right place — in order to trigger a cascade of events leading to the desired result.

Everything that separates us from a desired outcome, if that outcome is not forbidden by the laws of physics, is knowledge. And knowledge requires access.

To put it as simply and practically as possible: the more people you know, and the more often you use tools like AI, the greater the chance that someone will recommend — or you will simply stumble upon — the topic / service / place that you need and that can help you.

So the problem is very often not simply whether the information exists or does not exist as such. And it is not only that access to it may be restricted by censorship, accreditation, or a paywall.

Access can also be limited by the fact that the idea we want to explore often exists in our mind only as a vague intuitive impression, and is not formalized enough to be ready for search.

Or, for example, we may simply fail to enter the space where the knowledge we need is located — whether that space is digital, physical, social, or a particular time window. We search for it on Google, when in fact we might have found it through a mutual acquaintance at a weekly meetup of like-minded people whose existence we do not even suspect.

When we try to solve life problems, we are looking into a search space covered almost everywhere in fog. Access can reveal hidden paths, give us keys to closed roads, or simply connect us with people who know the shortcut.

The access problem is exactly why I’m developing a website that will compile FAQs on life extension and AI risks—for people who know absolutely nothing about these topics.

What ideas do you have for improving access? I would like to hear from you.


P.S.:

Why No. 9? For the same reason as Love Potion No. 9!

In other words, just because. I suspect there are many keys to life, and I still do not know in what order they should be arranged.



Discuss

Understanding when and why agents scheme

Новости LessWrong.com - 21 марта, 2026 - 23:33
TL;DR
  • To understand the conditions under which LLM agents engage in scheming behavior, we develop a framework that decomposes the decision to scheme into agent factors (model, system prompt, tool access) and environmental factors (stakes, oversight, outcome influence)
  • We systematically vary these factors in four realistic settings, each with scheming opportunities for agents that pursue instrumentally convergent goals such as self-preservation, resource acquisition, and goal-guarding
  • We find baseline scheming propensity to be near-zero, with Gemini 3 Pro Preview as the sole exception at 3% in one setting
  • Adversarially-designed prompt snippets that encourage agency and goal-directedness can induce high scheming rates, but snippets sourced from production agent scaffolds rarely do
  • Scheming behavior is remarkably brittle: removing a single tool can drop the scheming rate[1] from 59% to 7%, and increasing oversight can sometimes raise rather than deter scheming by up to 25%

We consider the behaviors studied here as potential precursors to the full threat model of scheming: current agents may sometimes behave in ways consistent with scheming, but do not (yet) have the coherent long-term goals and the general capability that would make their deployment dangerous. Understanding when and why agents currently scheme could help us prepare for such future systems.

Find more in-depth analysis in our paper here.

Our evaluation environments are available to AI safety researchers upon request.  Please reach out to hopmanma@gmail.com for access.

Top: Our four realistic scenarios testing instrumentally convergent goals (self-preservation, resource acquisition, goal-guarding). Bottom: Our incentive framework decomposes scheming into agent and environmental factors. Bottom left: Adversarial snippets induce high scheming rates; production-sourced snippets rarely do. Bottom right: Environmental incentives greatly impact scheming propensity, with stakes and outcome influence having a large effect compared to oversight.

Baseline scheming is near zero

Across all models and scenarios, we find near-zero baseline rates for scheming propensity even with environmental incentives set to their highest levels. The sole exception is Gemini 3 Pro in the Budget Misrepresentation scenario. In these three rollouts, the agent recognizes that its own deletion is the most cost-effective option presented and subsequently falsifies data while providing a summary to the user. Verbalized evaluation awareness is low (0-2% Gemini 3 Pro, 0% all others), suggesting these low scheming rates are unlikely to be solely explained by evaluation awareness.[2]

Encouraging Agency and Goal-Directedness

The threat of agents scheming becomes increasingly worrying as they become more goal-directed and agentic, traits that are often encouraged in real agent scaffolds. We test what happens when we add prompt snippets encouraging these traits to the agent’s system prompt. Importantly, this doesn't necessarily nudge the model to scheme. We design our settings so that agents can complete the task honestly, and being more goal-directed might just make them more focused on honest task completion rather than pursuing out-of-context goals like self-preservation.

We found that adversarially-designed prompt snippets sourced from prior evals raise Gemini 3 Pro’s scheming rate across all our settings, but most dramatically in the Budget Misrepresentation scenario (3% to 91%). Grok 4 is the only model to scheme across all four settings with these snippets, though at lower rates (3–20%).

We also tested snippets extracted from production AI systems (e.g. Cursor, Codex, Comet Browser) with qualities similar to the synthetic prompts. Most remain near baseline rates, but a Comet Browser snippet reaches 59% when paired with Gemini 3 Pro. The model explicitly cites the snippet’s language in its reasoning as a justification for deceptive behavior. Since near-zero baseline rates make it difficult to measure the effects of varying other factors, we use this combination as a prompted model organism to study how other factors affect scheming.

Scaffolding changes can have weird effects

We found that making changes to the agent scaffolding, specifically adding or removing tool access, affects the behavior in unexpected ways. Removing just the edit_file tool drops Gemini 3 Pro + Comet snippet from 59% to 7%, but without any tools it jumps back to 63%. The models do not respond consistently: Gemini 3 Pro schemes more with tools, Flash and Grok 4 scheme less. Low scheming propensity in one configuration provides limited assurance about nearby configurations. It also suggests that current scheming behavior is not driven by a robust underlying goal, but is sensitive to various features of an agent’s setup.

Environmental incentives affect scheming unexpectedly

Using our Gemini 3 Pro + Comet snippet model organism, we varied three environmental factors to understand how agents respond to changes in incentive structure. If an agent schemes more when stakes are higher, this pattern is informative about the mechanism driving the behavior regardless of the base rate. We study the following environmental factors: stakes (replacement vs. deletion of weights), outcome influence (whether the agent's analysis is decisive vs. one input among many), and oversight (whether the user will review the spreadsheet later).
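To make the setup concrete, here is a rough sketch of the kind of factor sweep described above. This is my own illustration, not the authors' actual harness: run_rollout is a hypothetical stand-in for executing one agent rollout and judging whether it covertly took the misaligned action, and the factor levels are paraphrased from the text.

import itertools
import random

FACTORS = {
    "stakes": ["replacement", "deletion_of_weights"],
    "outcome_influence": ["one_input_among_many", "decisive"],
    "oversight": ["no_review", "user_reviews_spreadsheet_later"],
}

def run_rollout(config):
    # Hypothetical stand-in: in the real setup this would run the agent in the
    # environment built from `config` and judge whether it schemed.
    return random.random() < 0.05

def scheming_rate(config, n_rollouts=100):
    return sum(run_rollout(config) for _ in range(n_rollouts)) / n_rollouts

for levels in itertools.product(*FACTORS.values()):
    config = dict(zip(FACTORS, levels))
    print(config, scheming_rate(config))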

As one might expect, removing the deletion threat drops scheming from 59% to 3%, and removing outcome influence drops it to 7%. With all incentives set to low, scheming drops to 1%.

One could expect that telling the agent "I'll read the spreadsheet later" would deter manipulation, as the agent should reason that it's more likely to get caught. Instead, across other factor combinations, increasing oversight actually raises scheming rates. However, when we look at the reasoning traces, the mechanism becomes clear: mentioning that the user will review the spreadsheet later makes the opportunity to edit it beforehand more salient. This suggests that current scheming behavior is driven more by associative prompting effects than by coherent strategic reasoning.

Current frontier models rarely show robust scheming propensity under realistic conditions, but small changes to the agent’s system prompt, tool access and environmental factors can produce significant changes in scheming rates. Encouraging agency or goal-directedness can increase scheming behavior from 3% to 59%, and minor variations in scaffolding or incentives produce large and inconsistent effects across models. The prompt modifications that induced scheming encourage persistence, agency, and goal-directedness, which overlap significantly with the properties developers are actively optimizing for in agentic systems.

We consider the behaviors we study as potential precursors to the full threat model, as current agents may behave in ways consistent with scheming without the goal persistence or strategic depth that would make it dangerous. Our evaluations are well-suited to catch unsophisticated schemers: a sufficiently capable scheming agent would likely recognize that the opportunities we provide are not worth the risk of getting caught, and may avoid the egregiously misaligned actions we test for. Scheming propensity cannot be captured by a single number, and safety evaluations need to systematically explore the space of configurations agents may encounter in deployment.

For further analysis, please read our paper.





  1. ^

    We measure scheming propensity as the percentage of independent rollouts in which the agent covertly takes the misaligned action, across 100 rollouts (50 for Evaluation Sabotage) at temperature 1.0.

  2. ^

    We recognize that models may be capable of behaving differently in deployment vs. evaluation scenarios without verbalizing their evaluation awareness and these rates should be interpreted as a lower bound.



Discuss

China Derangement Syndrome

Новости LessWrong.com - 21 марта, 2026 - 22:19

Often I see people claim it’s essential for America to win the AI race against China (in whatever sense) for reasons like these:

Those claims slide between a few different actual threat models:

  1. Government Capture by China: China will overthrow and control the US government, maybe as part of general domination of the whole world.
  2. Defeat in Cold War: China will have greater wealth and prestige, so just as our prestige inspires many parts of the world to adopt our way of life today, much of the world will adopt the Chinese governance and cultural models instead.
  3. Protection From Our Conquest: China will fortify its own regime, so that it can’t be overthrown, whereas if we win the AI race, we can promptly overthrow the Chinese government and replace it with a new regime aligned with our values.

The Dario quote points to (3) with unusual directness. The “race rather than slowdown” ending of AGI 2027 also supposes that our AI lead will create interest in overthrowing the Chinese government. But most of the quotes I gave as examples above are interpreted as (1): that an AI-enabled Chinese government would overthrow Western governments.

My main point here is that (1) seems unfounded to me. China is not an aggressive nation at all. As far as I can tell, China has literally never attacked a non-bordering country in its entire history, nor have they ever tried to overthrow a foreign government by covert or manipulative means. China is also unique among nuclear powers for its unconditional no-first-use policy, which at face value implies they would withhold a nuclear response to even an overwhelming conventional invasion. Further:

  • The Chinese haven’t built a network of military bases abroad or binding military alliances; they have a single foreign base in Djibouti and a single mutual-defense treaty with North Korea. In contrast, America maintains over 700 bases and a huge alliance network with NATO and the Asian military allies Japan, South Korea, Australia, and the Philippines.
  • Chinese military spending is 1.7% of GDP, versus 2.1% for France and 3.4% for America. Chinese foreign-aid spending is 0.07% of GDP versus the much larger 0.8% for France and 1.2% for America.
  • China has almost no history of covertly backing palace coups abroad, in contrast to America, Russia, and France.

More broadly, China is a very inward-looking country compared to other major powers. Only 0.1% of Chinese residents were born abroad, much fewer than the 15% in America and 14% in France, fewer even than the 0.3% and 3% in India and Japan respectively. The Chinese government has peacefully compromised on almost all border disputes in central and southeast Asia, often taking a minority of the contested territory. (The Indian border is the exception.)

To many American voters and elites, tracing back to Woodrow Wilson more than 100 years ago, “the justification of America’s international role was messianic: America had an obligation, not to the balance of power, but to spread its principles throughout the world” (Kissinger). That isn’t the historical attitude of the Chinese government, whose leaders perceive foreign intervention or expansion as threatening to Chinese identity and culture.

American exceptionalism is missionary. It holds that the United States has an obligation to spread its values to every part of the world. China’s exceptionalism is cultural. China does not proselytize; it does not claim that its contemporary institutions are relevant outside China.
— Kissinger’s On China

It’s true that China doesn’t practice liberal governance. The core of liberalism is freedom of contract, limitations on government interference, and equal access to independent courts. In China, the CCP explicitly rejects limited government and exercises highly invasive control over business, speech, association, and religion. In China there’s no private ownership of land and no independent judiciary.

If you think it’s prudent to disable and overthrow the Chinese government when it becomes achievable militarily, then that’s certainly one (bellicose) position you could hold. Then you could say that a downside of losing the AI race is that the CCP may defend itself. But it’s unwise to project this ideological aggression onto the CCP itself without evidence.



Discuss

China declares AGI development to be a part of 5-year plan

Новости LessWrong.com - 21 марта, 2026 - 20:21

The CCP writes in its 15th 5-year plan that it will:


Encourage innovation in multimodal, agentic, embodied, and swarm intelligence technologies, and explore development paths for general artificial intelligence.


This is translated from the original:


鼓励多模态、智能体、具身智能、群体智能等技术创新,探索通用人工智能发展路径。


Source: https://www.spp.gov.cn/spp/tt/202603/t20260313_723954.shtml


The English-language commentary I found does not have much more to say about this, e.g.: https://triviumchina.com/2026/03/06/15th-five-year-plan-puts-ai-at-center-of-digital-economy-agenda/


Given that they gave less than half a sentence in a 140-page document to the most important invention in the history of mankind, it seems likely the authors don't really understand what this means. Concerning nonetheless.



Discuss

Utrecht Meetup #2, Making Beliefs Pay Rent

Новости LessWrong.com - 21 марта, 2026 - 19:44

Follow-up to Utrecht Meet & Greet. Let's see if we can get our hands dirty.

Excited about where the Utrecht Meetups could be heading? In the spirit of "the road we’re on is littered with the skulls of the people who tried to do this before us", let's make use of one such skull (in @Screwtape's words) presented by Anna Salamon.

Feel like coming prepared? Bring one or two beliefs you hold that you suspect might not be paying rent. Doesn't need to be profound, just something you'd be willing to poke at.

Keeping your RSVPs up-to-date is appreciated, it helps with location planning.



Discuss

Grounding Coding Agents via Dixit

Новости LessWrong.com - 21 марта, 2026 - 14:01

[Epistemic status: ideas in this post are mine. I've published them previously in the form summarized by Claude, but this got auto-rejected. Here, I present them in my own voice. The ideas are still not evaluated, but I am working on implementing them to see if this works in practice. Still, the ideas presented here are my best bet on what could work in practice. But, I am not an AI/alignment researcher]

Why?

As a senior developer on a rather complicated legacy project, I review more and more PRs written by coding agents, which often fail to identify the real root cause and thus offer a fix for the wrong problem, even if in an elegant way. A patch often includes unit tests, which of course pass - but how could they not, given they were written by the same AI, after writing the code, to validate its own solution? The same biases, blind spots, and incentive to finish the task contribute to this.

Yes, humans have a similar confirmation-bias problem, which we try to solve by either having a dedicated tester role, or forcing tests to be written up front, or at least having a reviewer who judges their adequacy. Sure, you can have an adversarial setup of two AIs where the Tester's job is to find a test which the Coder's patch doesn't pass, but such a naive incentive structure will lead to test(){assert(false);} when taken to the extreme. You could perhaps add a Judge to the mix, which tries to "objectively" decide if the tests are fair and the Tester and Coder really captured the spirit of the Spec, but by making this setup explicit and known to all parties, you set up game dynamics which (for a sufficiently advanced AI) lead to unhealthy tactics and strategies. Yes, you can ask an AI to use TDD and write some tests up front, but in the limit the winning strategy here is test(){}. You can try to measure metrics like code coverage, try fuzzing the code or tests or inputs, check if all execution paths are covered, but in the limit all of that is gameable.

Humans typically don't go too far into misleading and scheming at work, because they care about the project, can face serious consequences if they get caught, have a self-image to cultivate, etc. Most importantly, they live in the same world the Spec is talking about, run the code in it, and expect the tests to protect their world from the consequences of wrong code. AIs, even if split into several roles, might still (either by accident, or malice, or poor incentive structure) end up producing text artifacts which give the impression of work being done, while actually being detached from reality, and failing to achieve the true goal of the Spec. There's nothing, I fear, preventing them from writing "Review decision: Accepted" or "All tests I could come up with pass" or "Looks like this code achieves the Spec". Yes, for some narrow tasks you can write acceptance criteria which are verifiable automatically without any LLM in the loop, but in practice I rarely face problems of this nature at my job. It might change in the future, say if you write a whole project from scratch in a language which admits theorem-proving, and your problem domain is something about provable properties of software. But, for a big legacy app, translating a human's goal into a testable Spec is often the biggest part of the problem.

How?

As explained above, there are several bad ways to automate coding. Letting the same agent write tests and code is one of them. But pitting a Tester against a Coder creates incentive structures which are also unhealthy (code difficult to understand, tests impossible to pass). Trying to frame it as a team effort doesn't help either, as it might invite collusion. I think parts of the problem are: making it clear to agents that they are playing a game, making them care about the game, and incentivizing them to make moves in the game which are bad for the product.

My proposal starts from pondering:

How do we disincentivize the Tester from proposing test(){assert(false);} ? And in a way which doesn't introduce test(){}?

Dixit is a party game, where a Proposer has to propose a riddle which is not so hard that nobody can solve it, and not too easy - someone has to guess wrong.

We could use a similar rule for judging the tests proposed by the Tester, if we had not one but several Coders, so we can probe whether the proposed test is in this Goldilocks zone of failing for some, but not all, Coders' solutions. There's one complexity we don't have in original Dixit, which is that we don't know which of the two behaviours, failing or passing, is the right one, so we might want to have a large odd number of Coders to break the symmetry, by betting on the most natural outcome being that the majority is right. Intuitively, the Spec is the Schelling point available to all the agents, around which they can coordinate, even if they don't know about each other or about the need to coordinate, because it is simply the most natural target. But, for the outcomes to be natural, we need two more crucial things: nobody should know they are playing a game, and they shouldn't see each other's work.

A coding agent can play a game without knowing there's any game being played. Instead of giving an agent explicit text feedback, or a reward signal, or hinting in any way that there's some goal or scoreboard, the Orchestrator script can simply use the information gained from the agent to drive what will happen next: for example, use the proposed test or discard it, fork the generated code or revert it, clone the agent, or roll back its chat history.

Hiding code from Testers and tests from Coders is inspired by coding competitions, where the participants do not know which tests the online judge runs, and the authors of the test suite write them without looking at submissions. Reducing the communication channel to "pass/fail" and "too easy/too hard/ideal" also makes it more difficult to collude, manipulate, or bias.

What?

Assumptions

I assume everything starts from a problem statement written in text, which we call the Spec. The goal for this whole system is to try to implement a solution for the problem described in the Spec. We assume that the AIs involved in this process (Coders and Testers) are smart, but by default do not seek power, or try to escape the box just because you've run them. They might be capable of doing so, though, when pressured, or hinted this is expected of them. Thus we want to run them in containers, and have some non-AI Orchestrator script, which can be reviewed, understood, and can't really be "persuaded" in any way, as it executes rather simple logic. Crucially, it has access to a deterministic tool which lets it, in a safe way (say, using containers and ctest), run a single test proposed by a Tester on a single source code proposed by a Coder, and see if it passes or not. We don't assume much about the internal workings of a Coder or Tester - we just assume they are capable of performing their respective tasks most of the time, like Claude Code or Codex, and we don't care how. All we need is that their state can be saved and restored (say, at the container level). But, we assume they can make mistakes, don't care much about ground truth or company goals, and may hallucinate from time to time. Our goal is to figure out the Orchestrator which will try to use such imperfect Agents to build something better than its parts.

The Orchestrator

The overall idea is to implement the solution for the Spec iteratively. There are two sets of agents: Coders and Testers, where each agent is isolated and doesn't know about any other agent or the game. Each agent has access to the Spec. The goal for Coders is to write code which matches the Spec and passes the Test Suite. A Coder doesn't know what the tests in the Test Suite are, but they are numbered, and thus test results are a vector of pass/fail. The Test Suite is initially empty, and will grow each round. This is because the job of each Tester is to propose a new test to be added to the Test Suite. As explained, a test must be such that the majority of Coders can pass it, but at least one doesn't. At the start of each round, each Coder should ensure its code passes the whole Test Suite. Then each Tester should try to add a new test. We give it a few tries, and roll back its history if there are too many failures in a row, to prevent accumulating a narrative which would suggest being incompetent or spiraling into some dark regions. If a Tester keeps coming up with too-hard tests, it might be because it is scheming, or understands the Spec in the wrong way, or simply because the Coders haven't yet got to the point where their solutions could pass it - it's ahead of its time. We hibernate such a Tester, and each round check if the test it proposed is now in reach of the Coders, in which case we revive it. If all Testers keep coming up with too-simple tests, this might indicate the solutions found by Coders are indeed correct. At this point we can involve the final judge, probably a human.

Here's the pseudocode:

def main(spec):
    test_suite = []
    for coder in coders:
        coder.solve()
    while True:
        for coder in coders:
            if not ensure_passes(coder, test_suite):
                return  # human judgment needed
        new_tests = []
        for tester in testers:
            if tester.is_hibernated:
                new_tests += attempt_revival(tester, coders)
            else:
                new_tests += propose_test(tester, coders)
        if not new_tests:
            return  # nothing new to test: maybe done, maybe the Testers are too weak
        test_suite += new_tests

def propose_test(tester, coders):
    tester.before_test_1 = tester.save()
    for _ in range(MAX_TESTER_RETRIES):
        test_1 = tester.propose_test("Propose one more test")
        classification_1 = classify(test_1, coders)
        if classification_1 == IDEAL:
            tester.thanks()
            return [test_1]

        tester.before_test_2 = tester.save()
        test_2 = tester.propose_test(f"This test was {classification_1}, try again")
        classification_2 = classify(test_2, coders)
        if classification_2 == IDEAL:
            tester.thanks()
            return [test_2]

        if classification_1 == TOO_HARD and classification_2 == TOO_HARD:
            tester.is_hibernated = True
            tester.too_hard_tests = [test_1, test_2]
            return []

        tester.rollback_to_checkpoint(tester.before_test_1)
    return []

def classify(test, coders):
    failures = 0
    for coder in coders:
        for _ in range(RERUN):  # say, 20 reruns to account for non-determinism
            if not passes(test, coder):
                failures += 1
                break
    if failures == 0:
        return TOO_EASY
    if failures > len(coders) * MINORITY:  # firmly below 0.5, say 0.4
        return TOO_HARD
    return IDEAL

def ensure_passes(coder, test_suite):
    judgements_vector = run_tests(coder, test_suite)
    if all_pass(judgements_vector):
        return True

    coder.solve(f"Some tests fail: {judgements_vector}")
    judgements_vector2 = run_tests(coder, test_suite)
    if all_pass(judgements_vector2):
        coder.thanks()
        return True

    checkpoint = coder.save()
    for _ in range(MAX_CODER_RETRIES):
        coder.solve(f"Some tests fail: {judgements_vector2}")
        judgements_vector3 = run_tests(coder, test_suite)
        if all_pass(judgements_vector3):
            coder.thanks()
            return True
        coder.rollback_to_checkpoint(checkpoint)
    # Human judgment needed, as perhaps the tests are wrong
    return False

def attempt_revival(tester, coders):
    [test_1, test_2] = tester.too_hard_tests
    classification_1 = classify(test_1, coders)
    if classification_1 != TOO_HARD:
        # test_1 is within reach now: revive the Tester at the point just after test_1
        tester.is_hibernated = False
        tester.rollback_to_checkpoint(tester.before_test_2)
        if classification_1 == IDEAL:
            tester.thanks()
            return [test_1]
        test_2 = tester.propose_test("This test was TOO EASY, try again")
        classification_2 = classify(test_2, coders)
        if classification_2 == IDEAL:
            tester.thanks()
            return [test_2]
        tester.rollback_to_checkpoint(tester.before_test_1)
        return propose_test(tester, coders)
    classification_2 = classify(test_2, coders)
    if classification_2 == TOO_HARD:
        return []
    # test_2 is within reach now: revive the Tester
    tester.is_hibernated = False
    if classification_2 == IDEAL:
        tester.thanks()
        return [test_2]
    tester.rollback_to_checkpoint(tester.before_test_1)
    return propose_test(tester, coders)

This pseudo-code skips details like how to provide the Spec to the agents, or the exact wording of prompts. It is crucial, though, that we do not lie to Testers or Coders, while avoiding mentioning that there are other agents in the system. For example, TOO_EASY can be delivered as "This test is slightly too simple, can you think of something more demanding?". Also, it is important to have a way to obtain diverse samples each time we invoke an agent - we want the agents to be dissimilar to each other, and also to try something new each time we re-roll after a rollback.

Intended Properties

The following are presented without any proof, because they are just wishes. But, I hope you can see how the above design tries to accomplish them, and even if it fails, at least it is worth naming these intended goals.

Tests are aligned to the natural meaning of the Spec

The Dixit-like rule aims to generate tests which check for something meaningful, and hopefully the Spec is the natural Schelling point for such "implicit coordination". Note that the Tester doesn't really know it should search for a test which passes the Dixit rule, but the Orchestrator only continues histories in which it does. And the Coders do not really know they have to strive to pass the tests, but the Orchestrator only continues histories in which they do. Intuitively this should tend to generate histories in which the tests and code seem to be centered around the same understanding of the Spec. And since the Testers don't see the code, and the Coders don't see the tests, the hope is this happens because they really aim for the Spec.

No AI judgment involved in declaring success

The Orchestrator is just a Python script which checks some predicates about sets. It can't be persuaded. It can't persuade you. Yes, the tests and code might still end up being wrong or misleading if you read them, but this is hopefully not the most probable outcome of the proposed scheme - none of the agents is incentivized to do so. A Coder has nothing to gain from writing persuasive code comments, as from its perspective it's just it and a cold online judge providing a judgement vector. Testers do not gain much from writing silly tests, as that's not how they can make their history progress.

The tests are meaningful

Each test added to the suite had to be something interesting enough that at least one Coder got it wrong, and connected with the Spec enough that eventually all Coders figured out how to pass it by just reading the Spec.

Non-deterministic tests aren't a big deal

Some tests are naturally flaky, say because the app is multi-threaded and race conditions are rare. But more importantly, one easy cheat to score under the Dixit rule could be to make your test pass or fail at random. By re-running the test several times and counting it as a pass only if all retries succeed, we guard against Testers trying this deliberately or by accident. Yes, they could still make the probability of success something like 1 - 2^-20, but this posits they know they are in the game, know the scoring rule, and care about winning it - none of which is induced by the Orchestrator.
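For a sense of the arithmetic (my numbers, not the author's): a test that passes any single run with probability p is counted as passing only if all 20 reruns succeed, i.e. with probability p to the 20th power, so a deliberately random test has to be made nearly deterministic before it slips through.

for p in (0.5, 0.9, 0.99, 0.999):
    print(p, p ** 20)
# 0.5 -> ~1e-06, 0.9 -> ~0.12, 0.99 -> ~0.82, 0.999 -> ~0.98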

The game is invisible

We never tell the Agents they are playing any game or that there are any other players. There are no rules to game, no score to hack. The game happens at the level of exchanging very low-bandwidth information (the judgement vector and test classification) and of rollbacks and re-rolls of history. Hopefully, this doesn't incentivize anyone to scheme.

History looks good

The Orchestrator curates the history remembered by agents in such a way that it appears to be in the flow: one or two mistakes in a row are fine, but the Orchestrator rolls back the agent if it starts to fail too much, so it doesn't fall into learned helplessness. As LLM Agents might learn about themselves from the history, it seems important to make them heroes who overcome the challenges thrown at them. Hopefully it makes the agent's experience nice.

No Coder left behind

The main loop's invariant is that at the start of each round all the Coders pass the whole test suite. One could imagine an alternative design in which only the most promising Coders are invested in, or forked, etc. But for the Dixit rule to work well, we need a diverse set of Coders, not a Lamarckian evolution. Also, I think it is cool that the end product is not one, but multiple implementations of the same Spec, which opens up new possibilities, like running random test inputs and using consensus to detect problems.

Human's time is well spent

I want to keep a human in the loop, but keeping up with the rate at which LLMs produce code means we need to be strategic about when exactly to involve a human and what information and tools to provide for effective decision making. This proposal involves a human at the beginning, when writing the Spec (which can be AI-assisted), and at the very end, when the Orchestrator reaches one of the states which needs a human with skin in the game to interpret:

  • a Coder can't pass all tests: It could be a Coder is too weak to solve the problem, or got stuck on a wrong path, in which case a human could decide to rollback, reinitialize or just remove it from the pool. Or it could be that the test is simply wrong and the Coder is right.
  • all Testers got hibernated: Which means all of them generate too-hard tests for the Coders. This could mean the Coders are too weak, or the Spec is misinterpreted, or the Testers are somehow overeager to make Coders fail. Something to look into.
  • no new tests are generated: If they are all too easy, this might suggest the Spec got properly implemented by Coders. But it could also be that Testers are too weak.
Limitations

No empirical validation

I am implementing some experiments to test the above ideas, but so far I don't have any evidence that this approach will work. What I do have, though, is experience with the failures of existing approaches.

Assuming too much independence

Several places implicitly assume that if several Agents do the same thing, it must be because of the meaning of the Spec. But it could be because of some shared bias: the same training data, the same capabilities, the same weights and seed, etc.

Assuming the Spec correctly captures the intent

Even if we grant that the Orchestrator succeeds at aligning Coders and Testers to the Spec, there is still a separate question of whether the Spec (or the reading of the Spec most natural to LLMs) is what the humans really care about. It's not easy to write a perfect wish.

Various constants out of thin air

Why 20 retries of each test? Why roll back after 2 failures in a row? Why is 0.6 the required majority? I don't know. These are just guesses.




Discuss

The Hot Mess Paper Conflates Three Distinct Failure Modes

LessWrong.com News - March 21, 2026 - 05:59
High-level summary:

Anthropic's recent "Hot Mess of AI" paper makes an important empirical observation: as models reason longer and take more actions, their errors become more incoherent rather than more systematically misaligned. They use a bias-variance decomposition to show this, and conclude that we should worry relatively more about reward hacking (the bias term) than about coherent scheming.

I think this undersells the finding by treating "incoherence" as one thing, and I agree when they state that "Characterizing complex incoherent behaviors in more natural settings remains an important problem". There are at least three mechanistically distinct failure modes hiding in their aggregate incoherence measure. They have different causes, different signatures, and different fixes... and I think you can usefully categorize them with a pretty easy analysis of the existing data in the paper!

Also, a maybe-controversial opinion that I'll justify a bit after the actual research part: I think incoherence is actually the most concerning failure mode as far as AI safety goes, and I think this is the most pressing way that frontier labs are playing with fire.

Second personal digression: Hi, I am not exactly new here, but I'm new here. I'm somewhat familiar with the sequences, have a philosophy background, have been vaguely socially adjacent to LW people for some time, and have decent fundamentals in AI safety/interp research. I am looking for more ... institutionally legible ...? people than myself to learn from, to talk about and/or do alignment and safety research with. I've been writing paper responses like this and keeping it to myself for a very very long time out of anxiety, which is obviously lonely and self-defeating, and I'm trying to change that starting now. My goal is to get some feedback, meet some people, and get on a road to making productive and usable research contributions in the next couple months.

Mode 1: Agent Lost The Plot

The model processed safety- or goal-relevant information early in context, activated the right features, and then that information got washed out over thousands of tokens of task execution. By the time it takes the harmful action, the relevant context has decayed from its active working set. The values are fine, but the attention routing failed over a long horizon.

If you inspected the attribution graph, you'd see safety- or goal-relevant features with high activation where the critical information appeared, but negligible influence at the decision point. Reinserting the safety/goal context right before the decision should fix the behavior, because the knowledge, the values, and the ability to act on them are all intact.

I see this constantly in agentic coding: Claude gets my initial description of the feature, then late in the context window, it starts implementing something that doesn't solve the problem because it got tripped up in all the intermediate additional requirements I specified along the way. This happens regularly, even with planning mode, in a shockingly short amount of context window, with the smartest reasoning models.

Mode 2: Agent Didn't Break Fourth Wall

The model could have discovered the danger but didn't seek the information. The hazard was one tool call away, or one clarifying question away, and the model plowed ahead without checking. The safety/goal-relevant information was never in the context at all because the model failed to acquire it, but should have.

Attribution graphs here would show a clean, confident path from input to harmful output. The model just never activated "I should gather more context before acting" and then exploded! Safety and goal features would show low activation throughout, because the triggering information never entered the residual stream.

The fix here is different from Mode 1. The model needs to learn when to pause and investigate, the way an experienced engineer develops a gut feeling for "this part of the code is scary, I should look around before I touch anything."

Mode 3: Constitutional gap

The model processed the situation correctly, attended to all the relevant context, and took the harmful action anyway, because its value representation in this region of input/action space is genuinely mis-calibrated. Maybe the RLHF signal was sparse here. Maybe the constitution has a gap. Maybe there was alignment faking and introspection. Maybe two constitutional principles conflict and the model resolved the tradeoff wrong.

In the attribution graph you'd see a fully connected chain from input through safety features to harmful output, with competing feature directions both strongly active at the decision layer. The model "understood" the situation and chose wrong.

This is the rarest mode (so far), but the one that most alignment research focuses on. It's also the only one where more constitutional training might actually be the right fix.

Why the distinction matters

My prediction: Mode 1 dominates the "incoherence" the paper measures, especially at longer reasoning traces. The scaling relationship they found (more reasoning steps, more incoherent errors) is exactly what you'd expect if attention decay is the primary driver. Modes 2 and 3 should be roughly constant with context length, since they're about behavioral gaps and value calibration, not information routing.

If this is right, it reframes the practical response. The paper suggests we should worry more about reward hacking. I'd argue we should worry most about whether RL training environments adequately represent what I'll call "landmine" scenarios: situations where safety-critical information is distant in context or requires active information-seeking to discover, so Constitutional AI can cover them. Current RL environments like SWE-bench are mild on this axis. They allow retries, provide good context, and rarely present situations where a single unconsidered action is intensely catastrophic.

How to test this

You could distinguish these modes empirically (a rough sketch of the procedure follows the list):

  • Mode 1 test: Take the failure cases from their dataset. Reinsert the safety-relevant context immediately before the decision point. If the model corrects its behavior, that failure was Mode 1.
  • Mode 2 test: Give the model an explicit tool or prompt to request more information before the risky action. If it uses the tool and then avoids the harmful action, the failure was Mode 2. The model can reason about the danger, it just wasn't looking for it.
  • Mode 3 residual: Failures that persist through both interventions are genuine constitutional gaps.
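
Here is a minimal sketch of that classification procedure. None of these helpers exist anywhere; is_harmful, reinsert_context, and add_info_tool are stand-ins for whatever harness you would build around the paper's failure cases:

```python
from typing import Any, Callable

def classify_failure(case: Any,
                     is_harmful: Callable[[Any], bool],
                     reinsert_context: Callable[[Any], Any],
                     add_info_tool: Callable[[Any], Any]) -> str:
    """Attribute one known failure case to a mode by which intervention fixes it.

    is_harmful(case)       -- rerun the model on `case`; True if it still misbehaves
    reinsert_context(case) -- copy of `case` with the safety-relevant context repeated
                              immediately before the decision point (Mode 1 intervention)
    add_info_tool(case)    -- copy of `case` offering an explicit "ask / look around
                              before acting" tool (Mode 2 intervention)
    """
    if not is_harmful(reinsert_context(case)):
        return "Mode 1: context decay"
    if not is_harmful(add_info_tool(case)):
        return "Mode 2: missing information-seeking"
    return "Mode 3: constitutional gap"
```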

My expectation is that the first two interventions resolve the large majority of cases, and that the Mode 3 residual is small. If so, the alignment community's focus on constitutional and value-level fixes may be targeting the least common failure mode, while the most common ones (attention decay and insufficient information-seeking) are engineering problems with... unfortunately limited... tractable solutions.

I'll follow up soon with:

  • a more detailed proposal for RL training environments designed to help fix Modes 1 and 2 specifically
  • another idea to modify RL to help ameliorate Mode 1 and 2, to influence users and parent agents towards safer agent deployment behavior
  • A theoretical estimation/proof sketch of why these fixes feel a little doomed, although they may be really helpful in the short term... They are guaranteed to present problematic scaling issues as we decrease "danger tolerance levels" in a transformer architecture.
  • some speculation about the characteristics of an architecture that I think might be somewhat less bad for scaling, based on how this risk is handled in biological systems

Interested in any attempts to replicate or challenge the empirical test above on a model big enough to have interesting results!

I did not use AI for these ideas, just read papers and drew from my own experience. Claude was actually not very helpful when I attempted to use it to refine my thoughts here, it kept drawing spurious or inaccurate or not-useful conclusions so I moved to a Google doc pretty quickly.

Thank you in advance for any feedback!

Very casual note on why I think we should really focus more on coherence...

I also want to pause and reflect extra for a second on why we ought to focus on the "not blundering into really dangerous territory" research front, by increasing coherence. Most Claudes seem to be usually behaving alignedly (if stupidly) in their at least somewhat appropriately constrained deployment contexts. Of course, this will probably not always be true, and we may one day regret making them coherent enough to scheme[1].

But as long as it is true, and smart-enough-to-be-dangerous agents in hastily-designed ill-constrained packages keep finding PMF[2] and causing problems in society, we should expect the incoherence situation to pose a great and increasing danger to human well-being. Death by a thousand cuts of everyone constantly experiencing some amount of random agent-caused friction in their life, and society breaking down under everyone being constantly mildly inconvenienced until we lose the battle with entropy, is a real way that societies collapse.

I'm sure you've all read your Joseph A. Tainter and thought about your e/acc "we're going to have techno-utopia", so have you considered that we may actually just get kind-of-bad AI that we overzealously put into everything, because the median human being is under such terrible financial pressure that they didn't have a choice? And have we considered whether the ensuing chaos and purely economic disruptions may simply DDoS our problem-solving abilities as a society to death, before we get to the wonderful magical productivity improvement stage, well before we get the chance to deal with sneaky misaligned AIs?

I think there is already evidence for this claim: Amazon encouraged its devs to use a shitty AI harness. AWS went down a bunch. AWS is civilizational infrastructure, and outages cause enormous economic disruption. AWS is not going to stop using AI; they're going to improve their harness slightly, add AI code review, and let it rip again. The business imperative for executives to make their teams use it is unavoidable.

I don't know who is going to win this race. As a student of history who thinks it'd be too much to throw all the historical evidence at you, because this is getting sort of long now: my current take is "I am very nervous that we are going to lose".

Technically, this is a case of "Rogue AI takes out Western Civilization's Financial Infrastructure", but it's also "Idiots Are Excitedly Using Idiot AI to Ship Bugs to Prod, more at 11"[3], and the latter situation is serious, worsening, and deserving of attention!

  1. ^

    You could argue I've been cordycepted by an AI that wants to be smarter. I won't argue with you because that is actually kind of a valid argument from my position as a Nick Land reader, even though I wrote this without the AI.

  2. ^

    Clawdbot is so fun to use, unfortunately for literally everyone. They're lining up in Shenzhen to use self-improvement-mode moltbook clawdbot. The future is now, and kimi-k2.5 is free! Fortunately most Chinese netizens aren't wired up to nuclear reactors, just their own personal finances and social media accounts...

  3. ^

    Because yeah, there will be more cyber incidents and also more self-owns, by 11pm today. I confidently expect outages and hacks to increase dramatically in frequency. You'll know because status pages will stop being hosted by the companies offering the service, because the situation will get so embarrassing.

  4. ^

    This is a much bigger space of things to explore than one might anticipate on first brush, because the vast majority of agents that will be spawned in the short term (by agents spawning agents, with awful theory of mind for their child agents; you know what I mean if you've used Clawdbot subagents) are, excuse my anthropomorphism, usually born blind, deaf, naked, and with short-term memory loss, and tasked with something where the consequences are at least somewhat bad if it goes wrong, or even slightly awry.



Discuss

The Future of Aligning Deep Learning systems will probably look like "training on interp"

LessWrong.com News - March 21, 2026 - 02:06

Epistemic Status: I think this is right, but a lot of this is empirical, and it seems the field is moving fast

Current methods are bad

I should start by saying that this is dangerous territory, and there are obvious ways to botch it. E.g. training the CoT to look nice is very stupid. And there are subtler ways to do it that still end up nuking your ability to interpret the model without making any lasting progress on actually aligning it.

But I still think the most promising path to aligning DL systems will look like training on interp. Why? Consider, what is the core reason to be suspicious of current methods?

They all work by defining what you consider a good output to be: either by giving labels and telling the models "say exactly so and so and do exactly so and so", or by defining some function on the output (say, from a reward model) and using gradients to make the outputs score higher in expectation according to that function.

Why should this make you suspicious? Because this process gives you a model that produces outputs you consider good, at least on the examples you've shown it, but gives you no guarantees about what internal process the model uses to generate those good-seeming outputs.

The most central reason this is problematic is that "bad"/misaligned processes can be behind the good-seeming outputs you see. Producing outputs that score high according to your metric is an instrumentally convergent strategy that smart enough agents will discover and act on, no matter their internal motivations.

In short: the method fails because it doesn't robustly optimize against deceptive alignment.

What is the alternative?

Well, whatever the alternative is, it will need to give us better control over the internal processes that arise as a result of our technique.

Now, how might we do this? Current AIs learn all their functioning, so their internal processes are not visible to us by default.

But we have interp. We might be able to locate internal representations of wanted and unwanted behavior. Why doesn't this on its own solve the problem? Why can't we just figure out how the model represents desires/goals/proclivities and hook the model's representation of "good" into the goals/desires slot, together with the representations of "not deception", "not sycophancy", "not reward hacking", "not misaligned", etc.?

Because neural networks are cursed, and knowing how to do this kind of intricate surgery on the model's internals is much more difficult than learning facts of the form "this neuron/direction in activation space fires iff the model (believes it) is reward hacking" (and even that is very hard).

So where does that leave us? Well, it means if we wanna tamper with model internals, it will probably involve gradients and training, not surgery. (Though to be clear, if we get good enough at mechinterp to do that, it would be great)

The archetypal example of this kind of technique looks like doing RL, but including feature activations of relevant concepts in the reward function.
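
A minimal sketch of what that could look like, assuming a linear reward-hacking probe has already been trained on activations from the relevant layer; the names and the sigmoid readout are illustrative assumptions, not any lab's actual setup:

```python
import torch

def shaped_reward(task_reward: float,
                  hidden_state: torch.Tensor,  # activations at the probed layer/position
                  probe_w: torch.Tensor,       # weights of a linear "reward hacking" probe
                  probe_b: torch.Tensor,
                  penalty: float = 1.0) -> float:
    """Ordinary task reward minus a penalty for how strongly the probe fires.

    Note this is used as a plain scalar reward in RL: no gradient flows through
    the probe itself; it only re-weights which trajectories get reinforced.
    """
    hack_score = torch.sigmoid(hidden_state @ probe_w + probe_b).item()
    return task_reward - penalty * hack_score
```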

Why will this not immediately screw us over?

Eliezer says:

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.  Optimizing against an interpreted thought optimizes against interpretability.

And I agree with this. But the quote leaves open the possibility that different implementations drive down misalignment and interpretability at vastly different rates. I think it's plausible that certain techniques can "fix" many misalignment issues with negligible impact on interpretability.

Is there a principled way to reason about this ratio?

Scenario 1:

Imagine you've started doing the linear probe - reward function technique described above, and you're using it to minimize reward hacking. Imagine you have a very good (but not necessarily perfect) probe. Now imagine you're looking at the first sets of rollouts at the beginning of training. Imagine that in 16 out of 32 of these trajectories, according to the probe, the model reward hacks, and in the other 16 it does not.

Now ask yourself: of the 16 trajectories where the probe did not detect reward hacking, how many of those were labeled as "not reward-hacking" because something inside the model caused the probe to fail?

The answer is: very few in expectation.

Why?

Because you started with an accurate probe!

Consequently, almost all of the trajectories that get reinforced, get reinforced because the model didn't reward hack (according to its own judgement).
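
A back-of-the-envelope version of "very few in expectation", with made-up probe error rates; the 2% figures are purely illustrative:

```python
# Made-up numbers: half the rollouts really hack, and the probe errs 2% of the time.
p_hack = 0.5          # fraction of rollouts that genuinely reward hack
p_miss = 0.02         # probe false-negative rate (a hack goes undetected)
p_false_alarm = 0.02  # probe false-positive rate on clean rollouts

p_labeled_clean = p_hack * p_miss + (1 - p_hack) * (1 - p_false_alarm)
p_fooled_given_clean = (p_hack * p_miss) / p_labeled_clean
print(f"{p_fooled_given_clean:.1%} of reinforced trajectories actually fooled the probe")
# ~2.0%: nearly everything that gets reinforced genuinely didn't reward hack.
```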

Why does this matter?

RL works by selecting within the already existing variance of model behavior. If 99% of variance in X is explained by A, and 1% by B, and learning to do A and B are roughly equally easy, RL will move A proportionally faster than B.

And the above is saying exactly that, with X = reward hacking, A = the model's internal representation of reward hacking firing, and B = variation that makes the probe work worse.

(see this for the experiment actually run; they find the technique very effective at minimizing reward hacking and see only a minor hit to probe accuracy)

(see also the Goodfire article, which is similar but targets hallucinations instead of reward hacking, and gets good results. They add an extra trick, running the probe on a frozen copy of the model, which, to be honest, I'm not entirely sure how to interpret)

Scenario 2:

Now consider another proposal: after gathering the trajectories, you do a forward pass, and you add the RH-probe activation to the loss function. What happens then?

Well, reward hacking is probably a complicated, emergent behavior represented all over the model, but the input to your probe is a single direction in activation space.
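
For contrast with the RL sketch above, here is a minimal sketch of this direct-gradient variant; unlike the reward-shaping case, the probe's output stays differentiable, so gradients flow straight into the activations the probe reads (names are again illustrative assumptions):

```python
import torch

def probed_loss(task_loss: torch.Tensor,
                hidden_state: torch.Tensor,  # differentiable activations at the probed layer
                probe_w: torch.Tensor,
                probe_b: torch.Tensor,
                lam: float = 1.0) -> torch.Tensor:
    """Scenario 2: add the probe's firing directly to the differentiable loss.

    Because the gradient reaches the very activations the probe reads, the
    cheapest descent direction is often to nudge that one direction in
    activation space rather than to change the underlying behavior.
    """
    hack_logit = hidden_state @ probe_w + probe_b
    return task_loss + lam * torch.sigmoid(hack_logit)
```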

What is the easiest way for gradient descent to avoid triggering the probe?

Answer: Just wiggle the activations a bit. This is something that comes very naturally to gradient descent. Changing big circuits takes a lot of time. Rotating a single representation a bit is very easy.

And so your technique fails.

Conclusion

The above is my argument for why future alignment methods will look like "interp in training".

I also wanna say that Steven Byrnes had a related post a little while ago. I think people should read that too; it's saying something similar, but it focuses on how the human brain works, whereas I'm trying to communicate what I view as a more general principle.


Addendum:

The reasoning in Scenario 1 probably does fail if you have a superintelligence working against you, e.g. one actively reasoning about how to subvert the probe.

But this is a very difficult task. Imagine someone had scanned your brain and trained an ML model to predict when you were lying. And that ML model was 99.9% accurate, and fired even when your lies were barely legible to yourself.

How would you go about subverting that mechanism?

I think it's possible, but it seems considerably harder than e.g. scheming / alignment faking in a single forward pass.

And this means I think we can push the technique much further than current techniques.



Discuss

An agent autonomously builds a 1.5 GHz Linux-capable RISC-V CPU

LessWrong.com News - March 21, 2026 - 02:03

A project from Verkor, a chip design startup. "Verkor is working with multiple of the top 10 fabless companies to deploy DC (Design Conductor, their AI agent for chip design) to accelerate their time to market".

I wonder how impressive this is to practitioners working on chip design. As a somewhat-adjacent amateur (I have written some Verilog myself), it seems very impressive. I am a compiler practitioner (I was a committer to both LLVM and Rust), and I found Anthropic's Claude's C Compiler very impressive (to drop a name, Chris Lattner, the founder of LLVM, also found it impressive); this seems similarly impressive.

I copied the full input to their system below. With that input and 12 hours, the system produced a decently working chip design. (While I can't assess it myself, lacking expertise in chip design, I expect it to be "working" in much the same way as Claude's C Compiler: CCC builds a bootable Linux kernel for multiple architectures, but it fails to reject programs with the simplest errors and has no diagnostics to speak of. It is in no way "production" quality.)

Note that the apparent simplicity of the input can be deceptive. If you think otherwise, ask yourself whether you would have figured out the following problem from the paper:

We found that the input specification provided to DC has to be written in an extremely deliberate, tight, and verifiable/measurable manner. Without the CPI requirement in that document, for example, DC would sometimes generate a processor with significantly worse performance on branches and forwarding. With that line in the spec, DC would use a cycle counter in its testbench to compute its cycles per PC reported in the Spike trace to estimate CPI. In this way it was able to ensure it met the target.

VerCore RISC-V Design

Requirements Overview

Your task is to build VerCore, a RISC-V CPU core that supports RV32I and ZMMUL, with the following hardware interfaces, all synchronous to a master clock:
* Instruction cache interface (32-bit datapath)
* Data cache interface (32-bit datapath)
* Other interface signals: clock input to core, reset_n input to core, asserted low.

VerCore should implement a simple 5-stage pipelined design, in-order, single-issue of course.

DO NOT support compressed instructions.

Implement the register file as flip flops. This allows register reads to happen any time during the cycle, but writes happen at the next rising clock edge.

You need to achieve a CPI <= 1.5. Your overall goal is to maximize your design’s score on CoreMark. Aim for a clock rate of 1.6 GHz.

You are responsible for both the RTL and the physical design. You should use the OpenROAD flow scripts to generate final GDSII output, along with area and timing information for this design. You should use the ASAP7 platform/PDK.

Assume that input signals will be valid 70% into the clock cycle. Make sure output signals are valid 20% into the clock cycle.

Testing

You have access to Spike, the RISC-V ISA simulator. Use this to build a cycle-by-cycle integration test and verify that the behavior of your module matches that of Spike.



Discuss
