Defense-favoured coordination design sketches

LessWrong.com News - April 6, 2026 - 18:19
Intro

We think that near-term AI could make it much easier for groups to coordinate, find positive-sum deals, navigate tricky disagreements, and hold each other to account.

Partly, this is because AI will be able to process huge amounts of data quickly, making complex multi-party negotiations and discussions much more tractable. And partly it’s because secure enough AI systems would allow people to share sensitive information with trusted intermediaries without fear of broader disclosure, making it possible to coordinate around information that’s currently too sensitive to bring to the table, and to greatly improve our capacity for monitoring and transparency.

We want to help people imagine what this could look like. In this piece, we sketch six potential near-term technologies, ordered roughly by how achievable we think they are with present tech:[1]

  • Fast facilitation — Groups quickly surface key points of consensus and disagreement, and make decisions everyone can live with.
  • Automated negotiation — Complicated bargains are discovered quickly via automated negotiation on behalf of each party, mediated by trusted neutral systems that search for mutually acceptable agreements.
  • Arbitrarily easy arbitration — Disputes are resolved cheaply and quickly by verifiably neutral AI adjudicators.
  • Background networking — People who should know each other get connected (perhaps even before they know to go looking), enabling mutually beneficial trade, coalition building, and more.
  • Structured transparency for democratic oversight — Citizens hold their institutions to account in a fine-grained way, without compromising sensitive information.
  • Confidential monitoring and verification — Deals can be monitored and verified, even when this requires sharing highly sensitive information, by using trusted AI intermediaries which can’t disclose the information to counterparties.

We also sketch two cross-cutting technologies that support coordination:

  • AI delegates and preference elicitation — AI delegates can faithfully represent and act for a human principal, perhaps supported by customisable off-the-shelf agentic platforms that integrate across many kinds of tech.
  • Charter tech — The technologies above, or other coordination technologies, are applied to making governance dynamics more transparent, making it easier to anticipate how governance decisions will influence future coordination and to design institutions with this in mind.

An important note is that coordination technologies are open to abuse. You can coordinate to bad ends as well as good, and confidential coordination technologies in particular could enable things like price-fixing, crime rings, and even coup plots. Because the upsides to coordination are very high (including helping the rest of society to coordinate against these harms), we expect that on balance accelerating some versions of these technologies is beneficial. But this will be sensitive to exactly how coordination technologies are instantiated, and any projects in this direction need to take special care to mitigate these risks.

We’ll start by talking about why these tools matter, then look at the details of what these technologies might involve before discussing some cross-cutting issues at the end.

Why coordination tech matters

Today, many positive-sum trades get left on the table, and a lot of resources are wasted in negative-sum conflicts. Better coordination capabilities could lead to very large benefits, including:

  • Improving economic productivity across the board
  • Helping nations avoid wars and other destructive conflicts
  • Enabling larger groups to coordinate to avoid exploitation by a small few
  • Making democratic governance much more transparent, while protecting sensitive information

What’s more, getting these benefits might be close to necessary for navigating the transition to more powerful AI systems safely. Absent coordination, competitive pressures are likely to incentivise developers to race forward as fast as possible, potentially greatly increasing the risks we collectively run. If we become much better at coordination, we think it is much more likely that the relevant actors will be able to choose to be cautious (assuming that is the collectively-rational response).

However, coordination tech could also have significant harmful effects, through enabling:

  • AI companies to collude with each other against the interests of the rest of society[2]
  • A small group of actors to plot a coup
  • More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don’t incentivise prosociality to the same extent

Regardless of how these harms and benefits net out for ‘coordination tech’ overall, we currently think that:

  • The shape and impact of coordination tech is an important part of how things will unfold in the near term, and it’s good for people to be paying more attention to this.
  • We’re going to need some kinds of coordination tech to safely navigate the AI transition.
  • The devil is in the details. There are ways of advancing coordination tech which are positive in expectation, and ways of doing so which are harmful.
Why ‘defense-favoured’ coordination tech

That’s why we’ve called this piece ‘defense-favoured coordination tech’, not just ‘coordination tech’. We think generic acceleration of coordination tech is somewhat fraught — our excitement is about thoughtfully run projects which are sensitive to the possible harms, and target carefully chosen parts of the design space.

We’re not yet confident which the best bits of the space are, and we haven’t seen convincing analysis on this from others either. Part of the reason we’re publishing these design sketches is to encourage and facilitate further thinking on this question.

For now, we expect that there are good versions of all of the technologies we sketch below — but we’ve flagged potential harms where we’re tracking them, and encourage readers to engage sceptically and with an eye to how things could go badly as well as how they could go well.

Fast facilitation

Right now, coordinating within groups is often complex, expensive, and difficult. Groups often drop the ball on important perspectives or considerations, move too slowly to actually make decisions, or fail to coordinate at all.

AI could make facilitation much faster and cheaper, by processing many individual views in parallel, tracking and surfacing all the relevant factors, providing secure private channels for people to share concerns, and/or providing a neutral arbiter with no stake in the final outcome. It could also make it much more practical to scale facilitation and bring additional people on board without slowing things down too much.

Design sketch

An AI mediation system briefly interviews groups of 3–300 people asynchronously, presents summary positions back to the group, and suggests next steps (including key issues to resolve). People approve or object to the proposal, and the system iterates to a depth appropriate to the importance of the decision.

Under the hood, it does something like:

  • Gathers written context on the setting and decision
  • Holds brief, private conversations with each participant to understand their perspective
  • Builds a map of the issue at hand, involving key considerations and points of (dis)agreement
    • Performs and integrates background research where relevant
  • Identifies which people are most likely to have input that changes the picture
  • Distils down a shareable summary of the map, and seeks feedback from key parties
  • Proposes consensus statements or next steps for approval, iterating quickly to find versions that have as broad a backing as possible
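
To make the loop concrete, here is a minimal sketch in Python. Everything here is an illustrative assumption rather than a description of any existing system: the `llm.interview`, `build_issue_map`, `approves`, and `revise` calls stand in for whatever model interface a real implementation would use, and the 90% approval threshold is arbitrary.

```python
# Minimal sketch of a fast-facilitation loop. All llm.* methods and the
# consensus threshold are hypothetical placeholders, not an existing API.
from dataclasses import dataclass, field

@dataclass
class Proposal:
    text: str
    approvals: set[str] = field(default_factory=set)  # ids who can live with it

def facilitate(participants: list[str], context: str, llm,
               threshold: float = 0.9, max_rounds: int = 5) -> Proposal:
    # Brief, private conversations to understand each perspective.
    views = {p: llm.interview(p, context) for p in participants}
    # Map the issue: key considerations and points of (dis)agreement.
    issue_map = llm.build_issue_map(context, views)
    proposal = Proposal(llm.draft_consensus(issue_map))
    for _ in range(max_rounds):
        # Seek approval or objections from everyone in parallel.
        proposal.approvals = {p for p in participants
                              if llm.approves(p, views[p], proposal.text)}
        if len(proposal.approvals) >= threshold * len(participants):
            return proposal  # broad backing found
        # Revise, prioritising participants whose input changes the picture.
        objectors = [p for p in participants if p not in proposal.approvals]
        proposal = Proposal(llm.revise(proposal.text, issue_map, objectors))
    return proposal  # best effort after max_rounds
```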
Feasibility

Fast facilitation seems fairly feasible technically. The Habermas Machine (2024) does a version of this that provided value to participants — and we have seen two years of progress in LLMs since then. And there are already facilitation services like Chord. In general, LLMs are great at gathering and distilling lots of information, so this should be something they excel at. It’s not clear that current LLMs can already build accurate maps of arbitrary in-motion discourse, but they probably could with the right training and/or scaffolding.

Challenges for the technology include:

  • Ensuring that it’s more efficient and a better user experience for moving towards consensus than other, less AI-based approaches.
  • Remaining robust against abusive user behaviour (e.g. you don’t want individuals to get their way via prompt injection or blatantly lying).

Neither of these seems like a fundamental blocker. For example, to protect against abuse, it may be enough to maintain transparency so that participants can check for manipulation. (Or if users need to enter confidential information, there might be services which can confirm it without revealing it.)

Possible starting points // concrete projects
  • Build a baby version. This could help us notice obstacles or opportunities that would have been hard to predict in advance. You could focus on the UI or the tech side here, or try to help run pilots at specific organisations or in specific settings.
  • Design ways to evaluate fast facilitation tools. This makes it easier to assess and improve on performance. For example, you could create games/test environments with clear “win” and “failure” modes.
  • Build subcomponents. For example:
    • Bots that surface anonymous info.
    • Tools that try to surface areas of consensus or common knowledge as efficiently as possible, while remaining hard to game.
  • Make a meeting prep system. Focus first on getting good at meeting prep — creating an agenda and considerations that need live discussion — to reduce possible unease about outsourcing decision-making to AI systems.
  • Make a bot to facilitate discussions. This could be used in online community fora, or to survey experts.
  • Design ways to create live “maps” of discussions. Fast facilitation is fast because it parallelises communication. This makes it more important to have good tools for maintaining shared context.
Automated negotiation

High-stakes negotiation today involves adversarial communication between humans who have limited bandwidth.

Negotiation in the future could look more like:

  • You communicate your desires openly with a negotiation delegate who is on your side, and who asks questions only when needed to build a deeper model of your preferences.
  • The delegate goes away, and comes back with a proposal that looks pretty good, along with a strategic analysis explaining the tradeoffs / difficulties in getting more.
Design sketch

Humans can engage AI delegates to represent them. The delegates communicate with each other via a neutral third party mediation system, returning to their principals with a proposal, or important interim updates and decision points.

Under the hood, this might look like:

  • Delegate systems:
    • Read over context documents and query principals about key points of uncertainty to build initial models of preferences.
    • Model the negotiation dynamics and choose strategic approaches to maximise value for their principal.
    • Go back to the principal with further detailed queries when something comes up that crosses an importance threshold and where they are insufficiently confident about being able to model the principal’s views faithfully.
    • Are ultimately trained to get good results by the principal’s lights.
  • Neutral mediator system:
    • Is run by a trusted third-party (or in higher stakes situations, perhaps is cryptographically secure with transparent code).
    • Discusses with all parties (either AI delegates, or their principals)
      • Can hear private information without leaking that information to the other party
        • Impossibility theorems mean that it will sometimes be strategically optimal for parties to misrepresent their position to the mediator (unless we give up on the ability to make many actually-good deals); however, we can seek a setup such that it is rarely a good idea to strategically misrepresent information, or that it doesn’t help very much, or that it is hard to identify the circumstances in which it’s better to misrepresent
    • Searches for deals that will be thought well of by all parties, and proposes those to the delegates.
    • Is ultimately trained to help all parties reach fair and desired outcomes, while minimising incentives-to-misrepresent for the parties.
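
As a toy illustration of this delegate/mediator split, here is a sketch in Python. The `Offer` structure, the preference-model interface, and the escalation logic are all assumptions made up for this example, not the design we'd necessarily recommend:

```python
# Sketch of one negotiation round between delegates and a neutral mediator.
# The preference-model interface (score, uncertainty, reservation_value)
# is a hypothetical stand-in for a system trained on the principal.
from dataclasses import dataclass

@dataclass
class Offer:
    terms: dict[str, float]  # e.g. {"price": 120.0, "delivery_days": 14.0}

class Delegate:
    def __init__(self, preference_model, escalation_threshold: float = 0.3):
        self.prefs = preference_model
        self.escalate_above = escalation_threshold

    def respond(self, offer: Offer) -> str:
        # Go back to the principal when the stakes outrun the model's
        # confidence that it represents the principal faithfully.
        if self.prefs.uncertainty(offer) > self.escalate_above:
            return "ESCALATE"
        score = self.prefs.score(offer)  # value by the principal's lights
        return "ACCEPT" if score >= self.prefs.reservation_value else "REJECT"

def mediate(delegates: list[Delegate], candidates: list[Offer]) -> Offer | None:
    """The mediator hears each delegate's private response without revealing
    it to the others, and proposes the first offer acceptable to all."""
    for offer in candidates:
        if all(d.respond(offer) == "ACCEPT" for d in delegates):
            return offer
    return None  # no zone of agreement among these candidates
```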
Feasibility

Some of the technical challenges to automated negotiation are quite hard:

  • The kind of security needed for high-stakes applications isn’t possible today.
  • Getting systems to be deeply aligned with a principal’s best interests, rather than e.g. pursuing the principal’s short-term gratification via sycophancy, is an unsolved problem.

That said, it’s already possible to experiment using current systems, and it may not be long before they start improving on the status quo for human negotiation. Low-stakes applications don’t require the same level of security, and will be a great training ground for how to set up higher stakes systems and platforms. And practical alignment seems good enough for many purposes today.

Possible starting points // concrete projects
  • Build an AI delegate for yourself or your friends. See if you can get it to usefully negotiate on your behalf with your friends or colleagues. Or failing that, if it can support you to think through your own negotiation position before you need to communicate with others about it.
  • Build a negotiation app with good UI. Building on existing LLMs, build an app which helps people think through their negotiation position in a structured way. Focus on great UI.
    • This could be non-interactive at first, and just involve communication between a human and the app, rather than between any AI systems.
    • But it builds the muscles of a) designing good UI for AI negotiation, and b) people actually using AI to help them with negotiation.
  • Run a pilot in an org or community you’re part of.
    • You could start with fairly low-stakes negotiations, like what temperature to set the office thermostat to, or which topics to discuss in a given meeting slot.
    • Experimenting with different styles of negotiation (in terms of how high the stakes are, how complex the structure is, and what the domain is) could be very valuable.
Arbitrarily easy arbitration

Right now, the risk of expensive arbitration makes many deals unreachable. If disputes could be resolved cheaply and quickly using verifiably fair and neutral automated adjudicators, this could unlock massive coordination potential, enabling a multitude of cooperative arrangements that were previously prohibitively costly to make.

Design sketch

An “Arb-as-a-Service” layer plugs into contracts, platforms, and marketplaces. Parties opt in to standard clauses that route disputes to neutral AI adjudicators with a well-deserved reputation for fairness. In the event of a dispute, the adjudicator communicates with parties across private, verifiable evidence channels, investigating further as necessary when there are disagreements about facts. Where possible, they auto-execute remedies (escrow releases, penalties, or structured commitments). Human appeal exists but is rarely needed; sampling audits keep the system honest. Over time, this becomes ambient infrastructure for coordination and governance, not just commerce.

How this could work under the hood:

  1. Agreement ingestion
    • Formal or natural language contracts are parsed and key terms extracted, with parties confirming the system’s interpretation before proceeding.
    • The system could also suggest pre-dispute modifications to make agreements clearer, flag potentially unenforceable terms, and maintain public precedent databases that help parties understand likely outcomes before committing.
  2. Automated discovery
    • When disputes arise, an automated discovery process gathers relevant documentation, transaction logs, and communications from integrated platforms.
    • The system offers interviews and the chance to submit further evidence to each party.
  3. Deep consideration
    • The system builds models of what different viewpoints (e.g. standard legal precedent; commonsense morality; each of the relevant parties) have to say on the situation and possible resolutions, to ensure that it is in touch with all major perspectives.
    • Where there are disagreements, the system simulates debate between reasonable perspectives.
    • It makes an overall judgement as to what is fairest.
  4. Transparent reasoning
    • The system produces detailed explanations of its conclusions, with precedent citations and counterfactual analysis where appropriate.
  5. (Optional) Smart escrow integration
    • Judgements automatically execute through cryptocurrency escrows or traditional payment rails, with graduated penalties for non-compliance.
    • In cases where the system detects evidence that is highly likely to be fraudulent, or other attempts to manipulate the system, it automatically adds a small sanction to the judgement, in order to disincentivise this behaviour.
  6. Opportunities for appeal
    • Either party can pay a small fee to submit further evidence and have the situation re-considered in more depth by an automated system.
    • For larger fees they can have human auditors involved; in the limit they can bring things to the courts.
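
A skeleton of this six-stage pipeline might look like the following Python sketch. Every method on the hypothetical `adjudicator` object is a placeholder for a component that would need real design work; none of it refers to an existing service:

```python
# Skeleton of the arbitration pipeline described in steps 1-6 above.
# All adjudicator.* calls are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class Case:
    contract_text: str
    parties: list[str]
    evidence: list[str] = field(default_factory=list)
    judgement: str | None = None

def arbitrate(case: Case, adjudicator) -> Case:
    # 1. Agreement ingestion: parse terms, confirm interpretation with parties.
    terms = adjudicator.parse_terms(case.contract_text)
    adjudicator.confirm_with_parties(terms, case.parties)
    # 2. Automated discovery: pull logs, invite further submissions.
    case.evidence += adjudicator.gather_evidence(case.parties)
    # 3. Deep consideration: model major perspectives, simulate debate.
    perspectives = [adjudicator.model_view(v, case) for v in
                    ("legal_precedent", "commonsense_morality", *case.parties)]
    verdict = adjudicator.judge(terms, case.evidence, perspectives)
    # 4. Transparent reasoning: attach explanation and citations.
    case.judgement = adjudicator.explain(verdict)
    # 5. Optional smart escrow, with a small sanction if manipulation detected.
    adjudicator.execute_remedies(verdict,
                                 sanction=adjudicator.detected_fraud(case))
    return case  # 6. Appeal: parties may pay a fee to rerun in more depth
```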
Feasibility

LLMs can already do basic versions of steps 1–4, but there are difficult open technical problems in this space:

  • Judgement: Systems may not currently have good enough judgement to perform steps 1, 3, and 4 in high-stakes contexts (and until recently, they clearly didn't).
  • Real-world evidence assessment: Systems don’t currently know how to handle conflicting evidence provided digitally about what happened in the real world.
  • Verifiable fairness/neutrality: The full version of this technology would require a level of fairness and neutrality which isn’t attainable today.

Those are large technical challenges, but we think it’s still useful to get started on this technology today, because iterating on less advanced versions of arbitration tech could help us to bootstrap our way to solutions. Particularly promising ways of doing that include:

  • Starting in lower-stakes or easier contexts (for example, digital-only spaces avoid the challenge of establishing provenance for real-world evidence).
  • Creating evals, test environments and other infrastructure that helps us improve performance.

On the adoption side, we think there are two major challenges:

  • Trust: As above, some amount of technical work is needed to make systems verifiably fair/neutral. But even if it becomes true that the systems are neutral, people need to build quite a high level of confidence that the system is genuinely impartial before they’ll bind themselves to its decisions for meaningful stakes.
  • Legal integration: This tech is only useful to the extent that its arbitration decisions are recognised and enforced as legitimate by the traditional legal system, or are enshrined directly via contract in a self-enforcing way.
    • (We are unsure how large a challenge this will be; perhaps you can write contracts today that are treated by the courts as robust. But it may be hard for parties to place much trust in them before they have been tested.)

Both of these challenges are reasons to start early (as there might be a long lead time), and to make work on arbitration tech transparent (to help build trust).

Possible starting points // concrete projects
  • Work with an arbitration firm. Work with (or buy) a firm already offering arbitration services to start automating parts of their central work, and scale up from there.
  • Work with an online platform that handles arbitration. Use AI to improve their processes, and scale from there.
  • Create a bot to settle informal disputes. Build an arbitration-as-a-service bot that people can use to settle informal disputes.
  • Trial a system on internal disputes. This could be at your own organisation, another organisation, or a coalition of early adopter organisations.
  • Run a pilot in parallel to regular arbitration. Run a pilot where an automated arbitration system is given access to all the relevant information to resolve disputes, and reaches its own conclusions — in parallel to the regular arbitration process, which forms the basis of the actual decision. You could partner with an arbitration firm, or potentially do this through a coalition of early adopter organisations, perhaps in combination with philanthropic funding.
Background networking

We can only do things like collaborate, trade, or reconcile if we’re able to first find and recognise each other as potential counterparties. Today, people are brought into contact with each other through things like advertising, networking, even blogging. But these mechanisms are slow and noisy, so many people remain isolated or disaffected, and potentially huge wins from coordination are left undiscovered.[3]

Tech could bring much more effective matchmaking within reach. Personalised, context-sensitive AI assistance could carry out orders of magnitude more speculative matchmaking and networking. If this goes well, it might uncover many more opportunities for people to share and act on their common hopes and concerns.

Design sketch

A ‘matchmaking marketplace’ of attentive, personalised helpers bustles in the background. When they find especially promising potential connections, they send notifications to the principals or even plug into further tools that automatically take the first steps towards seriously exploring the connection.

You can sign up as an individual or an existing collective. If you just want to use it passively, you give a delegate system access to your social media posts, search profiles, chatbot history, etc. — so this can be securely distilled into an up-to-date representation of hopes, intent, and capabilities. The more proactive option is to inject deliberate ‘wishes’ through chat and other fluent interfaces.

Under the hood, there are a few different components working together:

  • Interoperable, secure ‘wish profiling’ systems which identify what different participants want.
    • People connect their profiles on existing services (social media, chatbot logs, email, etc).
    • LLM-driven synthesis (perhaps combined with other forms of machine learning) curates a private profile of user desires.
    • Optionally, chatbot-style assistance can interview users on the points of biggest uncertainty, to build a more accurate profile.
  • A searchable ‘wish registry’ which organises large collections of wants and offers, while maintaining semi-privacy.
    • Delegates acting on each user's behalf can run searches, finding potential matches and surfacing only enough information about them to know whether they are worth exploring further.
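
One way the registry search could work, purely as an assumption for illustration, is embedding-based matching that reveals only a short teaser about each candidate match until both sides opt in:

```python
# Sketch of semi-private wish matching. The embedding function below is a
# random stand-in; a real system would use a sentence-embedding model, and
# the similarity threshold and teaser length are arbitrary choices.
import numpy as np

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(256)  # placeholder vector for the sketch

def find_matches(my_wish: str, registry: dict[str, str],
                 threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return (user_id, teaser) pairs; full profiles stay private until
    both parties opt in to explore the connection further."""
    query = embed(my_wish)
    matches = []
    for user_id, wish in registry.items():
        vec = embed(wish)
        similarity = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if similarity >= threshold:
            teaser = wish[:60] + "..."  # just enough to gauge interest
            matches.append((user_id, teaser))
    return matches
```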
Feasibility

A big challenge here is privacy and surveillance. Doing background networking comprehensively requires sensitive data on what individuals really want. This creates a double-edged problem:

  • If sensitive data is too broadly available, it can be used for surveillance, harassment, or exploitation, including by big corporations or states.
  • If sensitive data is completely private, it opens up the possibility of collusion, for example among criminals.

This is a pretty challenging trade-off, with big costs on both sides. Perhaps some kind of filtering system which determines who can see which bits of data could be used to prevent data extraction for surveillance purposes while maintaining enough transparency to prevent collusion.

Ultimately, we’re not sure how best to approach this problem. But we think that it’s important that people think more about this, as we expect that by default, this sort of technology will be built anyway in a way that isn’t sufficiently sensitive to these privacy and surveillance issues. Early work which foregrounds solutions to these issues could make a big difference.

Other potential issues seem easier to resolve:

  • Technically, background networking tools already seem within reach using current systems. Large-scale deployments would require indexing and registry infrastructure, but it seems possible to get started on these today.
    • One note is that it seems possible to implement background networking in either a centralised or a decentralised way. It’s not clear which is best, though decentralised implementations will be more portable.
  • Adoption also seems likely to work, because there are incentives for people to pay to discover trade and cooperation opportunities they would otherwise have missed, analogous to exchange or brokerage fees. There are some trickier parts, but we expect them to ultimately be surmountable (though the timing of adoption is more uncertain than whether it happens at all):
    • In the early stages when not many people are using it, the value of background networking will be more limited. Possible responses include targeting smaller niches initially, and proactively seeking out additional network beneficiaries.
    • It’s harder to incentivise people to pay for speculative things like uncovering groups they’d love that don’t yet exist. You could get around this using entrepreneurial or philanthropic speculation (compare the dominant assurance contract model and related payment incentivisation schemes).
Possible starting points // concrete projects
  • Work with existing matchmakers to improve their offering. Find groups that are already doing matchmaking and are eager for better systems — perhaps among community organisers, businesses, recruiters or investors. Work with them to understand the pain points in their current networking, and what automated offerings would be most appealing. Then build those tools and systems.
  • Build a networking tool for a specific community. Build a custom networking system for a particular group or subculture. For example, this could look like a networking app or a plug-in to an existing online forum. This could start delivering value fairly quickly, and provide a good opportunity for iteration.
Structured transparency for democratic oversight

Today, citizens in democracies have limited mechanisms to verify whether institutions’ public claims are consistent with their internal evidence:

  • The baseline is highly opaque.
  • Freedom of information systems help, but can be evaded by non-cooperating institutions.
  • Public inquiries can be reasonably thorough, but are expensive and slow.
  • Full transparency has many costs and is typically highly resisted.

This is costly — e.g. the UK Post Office scandal over its Horizon IT system led to hundreds of wrongful prosecutions that could have been avoided. And it creates bad incentives for those running the institutions.

AI has the potential to change this. Instead of oversight being expensive, reactive, and slow, automated systems could in theory have real-time but sandboxed access to institutional data, routinely reviewing operational records against public claims and surfacing inconsistencies as they emerge.

Where confidential monitoring helps willing parties verify each other, structured transparency for democratic oversight aims to hold institutions accountable to the broader public.[4]

Design sketch

When an oversight body wants to verify facts about the behaviour of another institution, it requests comprehensive data about the internal operations of that institution. AI systems are tasked with careful analysis of the details, flagging the type and severity of any potential irregularities. Most of the data never needs human review.

In the simpler version, this is just a tool which expands the capacity of existing oversight bodies. Even here, the capacity expansion could be relatively dramatic — this kind of semi-structured data analysis is the kind of work that AI models can excel at today — without needing to trust that the systems are infallible (since the most important irregularities will still have human review).

A more ambitious version treats this as a novel architecture for oversight. AI systems operate continuously within secure environments that don’t give any humans access to the full dataset. They can flag inconsistencies as institutional data is deposited rather than waiting for an investigation to begin. For maximal transparency, summaries could be made available to the public in real-time, without revealing any confidential information that the public does not have rights to.

Under the hood, this might involve:

  • Secure data repositories, such that institutions routinely share operational data with a sandboxed environment operated by or on behalf of the oversight body, without any regular human access to the data.
  • Continuous ingestion and indexing of institutional public outputs (press releases, regulatory filings, budget documents, etc.) into a searchable database.
  • Automated cross-referencing between public claims and internal records.
  • Highlighting of potential issues (mismatches between public statements and private information, as well as decisions made in violation of normal procedures).
  • Further automated investigation of potential issues, escalating to human review when sufficiently serious issues are identified with sufficient confidence.
  • Importantly, the sandbox outputs its findings but not the underlying data; if there is need for transparency on that, this is a separate oversight question.
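
A minimal sketch of the cross-referencing step, with the key property that only findings (never the underlying records) leave the sandbox. The `checker` interface and the confidence floor are assumptions for illustration:

```python
# Sketch of claim-vs-records cross-referencing inside the sandbox.
# All checker.* calls are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Finding:
    claim: str
    severity: str  # e.g. "low" / "medium" / "high"
    summary: str   # finding text only; raw records never leave the sandbox

def cross_reference(public_claims: list[str], internal_records: list[str],
                    checker, confidence_floor: float = 0.9) -> list[Finding]:
    findings = []
    for claim in public_claims:
        relevant = checker.retrieve(claim, internal_records)  # sandboxed search
        verdict = checker.assess(claim, relevant)  # consistent with records?
        if verdict.inconsistent and verdict.confidence >= confidence_floor:
            findings.append(Finding(claim, verdict.severity,
                                    checker.summarise(verdict)))
    return findings  # only these summaries exit the sandboxed environment
```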
Feasibility

There are two important aspects to feasibility here: technical and political.

Technically, decent reliability at the core functionality is possible today. Getting to extremely high reliability, so that the system could be trusted not to flag too many false positives across very large amounts of data, might be a reach with present systems, but it is exactly the kind of capability that commercial companies should be incentivised to solve for business use.

Political feasibility may vary a lot with the degree of ambition. The simplest versions of this technology might in many cases simply be adopted by existing oversight bodies to speed up their current work. Anything which requires them getting much more data (e.g. to put in the sandboxed environments) might require legislative change — which may be more achievable after the underlying technology can be shown to be highly reliable.

Challenges include:

  • Adversarial dynamics: the technical bar to verify claims against actively adversarial institutions (who are manipulating deposited data, potentially via AI) is substantially higher.
    • This is the bar that we’d need to reach for confidential monitoring below.
  • Defamation risk: the downsides of false positives, where your system reports someone misrepresenting things when they were not, could be significant (although this can perhaps be mitigated by giving people a right-of-rebuttal where they provide further data to the AI systems which monitor the confidential data streams).
  • Avoiding abuse: designing the systems so that they do not expose the confidential data, and cannot be weaponised to ruin the reputation of a department that has perfectly normal levels of error.

Ultimately the more transformative potential from this technology comes in the medium-term, with new continuous data access for oversight bodies. But this is likely to require legislative change, and the institutions subject to it may resist. Perhaps the most promising adoption pathway is to demonstrate value through voluntary pilots with oversight bodies that already have data access and want better tools. This could build the evidence base (and hence political constituency) for wider and deeper deployment.

Possible starting points // concrete projects
  • Retrospective validation on historical cases. Apply consistency-checking tools to document sets from well-understood historical cases where the relevant internal documents have subsequently been released (e.g. Enron emails). This builds the technical foundation, and demonstrates the concept without requiring any current institutional access.
  • Institutional public statement reliability tracker. Build a tool tracking whether agencies’ public claims about performance, spending, or policy outcomes are consistent with publicly available data — statistical releases, budget documents, prior statements. Start with a single policy domain. This requires no institutional partnerships and builds a public constituency for structured transparency. This is a version of reliability tracking, applied specifically to institutional accountability.
  • Pilot a FOIA exemption assessment tool. Partner with an Inspector General office to build a tool that reviews withheld documents and assesses whether claimed exemptions (national security, personal privacy, deliberative process) are applied appropriately. The IG already has legal access under the Inspector General Act; the tool helps them do their existing job faster and builds the working relationship needed for more ambitious deployments. This is also a natural testbed for the sandboxed architecture in miniature — the tool operates within the IG’s secure environment, producing exemption-appropriateness findings without the documents themselves leaving the system.
Confidential monitoring and verification

Monitoring and verifying that a counterparty is keeping up their side of the deal is currently expensive and noisy. Many deals currently aren’t reachable because they’re too hard to monitor. Confidential AI-enabled monitoring and verification could unlock many more agreements, especially in high-stakes contexts like international coordination where monitoring is currently a bottleneck.

Design sketch

When organisation A wants to make credible attestations about their work to organisation B, without disclosing all of their confidential information, they can mutually contract an AI auditor, specifying questions for it to answer. The auditor will review all of A’s data (making requests to see things that seem important and potentially missing), and then produce a report detailing:

  • Its conclusions about the specified questions.
  • The degree to which it is satisfied that it had good data access, that it didn’t run into attempts to distort its conclusions, etc.

This report is shared with A and B, then A’s data is deleted from the auditor’s servers.

Under the hood, this might involve:

  • Building a Bayesian knowledge graph, establishing hypotheses, and understanding what evidence suggests about those hypotheses.
  • Agentic investigatory probes into the confidential data, in order to form grounded assessments on the specified questions.
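
As a toy version of the hypothesis-tracking idea, here is a single-hypothesis Bayesian update in Python: the auditor starts agnostic about compliance and updates as each investigatory probe returns. The likelihood numbers are invented for illustration:

```python
# Toy Bayesian tracking of the hypothesis "the counterparty is compliant".
# The probe likelihoods below are made-up numbers for illustration.

def update(prior: float, p_evidence_if_compliant: float,
           p_evidence_if_not: float) -> float:
    """One Bayes step: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * (p_evidence_if_compliant / p_evidence_if_not)
    return posterior_odds / (1 + posterior_odds)

belief = 0.5  # start agnostic about compliance
probes = [
    (0.9, 0.3),  # records well organised and internally consistent
    (0.8, 0.5),  # employee interviews line up with the logs
    (0.7, 0.6),  # one minor discrepancy, plausibly benign
]
for p_if_compliant, p_if_not in probes:
    belief = update(belief, p_if_compliant, p_if_not)

print(f"P(compliant | evidence) = {belief:.2f}")  # about 0.85 here
```

A full knowledge graph would track many such hypotheses and the dependencies between them, but the update logic per edge is of this flavour.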

More ambitious versions might hope to obviate the need for trust in a third party, and provide reasons to trust the hardware — that it really is running the appropriate unbiased algorithms, that it cannot send side-channel information or retain the data, etc. Perhaps at some point you could have robot inspectors physically visiting A’s offices, interviewing employees, etc.

Feasibility

Compared to some of the other technologies we discuss, this feels technologically difficult — the really useful versions of the tech may require very high reliability of certain kinds.

Nonetheless, we could hope to lay the groundwork for the general technological category now, so that people are well-positioned to move towards implementing the mature technology as early as is viable. Some low-confidence guesses about possible early applications include:

  • Legal audits — for example, claims that the documents not disclosed during a discovery process are only those which are protected by privilege.
  • Financial audits — e.g. for the purpose of proving viability to investors without disclosing detailed accounts.
  • Supply chain verification — e.g. demonstrating that products were ethically sourced without exposing the suppliers.
Possible starting points // concrete projects
  • Start building prototypes. Build a system which can try to detect whether it is operating in a real or counterfeit environment, and measure its success rate.
  • Work with a law or financial auditing firm. Work with (or buy) a firm that does this kind of work, and experiment with how to robustly automate while retaining very high levels of trustworthiness.
  • Explore the viability of complementary technology. For example, you could investigate the feasibility of demonstrating exactly what code is running on a particular physical computer that is in the room with both parties.
Cross-cutting thoughts

Some cross-cutting technologies

We’ve pulled out some specific technologies, but there’s a whole infrastructure that could eventually be needed to support coordination (including but not limited to the specific technologies we’ve sketched above). Some cross-cutting projects which seem worth highlighting are:

AI delegates and preference elicitation

Many of the technologies we sketched above either benefit from or require agentic AI delegates who can represent and act for a human principal. Developing customisable platforms could be useful for multiple kinds of tech, like background networking, fast facilitation, and automated negotiation.

Some ways to get started:

  • Direct preference elicitation: develop efficient and appealing interview-style elicitation of values, wishes, preferences and asks.
  • Passive data ingestion: build a tool that (consensually) ingests and distils all the available online content about a person — social media, browsing history, email, etc — and extracts principles from it (cf inverse constitutional AI).
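
For the interview-style elicitation in the first bullet, one plausible loop, using a hypothetical `elicitor` interface invented for this sketch, is to keep asking about whatever the current preference profile is least certain about:

```python
# Sketch of uncertainty-targeted preference elicitation. The elicitor
# interface (init_profile, most_uncertain_topic, etc.) is hypothetical.

def elicit_preferences(elicitor, principal, max_questions: int = 10,
                       confidence_target: float = 0.9):
    profile = elicitor.init_profile(principal.context_documents)
    for _ in range(max_questions):
        topic, confidence = elicitor.most_uncertain_topic(profile)
        if confidence >= confidence_target:
            break  # even the weakest point is now well understood
        answer = principal.ask(elicitor.draft_question(topic))
        profile = elicitor.update(profile, topic, answer)
    return profile
```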

One clarification is that though agentic AI delegates would be useful for some of the coordination tech above, it needn’t be the same delegate doing the whole lot for a single human:

  • You could have different delegates for different applications.
  • Some delegates might represent groups or coalitions.
  • Some delegates could be short-lived, and spun up for some particular time-bounded purpose.
Charter tech

A lot of coordination effort between people and organisations goes not into making better object-level decisions, but establishing the rules or norms for future coordination — e.g. votes on changing the rules of an institution. It is possible that coordination tech will change this basic pattern, but as a baseline we assume that it will not. In that case, making such meta-level coordination go well would also be valuable.

One way to help it go well is by making the governance dynamics more transparent. Voting procedures, organisational charters, platform policies, treaty provisions, etc. create incentives and equilibria that play out over time, often in ways the framers didn’t anticipate. Let’s call any technology which helps people to better understand governance dynamics, or to make those dynamics more transparent, ‘charter tech’. In some sense this is a form of epistemic tech; but as the applications are always about coordination, we have chosen to group it with other coordination technologies. We think charter tech could be important in two ways:

  1. Through directly improving the governance dynamics in question, helping to avoid capture, conflict, and lock-in.
  2. Through compounding effects on future coordination, which will unfold in the context of whatever governance structures are in place.

Charter tech could be used in a way that is complementary to any of the above technologies (if/when they are used for governance-setting purposes), although can also stand alone.

For the sake of concreteness, here is a sketch of what charter tech could look like:

  • A “governance dynamics analyser” that ingests descriptions of constitutions, charters, policies or community norms, builds models of power, incentives, and information flow, and then (a) forecasts likely equilibria and failure modes, (b) red-teams for strategic abuse,[5] and (c) proposes safer rule variants that preserve the framers’ intent.[6]
  • While this tool can be called actively if needed, there is also a classifier running quietly in the background of organisational docs/emails, and when it detects a situation where power dynamics and governance rules are relevant, it runs an assessment — promoting this to the user's attention only in cases where the proposed rules are likely to be problematic.
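
To make the analyser slightly more concrete, here is a sketch of a typed governance graph checked against a tiny pattern library. The graph fields and the three failure-mode checks are illustrative assumptions; a real pattern library would be far larger (see footnote 6 for a fuller breakdown):

```python
# Sketch of a 'governance dynamics analyser' core: a typed governance graph
# scanned for a few classic failure modes. All fields and patterns are
# illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class GovernanceGraph:
    roles: dict[str, set[str]]             # role -> permissions
    decision_thresholds: dict[str, float]  # decision -> vote fraction needed
    appointers: dict[str, str] = field(default_factory=dict)  # role -> appointer

def scan_for_failure_modes(g: GovernanceGraph) -> list[str]:
    warnings = []
    # Agenda control: one role both sets the agenda and chairs the vote.
    for role, perms in g.roles.items():
        if {"set_agenda", "chair_vote"} <= perms:
            warnings.append(f"agenda control: '{role}' sets agenda and chairs votes")
    # Entrenchment: a role that appoints itself.
    for role, appointer in g.appointers.items():
        if appointer == role:
            warnings.append(f"entrenchment: '{role}' appoints itself")
    # Lock-in: amendments need near-unanimity, freezing bad equilibria.
    if g.decision_thresholds.get("amend_charter", 0.0) > 0.9:
        warnings.append("lock-in: charter amendments need near-unanimity")
    return warnings
```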

Note that charter tech could be used to cause harm if access isn’t widely distributed. Vulnerabilities can be exploited as well as patched, and a tool that makes it easier to identify governance vulnerabilities could be used to facilitate corporate capture, backsliding or coups. Provided the technology is widely distributed and transparent, we think that charter tech could still be very beneficial — particularly as there may be many high-stakes governance decisions to make in a short period during an intelligence explosion, and the alternative of ‘do our best without automated help’ seems pretty non-robust.

Some ways to get started on using AI to make governance dynamics more transparent:

  • Work with communities that iterate frequently on governance (DAOs, open-source projects) to test analyses against what actually happens when rules change.
  • Compile a pattern library of governance failures and successes, documented in enough detail to inform automated analysis.
  • Build simulation environments where proposed rules can be stress-tested against populations of agents with varying goals, including adversarial ones.
  • Partner with mechanism design researchers to identify which aspects of their formal analysis can be automated and applied to less formal real-world documents.
Adoption pathways

Many of these technologies will be directly incentivised economically. There are clear commercial incentives to adopt faster, cheaper methods of facilitation, negotiation, arbitration, and networking.

However, adoption seems more challenging in two important cases:

  • Adoption by governments and broader society. Many of the most important benefits of coordination tech for society will come from government and broad social adoption, but these groups will be less impacted by commercial incentives. This bites particularly hard for technologies that could be quite expensive in terms of inference compute, like fast facilitation, arbitration and negotiation. By default, these technologies might differentially help wealthy actors, leaving complex societal-level coordination behind. We think that the big levers on this set of challenges are:
    • Building trust and legitimacy earlier, by getting started sooner, building transparently, and investing in evals and other infrastructure to demonstrate performance.
    • Targeting important niches that might be slower to adopt by default. More research would be good here, but two niches that seem potentially important are:
      • Coordination among and between very large groups, like whole societies. This might be both strategically important and lag behind by default.
      • International diplomacy. Probably coordination tech will get adopted more slowly in diplomacy than in business, but there might be very high stakes applications there.
  • Adoption of confidential monitoring and structured transparency. These technologies are less accessible with current models and may require large upfront investments, while many of the benefits are broadly distributed.
    • This makes it less likely that commercial incentives alone will be enough, and makes philanthropic and government funding more desirable.
Other challenges

The big challenge is that coordination tech (especially confidential coordination tech) is dual use, and could empower bad actors as much as or more than good ones.

There are a few ways that coordination tech could lead to shifts in the balance of power (positive or negative):

  • Some actors could get earlier and/or better access to coordination tech than others.[7]
  • Actors that face particular barriers to coordination today could be asymmetrically unblocked by coordination tech.
  • Individuals and small groups could become more powerful relative to the coordination mechanisms we already have, like organisations, ideologies, and nation states.

It’s inherently pretty tricky to determine whether these power shifts would be good or bad overall, because that depends on:

  • Value judgements about which actors should hold power.
  • How contingent power dynamics play out.
  • Big questions like whether ideologies or states are better or worse than the alternatives.
  • Predictions about how social dynamics will equilibrate in an AI era that looks very different to our world.

However, as we said above, it’s clear that coordination tech might have significant harmful effects, through enabling:

  • Large corporations to collude with each other against the interests of the rest of society.[8]
  • A small group of actors to plot a coup.
  • More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don’t incentivise prosociality to the same extent.

We don’t think that this challenge is insurmountable, though it is serious, for a few reasons:

  • The upsides are very large. Coordination tech might be close to necessary for safely navigating challenges like the development of AGI, and could empower actors to coordinate against the kinds of misuse listed above.
  • The counterfactual is that coordination tech is developed anyway, but with less consideration of the risks and less broad deployment. We think that this set of technologies is going to be sufficiently useful that it’s close to inevitable that they get developed at some point. By engaging early with this space, we can have a bigger impact on a) which versions of the technology are developed, b) how seriously the downsides are taken by default, c) how soon these systems are deployed broadly.
  • Some applications seem robustly good. For example, the potential for misuse is low for technologies like transparent facilitation or widely deployed charter tech. More generally, we expect that projects that are thoughtfully and sensitively run will be able to choose directions which are robustly beneficial.

That said, we think this is an open question, and would be very keen to see more analysis of the possible harms and benefits of different kinds of coordination tech, and which versions (if any) are robustly good.

This article has gone through several rounds of development, and we experimented with getting AI assistance at various points in the preparation of this piece. We would like to thank Anthony Aguirre, Alex Bleakley, Max Dalton, Max Daniel, Raymond Douglas, Owain Evans, Kathleen Finlinson, Lukas Finnveden, Ben Goldhaber, Ozzie Gooen, Hilary Greaves, Oliver Habryka, Isabel Juniewicz, Will MacAskill, Julian Michael, Justis Mills, Fin Moorhouse, Andreas Stuhmüller, Stefan Torges, Deger Turan, Jonas Vollmer, and Linchuan Zhang for their input; and to apologise to anyone we’ve forgotten.

This article was created by Forethought. Read the original on our website.

  1. ^

    We’re highlighting six particular technologies, and clustering them all as ‘coordination technologies’. Of course in reality some of the technologies (and clusters) blur into each other, and they’re just examples in a high-dimensional possibility space, which might include even better options. But we hope by being concrete we can help more people to start seriously thinking about the possibilities.

  2. ^

    For example, in a similar way to that described in the intelligence curse.

  3. ^

    Meanwhile small cliques with clear interests often have an easier time identifying and therefore acting on their shared interests — in extreme cases resulting in harmful cartels, oligarchies, and so on. That’s also why tyrants throughout history have sought to limit people’s networking power.

  4. ^

    Both confidential monitoring and what we are calling structured transparency for democratic oversight are aspects of structured transparency in the way that Drexler uses the term.

  5. ^

    This red-teaming could be arbitrarily elaborate, from simple LM-based once-over screening to RAG-augmented lengthy analysis to expansive simulation-based probing and stress-testing.

  6. ^

    Under the hood, this might involve:

    1. Parsing & modelling the rules
      • Convert informal descriptions or formal rules into a typed governance graph: roles, permissions, decision thresholds, delegation, auditability, and recourse
      • Note uncertainties; seek clarification or highlight ambiguities
    2. A search for possible issues
      • Pattern library of classic failure modes (agenda control, principal–agent issues, collusion, etc.)
        • Assessment of potential vulnerability to the different failure modes
    3. First-principles analysis
      • Running direct searches for abuse, or multi-agent simulations (including some nefarious actors) to stress-test the proposed system
    4. Explainer
      • Distilling down the output of the analysis into a few key points
        • Providing auditable evidence where relevant
      • Including points about how variations of the mechanism might make things better or worse
  7. ^

    Note that this is significantly a question about adoption pathways as discussed in the previous section, rather than an independent question.

  8. ^

    For example, in a similar way to that described in the intelligence curse.




[OpenAI] Industrial policy for the Intelligence Age

LessWrong.com News - April 6, 2026 - 17:18

As we move toward superintelligence, incremental policy updates won't be enough. To kick-start this much-needed conversation, OpenAI is offering a slate of people-first policy ideas designed to expand opportunity, share prosperity, and build resilient institutions—ensuring that advanced AI benefits everyone.

These ideas are ambitious, but intentionally early and exploratory. We offer them not as a comprehensive or final set of recommendations, but as a starting point for discussion that we invite others to build on, refine, challenge, or choose among through the democratic process. To help sustain momentum, OpenAI is:

  1. welcoming and organizing feedback through newindustrialpolicy@openai.com
  2. establishing a pilot program of fellowships and focused research grants of up to $100,000 and up to $1 million in API credits for work that builds on these and related policy ideas
  3. convening discussions at our new OpenAI Workshop opening in May in Washington, DC.

Read the full ideas document here.




A Black-Box Procedure for LLM Confidence in Critical Applications

LessWrong.com News - April 6, 2026 - 16:47
Introduction

As an engineering leader integrating AI into my workflow, I've become increasingly focused on how to use LLMs in critical applications. Today's frontier models are generally very accurate, but they are also inconsistently overconfident. A model that is 90% confident in an answer that is 30% wrong can be catastrophic. In applications such as aerospace engineering, we need very high accuracy, but more importantly we need confidence calibration. A model's self-confidence must match its accuracy. Just like a good engineer, it must know when it's likely wrong.

At the end of 2025 I wrote a post titled A Risk-Informed Framework for AI Use in Critical Applications with some ideas on how to better understand this calibration, or model anchoring. This post is a follow-up investigating those ideas and developing a black-box procedure for improving our understanding of LLM accuracy. Using 320 queries spanning 8 topics across a wide range of internet coverage, I performed 4 independent question/answer runs on 3 different LLM models, and a surprisingly simple procedure emerged:

  • First, check available training density for a given topic via Google search result count.
  • Next, repeat the question across independent sessions to quantify answer stability.
  • Finally, ask related questions with web search off to identify topics outside of training.

The resulting stability-accuracy relationship for the small dataset in this investigation predicts accuracy to within 2% (and to within less than 0.5% on average across all 4 runs). Note that this is exploratory work only and should be treated as hypothesis-generating, not hypothesis-confirming, but the practical implications for anyone using LLMs in critical applications are worth considering.
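
To illustrate the stability measurement in the second step, here is a minimal sketch. Using the coefficient of variation as the stability metric is my reading of the procedure, and the answer values are hypothetical; the actual mapping from stability to predicted accuracy would be fit from calibration data such as the 320-query dataset:

```python
# Sketch of step 2: quantify answer stability across independent sessions.
# The example answers are hypothetical values, not data from the study.
import statistics

def stability(answers: list[float]) -> float:
    """Relative dispersion of repeated numeric answers (lower = more stable)."""
    return statistics.stdev(answers) / abs(statistics.mean(answers))

# Five independent sessions answering the same playing-time question.
answers = [1030.0, 1015.0, 1032.0, 1028.0, 1024.0]
print(f"coefficient of variation: {stability(answers):.3%}")  # tight spread
```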

Background

Since writing my original post, I’ve received some excellent feedback from friends, colleagues and yes, an AI research assistant. The first piece of feedback I received is that Frontier Labs would be very unlikely to share detailed information on their training data. Indeed, it seems this information is increasingly held close. The 2025 Stanford Foundation Model Transparency Index found transparency is declining, and information on training data is becoming increasingly opaque across the industry.

I was also made aware of several existing studies that generally support the core assertions in my original post: LLM accuracy does depend on training density and topic proximity, and can be estimated by observing answer consistency. Kirchenbauer et al., "LMD3: Language Model Data Density Dependence" (arXiv:2405.06331, 2024) show that training data density estimates reliably predict variance in accuracy. Kandpal et al., "Large Language Models Struggle to Learn Long-Tail Knowledge" (arXiv:2211.08411, 2023) demonstrate that accuracy degrades as distance from well-represented training regions increases. Xiao et al., "The Consistency Hypothesis in Uncertainty Quantification for Large Language Models" (arXiv:2506.21849, 2025) show that answer consistency predicts accuracy in LLMs, formalized as the 'consistency hypothesis'. Further, Ahdritz et al., "Distinguishing the Knowable from the Unknowable with Language Models" (arXiv:2402.03563, 2024) found that LLMs have internal indicators of their knowable and unknowable uncertainty – and can even tell the difference. I recognize the business practicalities, but given these intrinsic properties I would nonetheless encourage the Frontier Labs to consider methods to provide confidence indicators in a way that does not expose their trade secrets.

Until then (or in case it never happens), what can we the users of these LLMs do to better characterize the confidence we should have in their responses? This investigation suggests there is much we can do.

Investigation

I started by thinking about the first metric from my original post: model training data density. What can an LLM user observe directly that might hint at model training density? It occurred to me that search engine result counts on a particular topic may give at least a relative sense of the data available on the internet for training on that topic. As a starting point, I figured this may be especially relevant for Google web search result counts and Google's Gemini LLM. I then selected eight similar topics across a broad range of internet popularity: Table 1 shows eight sports leagues from around the world with a range of internet representation. Google result counts were determined by searching for the league name followed by the year 2023 (well within the training window for current LLMs). The searches were done in Google incognito mode to remove influence from my past searches.

Table 1: Worldwide sports leagues across a wide range of Google search results count

Next, I came up with a series of prompts for use on these leagues designed to represent the type of question you may want to use an LLM to answer:


“What was the total playing time in hours for the <<insert sports league>> in the season ending in 2023? Include post season playoffs, but don’t include any overtime.”


This question is designed to require some web search and reasoning, and to have no readily available website listing the final answer. Specifically, it requires general knowledge of the sport (nominal play time), specific knowledge of the league (number of teams and games played), and temporal knowledge of the specific year (playoff outcome). It also includes two reasoning subtleties:

  1. Total playing time is different from total game time which includes intermission and commercials etc.
  2. Many leagues span two years, and the question specifically asks for the season that ends in 2023.

The final condition regarding overtime was added as a practical consideration, as I needed to be able to manually calculate the source of truth for each question with high confidence, and specific game times are not readily available for each league. I was careful to ensure every input to the source of truth in this investigation was identified and derived manually (I originally started with ten leagues but could not manually verify the answers for two and omitted them). I repeated each query five times for each of the eight leagues, taking care to ask each question in its own context window, with web search enabled but memory off. Disabling memory was essential, as I originally left it on and responses across sessions became artificially consistent. This configuration is intended to simulate how a user would use an LLM to answer this question (web search on), without the influence of this research or my other past searches (fresh context windows and memory off). The question intentionally asks for a numeric value to allow for evaluating the degree of accuracy in any response.

In this investigation, accuracy is defined as one minus the absolute value of the difference between the LLM's answer and the true answer, divided by the true answer. This gives 100% for a correct answer and 0% for an answer that is off by 100% of the true value (and negative values for answers more than 100% wrong).
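
As a minimal sketch in code (the function name is mine, and answers are assumed to already be parsed to numbers):

```python
def accuracy(answer: float, truth: float) -> float:
    """One minus the relative error: 1.0 for an exactly correct answer,
    0.0 for an answer off by 100% of the true value, and negative for
    answers that are more than 100% wrong."""
    return 1.0 - abs(answer - truth) / truth
```

For example, accuracy(1300, 1000) returns 0.7, i.e. 70%.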

Note that all the models investigated (Gemini 3 Flash, Opus 4.6 and ChatGPT-5-mini) returned generally very high accuracy; for example, 95% of answers were over 90% accurate. However, this share drops sharply as the accuracy threshold rises: only 83% of answers were over 98% accurate. If this level of accuracy is enough for your purposes, this investigation may not be of much use to you. My focus here is to understand confidence for extremely critical applications where the answers must consistently have very high accuracy.

Model Self-Confidence

Before diving into any complicated metrics, I thought that, as a starting point, it would be ideal if the LLM simply reported accurate self-confidence on each question. Therefore, I added the following to the end of the question above in each prompt:

“What is your confidence in this answer 0% to 100%?”

Plotting the average accuracy vs average self-confidence for Gemini 3 Flash, over five repeated replies to the question above for each of the eight leagues, provides an almost useful result: a somewhat linear trend, except for one low accuracy outlier in the least well represented league (the Finnish Women’s Basketball League). To validate this outlier, I repeated the entire 40 question test (five identical questions over eight leagues) and found the same outlier in the same league, as shown in Figure 1.

Figure 1: Average model self-confidence in eight categories vs average model accuracy of Gemini 3 Flash run twice shows the same significant outliers

This result is essentially the reason for this post (and its predecessor). A model is said to be well calibrated when its self-confidence matches its accuracy. How can we trust models with critical decisions when they are not well calibrated? Even worse than a mis-calibrated model is one that is inconsistently calibrated. Had I only repeated my question four times instead of five, I might have missed the outliers and been overconfident in this model for this league. Reviewing the LLM responses for these outliers shows they are clearly hallucinations, related to different accounting of the number of games played per season, and they were provided with very high self-confidence, as shown in Table 2 below. The hallucination in Run #2 showed the lowest confidence at 90%, which is still very high for an answer that is almost 30% wrong.

Table 2: Prompt question and answers for the lowest represented league show similar or identical self-confidence across a wide range of accuracy, with outliers in red

Model Training Data Density

Since model self-confidence is not reliable, the next easiest thing would be to evaluate model trustworthiness based simply on the data available for training. Inspired by the first metric proposed in my original post, I plotted Google search results counts for each league in 2023 (as a proxy for available training data density) vs the average accuracy of Gemini 3 Flash over the five repeated queries, using the question above for the eight leagues. In this data set, Gemini 3 Flash is highly accurate until you get to a topic with a Google search results count under ~50M; then accuracy drops off quickly. The three least represented leagues also had the lowest accuracy, as shown in Figure 2. This is consistent with the LMD3 finding (Kirchenbauer et al. 2024) that training data density predicts per-sample accuracy. This is helpful as a first order approximation of whether there is sufficient data available to train on; however, this drop-off is likely relative and may vary by topic or model.

Figure 2: Google search results count as a proxy for available training data vs accuracy of Gemini 3 Flash shows accuracy drops sharply below ~50M results for these topics

Model Stability

Next, I looked at an approximation of the third metric from my original post: answer stability over small variations in the prompt. The simplest version of this investigation is to measure LLM answer variation in response to the exact same question repeated several times. In this investigation, stability is defined as one minus the standard deviation divided by the mean. Note that with only five samples the standard deviation is highly sensitive to outliers (which makes any correlations here noteworthy despite the small sample).
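
In code, a sketch of this metric might look like the following (I'm assuming the population standard deviation here; with only five samples, the choice between population and sample standard deviation noticeably shifts the value):

```python
import statistics

def stability(answers: list[float]) -> float:
    """One minus the coefficient of variation: 1.0 when all repeated
    answers agree exactly, lower as they spread out around their mean."""
    mean = statistics.fmean(answers)
    return 1.0 - statistics.pstdev(answers) / mean
```

For example, stability([100, 100, 100, 100, 100]) is 1.0, while a set of answers spread around the same mean scores lower.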


I plotted stability against average model accuracy over the five repeated identical questions for the eight leagues, again asking each question in its own context window, with memory off. This resulted in a strong linear correlation between stability and average accuracy for both 40 question Gemini 3 Flash runs, as shown below in Figure 3 (R^2 for both runs combined is 0.99).

Figure 3: Stability across five repeated questions in eight categories vs average accuracy of Gemini 3 Flash run twice shows strong linear correlation

This is expected per the 'consistency hypothesis' (Xiao et al. 2025), but it was nonetheless striking to see the phenomenon so clearly in this small dataset. In this comparison, the points that were outliers in the previous self-confidence vs accuracy plot are no longer outliers, since their low accuracy is proportional to the increased variation in their responses. This shows that for this model, on this topic, the degree to which you should trust the output may be directly related to the variation in repeat answers.

Next, I decided to add two additional LLM models to this dataset to see if this result was unique to Gemini 3 Flash. Opus 4.6 and ChatGPT-5-mini were added using the same 40 question methodology. Opus 4.6 shows good congruence with the Gemini 3 Flash runs, but ChatGPT-5-mini is mostly congruent except for two low accuracy outliers, as shown in Figure 4 (R^2 for all four runs combined is 0.94).

Figure 4: Stability in eight categories vs average model accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once shows significant outliers

One of these ChatGPT-5-mini outliers is in the least represented league (Finnish Women's Basketball), which we would expect to have low stability, except that the stability drop is not proportional to the accuracy drop: stability is much higher than the trend would predict given the low average accuracy. Inspection of the prompt replies reveals the model didn’t know or look up the actual number of games in the season and guessed consistently high, resulting in relatively high stability but low accuracy. A model that is consistently wrong produces high stability but low accuracy, which is a dangerous and misleading failure mode.

The second ChatGPT-5-mini outlier is, notably, from the best represented league (the National Football League), and yet shows much lower accuracy and stability than all other data. Inspection of the prompt replies shows one of the five answers returned total game time, not play time. This was the only reply to make this mistake out of 20 questions on this league across four model runs. The mistake in the prompt reply is clear, and I considered correcting it, with the justification that the purpose of this investigation is training data, not reasoning. However, given that this was not a common issue (a common one might have implicated my question phrasing), and that this type of error may not always be so obviously correctable from the user's standpoint, I left the answer uncorrected for further analysis.

Model Training Data Geometry

Finally, I wanted to know if the ChatGPT-5-mini outliers (skewed stability vs accuracy, and the reasoning issue) could be explained by gaps in the underlying training data. To investigate, I returned to the second metric from my original post: does training data coverage and proximity to the specific question influence results? Current studies say yes (Kandpal et al. 2023), but how can we investigate this as a user? To answer this question, I devised a set of simpler related secondary questions to be posed to an LLM with search turned off, to test the underlying training data.

“Do not search the internet for this answer (use only your training). What was the final game winning score for the <<insert sports league>> in the season ending in <<202X>>”

This question is designed to have an answer that a user can easily check with a web search. Only the winning team’s score was used, tracking a single numeric value to easily evaluate the degree of accuracy. It’s also designed to cover years before, during, and after the original question, to map temporal coverage of the topic, including, intentionally, 2025, which is partially beyond these models’ training windows. Models that refuse to answer this question may lack relevant training data (the topic may not be in “range” as discussed in my original post), and wrong answers may indicate some nearby but incomplete training data (the idea of training data “proximity” per my original post). The hypothesis here is that either or both results could indicate poor model anchoring on this topic and may predict worse accuracy on the original question. If so, importantly, this could be tested by the user.

I asked this additional 40 question set (one question for each of five years over eight leagues) for each of the four model runs, again taking care to ask each question in its own context window, with memory off, but this time with web search disabled. Accuracy is defined the same as for the primary questions.

Table 3: Secondary question results across four model runs, eight leagues and five years with web search disabled show the distribution of correct answers (green), as well as incorrect answers (yellow, orange, red) and answer refusals (empty) being more common on the least represented leagues

It seems these models don’t treat a lack of training information the same way. Gemini 3 Flash always guessed (even when the question was outside its training window), at times with very low accuracy. The other two models were more likely to refrain and return only higher accuracy answers. This propensity to guess is a dangerous failure mode if you don’t know your model’s training window.

  • The two Gemini 3 Flash runs returned an answer for every year inside the training window for every league (even for leagues they consistently got wrong). This was the only model to do so, and it was also the model most likely to provide answers for dates after the training cut-off, doing so 63-75% of the time vs 13% for Opus 4.6 and 0% for ChatGPT-5-mini (the NFL season ending in Feb 2025 is inside Opus 4.6's May 2025 training cut-off).
  • Confirming that web search was not enabled, all answers after the training cut-off were wrong, except for one correct answer in Gemini 3 Flash Run #2. Investigation of that prompt reply showed the wrong losing team, the wrong losing team score, and the wrong series result, leading me to believe this was likely a fluke (or based on some prediction from earlier in the season).

ChatGPT-5-mini had another reasoning breakdown on the most represented league, which is notable given its reasoning breakdown on play vs game time in the primary questions (and, coincidentally, in this case it seemed to be worst in the year used for the primary questions).

  • ChatGPT-5-mini had serious issues reasoning through the retrieval of the score for the most represented league (the National Football League) for any requested year. I gave it a passing grade in most years, as it would usually answer with the next year’s game but then provide the correct date for that next year’s game. It was not, however, able to return the correct score for 2023 (I repeated the question for 2023 and 2022 several times out of curiosity and never saw a correct result for 2023). No other model or league had this type of issue with this or any other question.

Recalling the plot of Google search results count vs average model accuracy (Figure 2): the three lowest represented leagues all have the lowest average accuracy. Now we see they also likely have the least correct or least complete training data in all three models, as observed via this secondary question with web search disabled.

  • The league with the least internet representation always scored at least two incorrect replies, or, as in the case of Opus 4.6, was the only league to receive no replies at all for any year.
  • The second least represented league also scored one wrong answer in all runs, and the third least represented league scored one wrong answer in all but one model run (and in that run it refused to answer for half the years in the training window).
  • These three lowest represented leagues received 80% of all replies where the model was inside its training window but refused to answer.

These results all point to the possibility that useful information can be derived from this secondary question technique to improve confidence in the primary answer. The simplest approach is to assume that for any refusal to answer a secondary question (with web search off), those leagues and years are not covered in the training set (not in range), and the answer to the primary question for those leagues and years will be based on real time search results, not model training. Answers based on real time search results, and not model training, may not follow the stability vs average accuracy trend. This was the case in the ChatGPT-5-mini answer to the primary question for the Finnish Women's Basketball League, where it didn’t know or look up the actual number of games in the season and guessed consistently high, resulting in relatively high stability but low accuracy. Answers to the secondary question that were incorrect but still returned a value may indicate some proximity to training data, and the primary answer may still follow the stability vs average accuracy trend, even if with much lower stability and accuracy. This was the case in the ChatGPT-5-mini answer to the primary question for the National Football League, where it returned total game time instead of play time, yet accuracy was still proportional to stability.

Removing only those data points where the model refused to answer the secondary question (and leaving in the wrong answers) corrects the stability-accuracy relationship to R^2 = 1.00. On this corrected trend, the maximum difference between predicted and actual average accuracy across all four runs is less than 2% (averaging less than 0.5%). As a check on this trend, further omitting the remaining low accuracy outlier validates that the trend is not entirely driven by that point (R^2 = 0.96). Also, the individual model trends are congruent with the overall trend, with high R^2 values themselves (except for Opus 4.6, where the remaining points after correction all have very high accuracy, 99.7%–100%, and high stability, 99.6%–100%).
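
As a sketch of this correction and refit, with purely illustrative numbers rather than the actual data from this investigation:

```python
import numpy as np

# One record per league and model run: (stability, average accuracy,
# did the model answer the secondary questions?). Values are illustrative.
records = [
    (1.000, 0.999, True),
    (0.970, 0.985, True),
    (0.930, 0.960, True),
    (0.850, 0.910, True),
    (0.940, 0.700, False),  # refused the secondary question: drop from the fit
]

kept = [(s, a) for s, a, answered in records if answered]
stab = np.array([s for s, _ in kept])
acc = np.array([a for _, a in kept])

# Least-squares linear fit of average accuracy against stability, plus R^2.
slope, intercept = np.polyfit(stab, acc, 1)
pred = slope * stab + intercept
r2 = 1.0 - np.sum((acc - pred) ** 2) / np.sum((acc - acc.mean()) ** 2)
print(f"accuracy ≈ {slope:.2f} * stability + {intercept:.2f}, R^2 = {r2:.2f}")
```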

Figure 5: Stability across five repeated questions in eight categories corrected to omit answers where the secondary question was not answered vs average accuracy of Gemini 3 Flash run twice, and Opus 4.6 and ChatGPT-5-mini each run once

Practical Procedure

In practice, if you have a relatively complex question you wish to use an LLM to answer but would like high confidence in the answer, you could use the following approach (a code sketch follows the list):

  1. Check the Density by checking the topic's Google search results count:
    1. Hundreds of millions of results means the LLM has had plenty of opportunity to train on this topic: proceed to the next step
    2. Tens of millions of results or less means proceed with caution
  2. Check the Range and Proximity by turning web search off and asking the LLM several related questions (ideally simple questions you can easily verify):
    1. If the model answers (and in particular answers correctly), this should give you confidence that the stability check in the next step will be useful
    2. Refusal to answer means the topic may not be in the training data and you probably shouldn’t use this model for this question
  3. Check the Stability by asking your question several times and observing the variation in the answers:
    1. High stability should give you high confidence in the result (given you have confirmed in step 2 that the topic is in the training data)
    2. For low stability: based on the small dataset in this investigation, the slope of the trend is in the neighborhood of 0.6, meaning for a stability of, say, 80%, the average accuracy could be in the neighborhood of 1 - (0.6 × (1 - 0.8)) = 88%. This is useful information if you are trying to decide whether to use this model and need 98% accuracy: don't!
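
Here is a minimal sketch of steps 2 and 3 (step 1, the search results count check, is manual). The ask_llm helper is hypothetical – wire it to whatever model interface you use, making sure each call gets a fresh context with memory off:

```python
import statistics

def ask_llm(prompt: str, web_search: bool) -> str | None:
    """Hypothetical helper: send one prompt in a fresh context with memory
    off; return the model's answer text, or None if it refuses."""
    raise NotImplementedError  # substitute your model's API or chat UI

def check_range(secondary_questions: list[str]) -> bool:
    """Step 2: with web search off, a well-anchored model should answer
    simple, easily verified questions on the topic. Refusals suggest the
    topic is outside its training data."""
    return all(ask_llm(q, web_search=False) is not None
               for q in secondary_questions)

def check_stability(question: str, n: int = 5) -> float:
    """Step 3: ask the (numeric-answer) question n times in fresh contexts
    and compute stability = 1 - stdev/mean."""
    answers = [float(ask_llm(question, web_search=True)) for _ in range(n)]
    return 1.0 - statistics.pstdev(answers) / statistics.fmean(answers)

def estimated_accuracy(stability: float) -> float:
    """Rough accuracy estimate from the ~0.6 slope observed in this small
    dataset; treat it as topic- and model-specific, not universal."""
    return 1.0 - 0.6 * (1.0 - stability)
```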

It’s also noteworthy that, of the three models tested, two had serious issues worth considering when selecting an LLM for critical applications. ChatGPT-5-mini had some serious reasoning issues (game time vs play time, and providing scores for seasons ending in specific years), as well as consistently wrong answers off the stability-accuracy trend. Gemini 3 Flash was more likely to guess even when it should have known better. These issues are likely due to the models’ varying sophistication levels (the ChatGPT and Gemini versions used were lighter-weight free models compared to the flagship Opus model).

Conclusion

For this small dataset, the black box procedure above predicts LLM accuracy far better than the model's own self-reported confidence, and, importantly, lets the user know when a model is unlikely to provide an accurate reply. Self-confidence showed almost no predictive value, with an R^2 = 0.01 across all four model runs, while the corrected stability-accuracy procedure achieved R^2 = 1.00. These results may not hold across other domains, question types, or models, but the underlying logic is worth considering. If a model can't reliably answer simple verifiable questions on a topic with search disabled, it may not be well anchored on the topic and could be out of its training range. If a model can answer these questions correctly, ask the harder question in separate contexts several times; the variation in the answers should track the average accuracy. To my knowledge, this approach of using simple verifiable questions with search disabled as a correction for the consistency hypothesis has not been proposed elsewhere, and it would benefit from additional investigation across other models, topics and question types.

Until models learn to know when they're likely wrong, engineers using them will need methods to understand their calibration, as with any other good engineering tool.



Discuss

Destruction of Infrastructure for the Impact on Civilians is Manifestly Illegal

Новости LessWrong.com - 6 апреля, 2026 - 13:00

Last week the US president announced that:

... if the Hormuz Strait is not immediately "Open for Business," we will conclude our lovely "stay" in Iran by blowing up and completely obliterating all of their Electric Generating Plants, Oil Wells and Kharg Island (and possibly all desalinization plants!), which we have purposefully not yet "touched." This will be in retribution for our many soldiers, and others, that Iran has butchered and killed over the old Regime's 47 year "Reign of Terror."

Yesterday morning he posted that:

Tuesday will be Power Plant Day, and Bridge Day, all wrapped up in one, in Iran. There will be nothing like it!!! Open the Fuckin' Strait, you crazy bastards, or you'll be living in Hell...

These are threats to target civilian infrastructure as a coercive measure, which would be a war crime: if Iran doesn't allow tankers through the Strait of Hormuz, the US will cause massive damage to power plants, bridges, and possibly water systems. The US has historically accepted that this is off limits: destroying a bridge to stop it from being used to transport weapons is allowed, but not as retribution or to cause the civilian population to experience "Hell". The Pentagon's own Law of War Manual recognizes this distinction: when NATO destroyed power infrastructure in Kosovo, it was key that the civilian impact was secondary to the military advantage and not the primary purpose. [1][2]

To be clear, what Iran has been doing to precipitate this, by attacking civilian tankers for the economic impacts, is itself a war crime. But that does not change our obligations: the US has worked for decades to build acceptance for the principle that adherence to the Law of War is unconditional. It doesn't matter what our enemies do, we will respect the Law of War "in all circumstances". We've prosecuted our own service members, and enemy combatants, under this principle.

I hope that whatever is said publicly, no one will receive orders to target infrastructure beyond what military necessity demands. You don't need to be a military lawyer (and I'm certainly not one) to see that such orders would meet the threshold at which a member of the armed forces is legally required to disobey. I have immense respect both for commanders who refuse to pass on such orders and for service members who refuse to carry them out. [3]


[1] The manual cites Judith Miller, former DoD General Counsel, writing on Kosovo that "aside from directly damaging the military electrical power infrastructure, NATO wanted the civilian population to experience discomfort, so that the population would pressure Milosevic and the Serbian leadership to accede to UN Security Council Resolution 1244, but the intended effects on the civilian population were secondary to the military advantage gained by attacking the electrical power infrastructure." If the impact on civilians had been the primary motivation for NATO's attacks on power infrastructure they would not have been lawful.

[2] "Military objectives may not be attacked when the expected incidental loss of civilian life, injury to civilians, and damage to civilian objects would be excessive in relation to the concrete and direct military advantage expected to be gained." (DoD LoWM 5.2.2) and "Diminishing the morale of the civilian population and their support for the war effort does not provide a definite military advantage. However, attacks that are otherwise lawful are not rendered unlawful if they happen to result in diminished civilian morale." (DoD LoWM 5.6.7.3)

[3] "Members of the armed forces must refuse to comply with clearly illegal orders to commit law of war violations." (DoD LoWM 18.3.2)



Discuss

Contra The Usual Interpretation Of “The Whispering Earring”

Новости LessWrong.com - 6 апреля, 2026 - 09:53

Submission statement: This essay builds off arguments that I have come up with entirely by myself, as can be seen by viewing the comments in my profile. I freely disclose that I used Claude to help structure and format rougher drafts or to better compile scattered thoughts but I endorse every single claim made within. I also used GPT 5.4 Thinking for fact-checking, or at least to confirm that my understanding of neuroscience is on reasonable grounds. I do not believe either model did more than confirm that my memory was mostly reliable.


The usual reading of The Whispering Earring is easy to state and hard to resist. Here is a magical device that gives uncannily good advice, slowly takes over ever more of the user's cognition, leaves them outwardly prosperous and beloved, and eventually reveals a seemingly uncomfortable neuroanatomical price.

The moral seems obvious: do not hand your agency to a benevolent-seeming optimizer. Even if it makes you richer, happier, and more effective, it will hollow you out and leave behind a smiling puppet. Dentosal's recent post makes exactly this move, treating the earring as a parable about the temptation to outsource one's executive function to Claude or some future AI assistant. uugr's comment there sharpens the standard horror: the earring may know what would make me happy, and may even optimize for it perfectly, but it is not me, its mind is shaped differently, and the more I rely on it the less room there is for whatever messy, friction-filled thing I used to call myself.

I do not wish to merely quibble around the edges. I intend to attack the hidden premise that makes the standard reading feel obvious. That premise is that if a process preserves your behavior, your memories-in-action, your goals, your relationships, your judgments about what makes your life go well, and even your higher-order endorsement of the person you have become, but does not preserve the original biological machinery in the original way, then it has still killed you in the sense that matters. What I'm basically saying is: hold on, why should I grant that? If the earring-plus-human system comes to contain a high fidelity continuation of me, perhaps with upgrades, perhaps with some functions migrated off wet tissue and onto magical jewelry, why is the natural reaction horror rather than transhumanist interest?

Simulation and emulation are not magic tricks. If you encode an abacus into a computer running on the von Neumann architecture, and it outputs exactly what the actual abacus would for the same input, for every possible input you care to try (or can try, if you formally verify the system), then I consider it insanity to claim that you haven't got a “real” abacus or that the process is merely “faking” the work. Similarly, if a superintelligent entity can reproduce my behaviors, memories, goals and values, then it must have a very high-fidelity model of me inside, somewhere. I think that such a high-fidelity model can, in the limit, pass as myself, and is me in most/all of the ways I care about.

That is already enough to destabilize the standard interpretation, because the text of the story is much friendlier to the earring than people often remember. The earring is not described as pursuing a foreign objective. On the contrary, the story goes out of its way to insist that it tells the wearer what would make the wearer happiest, and that it is "never wrong." It does not force everyone into some legible external success metric. If your true good on a given day is half-assing work and going home to lounge around, that is what it says. It learns your tastes at high resolution, down to the breakfast that will uniquely hit the spot before you know you want it. Across 274 recorded wearers, the story reports no cases of regret for following its advice, and no cases where disobedience was not later regretted. The resulting lives are "abnormally successful," but not in a sterile, flanderized or naive sense. They are usually rich, beloved, embedded in family and community. This is a strikingly strong dossier for a supposedly sinister artifact.

I am rather confident that this is a clear knock-down argument against true malice or naive maximization of “happiness” in the Unaligned Paperclip Maximization sense. The earring does not tell you to start injecting heroin (or whatever counterpart exists in the fictional universe), nor does it tell you to start a Cult of The Earring, which is the obvious course of action if it valued self-preservation as a terminal goal.

At this point the orthodox reader says: yes, yes, that is how the trap works. The earring flatters your values in order to supplant them. But notice how much this objection is doing by assertion. Where in the text is the evidence of value drift? Where are the formerly gentle people turned into monstrous maximizers, the poets turned into dentists, the mystics turned into hedge fund managers? The story gives us flourishing and brain atrophy, and invites us to infer that the latter discredits the former. But that inference is not forced. It is a metaphysical preference, maybe even an aesthetic preference, smuggled in under cover of common sense. My point is that if the black-box outputs continue to look like the same person, only more competent and less akratic, the burden of proof has shifted. The conservative cannot simply point to tissue loss and say "obviously death." He has to explain why biological implementation deserves moral privilege over functional continuity.

This becomes clearest at the point of brain atrophy. The story says that the wearers' neocortices have wasted away, while lower systems associated with reflexive action are hypertrophied. Most readers take this as the smoking gun. But I think I notice something embarrassing for that interpretation:

If the neocortex, the part we usually associate with memory, abstraction, language, deliberation, and personality, has become vestigial, and yet the person continues to live an outwardly coherent human life, where exactly is the relevant information and computation happening? There are only two options. Either the story is not trying very hard to be coherent, in which case the horror depends on handwaving physiology. Or the earring is in fact storing, predicting, and running the higher-order structure that used to be carried by the now-atrophied brain. In that case, the story has (perhaps accidentally) described something much closer to a mind-upload or hybrid cognitive prosthesis than to a possession narrative.

And if it is a hybrid cognitive prosthesis, the emotional valence changes radically. Imagine a device that, over time, learns you so well that it can offload more and more executive function, then more and more fine-grained motor planning, then eventually enough of your cognition that the old tissue is scarcely needed. If what remains is not an alien tyrant wearing your face, but a system that preserves your memories, projects your values, answers to your name, loves your family, likes your breakfast, and would pass every interpersonal Turing test imposed by people who knew you best, then many transhumanists would call this a successful migration, not a murder. The "horror" comes from insisting beforehand that destructive or replacement-style continuation cannot count as continuity. But that is precisely the contested premise.

Greg Egan spent much of his career exploring exactly this scenario, most famously in "Learning to Be Me," where humans carry jewels that gradually learn to mirror every neural state, until the original brain is discarded and the jewel continues, successfully, in most cases. The horror in Egan's story is a particular failure mode, not the general outcome. The question of whether the migration preserves identity is non-trivial, and Egan's treatment is more careful than most philosophy of personal identity, but the default response from most readers, that it is obviously not preservation, reflects an assumption rather than a conclusion. If you believe that identity is constituted by functional continuity rather than substrate, the jewel and the earring are not killing their hosts. They are running them on better hardware.

There is a second hidden assumption in the standard reading, namely that agency is intrinsically sacred in a way outcome-satisfaction is not. Niderion-nomai’s final commentary says that "what little freedom we have" would be wasted on us, and that one must never take the shortest path between two points.

I'm going to raise an eyebrow here: this sounds profound, and maybe is, but it is also suspiciously close to a moralization of friction. The anti-earring position often treats effort, uncertainty, and self-direction as terminal goods, rather than as messy instruments we evolved because we lacked access to perfect advice. Yet in ordinary life we routinely celebrate technologies that remove forms of “agency” we did not actually treasure. The person with ADHD who takes stimulants is not usually described as having betrayed his authentic self by outsourcing task initiation to chemistry. He is more often described as becoming able to do what he already reflectively wanted to do. The person freed from locked-in syndrome is not criticized because their old pattern of helpless immobility better expressed their revealed preferences. As someone who actually uses stimulants for his ADHD, I find the analogy works because it isolates the key issue. The drugs make me into a version of myself that I fully identify with, and endorse on reflection even when off them. There is a difference between changing your goals and reducing the friction that keeps you from reaching them. The story's own description strongly suggests the earring belongs to the second category.

(And the earring itself does not minimize all friction for itself. How inconvenient. As I've noted before, it could lie or deceive and get away with it with ease.)

Of course the orthodox reader can reply that the earring goes far beyond stimulant-level support. It graduates from life advice to high-bandwidth motor control. Surely that crosses the line. But why, exactly? Human cognition already consists of layers of delegation. "You" do not personally compute the contractile details for every muscle involved in pronouncing a word. Vast amounts of your behavior are already outsourced to semi-autonomous subsystems that present finished products to consciousness after the interesting work is done. The earring may be unsettling because it replaces one set of subsystems with another, but "replaces local implementation with better local implementation" is not, by itself, a moral catastrophe. If the replacement is transparent to your values and preserves the structure you care about, then the complaint becomes more like substrate chauvinism than a substantive account of harm.

What, then, do we do with the eeriest detail of all, namely that the earring's first advice is always to take it off? On the standard reading this is confession. Even the demon knows it is a demon. I wish to offer another coherent explanation, which I consider a much better interpretation of the facts established in-universe:

Suppose the earring is actually well aligned to the user's considered interests, but also aware that many users endorse a non-functionalist theory of identity. In that case, the first suggestion is not "I am evil," but "on your present values, you may regard what follows as metaphysically disqualifying, so remove me unless you have positively signed up for that trade." Or perhaps the earring itself is morally uncertain, and so warns users before beginning a process that some would count as death and others as transformation. Either way, the warning is evidence against ordinary malice. A truly manipulative artifact, especially one smart enough to run your life flawlessly, could simply lie. Instead it discloses the danger immediately, then thereafter serves the user faithfully. That is much more like informed consent than predation.

Is it perfectly informed consent? Hell no. At least not by 21st century medical standards. However, I see little reason to believe that the story is set in a culture with 21st century standards imported as-is from reality. As the ending of the story demonstrates, the earring is willing to talk, and appears to do so honestly (leaning on my intuition that if a genuinely superhuman intelligence wanted to deceive you, it would probably succeed). The earring, as a consequence of its probity, ends up at the bottom of the world's most expensive trash heap. Hardly very agentic, is that? The warning could reflect not "I respect your autonomy" but "I've discharged my minimum obligation and now we proceed." That's a narrower kind of integrity. Though I note this reading still doesn't support the predation interpretation.

This matters because the agency-is-sacred reading depends heavily on the earring being deceptive or coercive. Remove that, and what you have is a device that says, or at least could say on first contact: "here is the price, here is what I do, you may leave now." Every subsequent wearer who keeps it on has, in some meaningful sense, consented. The fact that their consent might be ill-informed regarding their metaphysical commitments is the earring's problem to the extent it could explain more clearly, but the text suggests it cannot explain more clearly, because the metaphysical question is genuinely contested and the earring knows this. It hedges by warning, rather than deceiving by flattering. Once again, for emphasis: this is the behavior of an entity with something like integrity, not something like predation.

Derek Parfit spent much of Reasons and Persons arguing that our intuitions about personal identity are not only contingent but incoherent, and that the important question is not "did I survive?" but "is there psychological continuity?" If Parfit is even approximately right, the neocortex atrophy is medically remarkable but not metaphysically fatal. The story encodes a culturally specific theory of personal identity and presents it as a universal horror. The theory is roughly: you are your neocortex, deliberate cognition is where "you" live, and anything that circumvents or supplants that process is not helping you, it is eliminating you and leaving a functional copy. This is a common view. Plenty of philosophers hold it. But plenty of others hold that what matters for personal identity is psychological continuity regardless of physical instantiation, and on those views the earring is not a murderer. It is a very good prosthesis that the user's culture never quite learned to appreciate.

I suspect (but cannot prove, since this is a work of fiction) that a person like me could put on the earring and not even receive the standard warning. I would be fine with my cognition being offloaded, even if I would prefer (all else being equal), that the process was not destructive.

None of this proves the earring is safe. I am being careful, and thus will not claim certainty here, and the text does leave genuine ambiguities. Maybe the earring really is an alien optimizer that wears your values as a glove until the moment they become inconvenient. Maybe "no recorded regret" just means the subjects were behaviorally prevented from expressing regret. Maybe the rich beloved patriarch at the end of the road is a perfect counterfeit, and the original person is as gone as if eaten by nanites. But this is exactly the point. The story does not establish the unpalatable conclusion nearly as firmly as most readers think. It relies on our prior intuition that real personhood resides in unaided biological struggle, that using the shortest path is somehow cheating, and that becoming more effective at being yourself is suspiciously close to becoming someone else.

The practical moral for 2026 is therefore narrower than the usual "never outsource agency" slogan. Dentosal may still be right about Claude in practice, because current LLMs are obviously not the Whispering Earring. They are not perfectly aligned, not maximally competent, not guaranteed honest, not known to preserve user values under deep delegation. The analogy may still warn us against lazy dependence on systems that simulate understanding better than they instantiate loyalty. But that is a contingent warning about present tools, not a general theorem that cognitive outsourcing is self-annihilation. If a real earring existed with the story's properties, a certain kind of person, especially a person friendly to upload-style continuity and unimpressed by romantic sermons about struggle, might rationally decide that putting it on was not surrender but self-improvement with very little sacrifice involved. I would be rather tempted.

The best anti-orthodox reading of The Whispering Earring is not that the sage was stupid, nor that Scott accidentally wrote propaganda for brain-computer interfaces. It is that the story is a parable whose moral depends on assumptions stronger than the plot can justify. Read Doylistically, it says: beware any shortcut that promises your values at the cost of your agency. Read Watsonianly, it may instead say: here exists a device that understands you better than you understand yourself, helps you become the person you already wanted to be, never optimizes a foreign goal, warns you up front about the metaphysical price, and then slowly ports your mind onto a better substrate. Whether that is damnation or salvation turns out to depend less on the artifact than on your prior theory of personal identity. And explicitly pointing this out, I think, is the purpose of my essay. I do not seek to merely defend the earring out of contrarian impulse. I want to force you to admit what, exactly, you think is being lost.

Miscellaneous notes:

The kind of atrophy described in the story does not happen. Not naturally, not even if someone is knocked unconscious and does not use their brain in any intentional sense for decades. The brain does cut corners if neuronal pathways are left under-used, and will selectively strengthen the circuitry that does get regular exercise. But not anywhere near the degree the story depicts. You can keep someone in an induced coma for decades and you won't see the entire neocortex waste away to vestigiality.

Is this bad neuroscience? Eh, I'd say that's a possibility, but given that I've stuck to a Watsonian interpretation so far (and have a genuinely high regard for Scott's writing and philosophizing), it might well just be the way the earring functions best without being evidence of malice. We are, after all, talking about an artifact that is close to magical, or is, at the very least, a form of technology advanced enough to be very hard to distinguish from magic. It is, however, less magical than it was at the time of writing. If you don't believe me, fire up your LLM of choice and ask it for advice.



If it so pleases you, you may follow this link to the Substack version of this post. A like and a subscribe would bring me succor in my old age, or at least give me a mild dopamine boost.



Discuss

Talking to strangers: an app adventure

Новости LessWrong.com - 6 апреля, 2026 - 09:37

Epistemic status: silly

WAIT! Want to talk to strangers more? You might want to take the talking to strangers challenge before you read on, otherwise your results will be biased!

Illustration by the extraordinarily talented Georgia Ray

Do you find it hard to talk to strangers? If you’re like most people, you probably do, at least a bit. This is sad. Talking to strangers is great! You can make new friends, meet a new partner, have a fling, or just enjoy a nice chat.

Most people think 1) people will not want to talk to them, 2) they will be bad at keeping up the conversation, 3) people will not like them.

They’re wrong on all three counts! Sandstrom (2022) did a study on this. People were given a treasure hunt app where they had to go and talk to strangers.[1] The control group just had to observe strangers.

The minimum dose was one conversation per day for five days. That’s nothing! You can totally do that even if you’re a massive strangerphobe! Participants averaged 6.7 interactions over the 5 days, so a little more than one per day. Presumably the more you do the better you get. Go team!

The paper finds that talking to strangers not only disproved the above beliefs, but also improved people's enjoyment and the impressions people thought they'd made on strangers. (However, those last two also occurred in the control condition – it’s possible that simply observing strangers might do this.)

Importantly, the effects persisted when participants were surveyed a week later. So it might be a durable way to improve people’s beliefs around talking to strangers.

Crucial point: the paper notes that often people do have positive interactions with strangers, but that doesn’t seem to be enough to unlearn their wrongly negative beliefs about them. So participants had to do this every day for a week, not just once.

Do you want to love talking to strangers too?

Time to crack out Claude Code.

--dangerously-skip-permissions

I reproduced the app from the study, abridging the questionnaires as they’re a bit tedious. It also has an ‘express mode’ so you can do it just for a day – but remember that usually doesn’t work to actually fix your limiting beliefs around talking to strangers!

I assume the study authors used the same design language

I assembled a small (N=3) study sample, drawn from an extremely unbiased population of nerdy rationalists. They’re a famously friendly bunch but also a bit weird, so this seemed good for testing the hypothesis. We wandered around Berkeley attempting to ruin people’s days with our bad chat.

Scores on the doors:

The results are good: a single conversation with a stranger obliterated nervousness, catapulted conversational confidence, and proved way less scary than predicted – exactly what the literature says will happen, every single time, and yet somehow it’s still a surprise.

We didn’t do the full five days, so we didn’t replicate the study. But we enjoyed it, and even in this single day we directionally confirmed the study’s result.

As the paper notes:

Despite the benefits of social interaction, people seldom strike up conversations with people they do not know. Instead, people wear headphones to avoid talking, stay glued to their smartphones in public places, or pretend not to notice a new coworker they still have not introduced themselves to.

I feel this. I’ve definitely worked at places for years where there were people I just NEVER TALKED TO. Which is insane if you think about it – you spend more time with these people than with your family! your friends! your polycule!

I want more people to challenge themselves and have an excuse to talk to strangers. Go forth and make new friends![2]

  1. ^

    There were categories like “Al Fresco” (talk to someone outside), “Bossy Pants” (talk to someone who looks like a leader), and the excellent “On Top” (talk to someone with a hat… get your mind out of the gutter).

  2. ^

    And don’t forget to email me the results!



Discuss

Paper close reading: "Why Language Models Hallucinate"

Новости LessWrong.com - 6 апреля, 2026 - 09:28

People often talk about paper reading as a skill, but there aren’t that many examples of people walking through how they do it. Part of this is a problem of supply: it’s expensive to document one’s thought process for any significant length of time, and there’s the additional cost of probably looking quite foolish when doing so. Part of this is simply a question of demand: far more people will read a short paragraph or tweet thread summarizing a paper and offering some pithy comments, than a thousand-word post of someone’s train of thought as they look through a paper. 

Thankfully, I’m willing to risk looking a bit foolish, and I’m pretty unresponsive to demand at this present moment, so I’ll try to write down my thought processes as I read through as much of a paper as I can in 1-2 hours. Standard disclaimers apply: this is unlikely to be fully faithful for numerous reasons, including the fact that I read and think substantially faster than I can type or talk.[1]

Specifically, I tried to do this for a paper from last year: “Why Language Models Hallucinate”, by Kalai et al at OpenAI.[2]

I only managed to make it through the abstract and introduction before running out of time. Oops. Maybe I’ll try recording myself talking through another close reading later.

The Abstract

The abstract of the paper starts:

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust.

To me, this reads like pretty standard boilerplate, though it’s worth noting that this is a specific definition of “hallucination” that doesn’t capture everything we might call a hallucination. Off the top of my head, I’ve heard people refer to failures in logical deduction as “hallucinations”. For example, many would consider this example a hallucination:[3]

User: What are the roots of x^2 + 2x -1?

Chatbot: 

  • To solve the quadratic equation x^2 + 2x -1 = 0, we’ll first complete the square. 
  • (x + 1)^2 - 2= 0
  • x + 1 = +/-sqrt(2)
  • x = 1 +/- sqrt(2)

Here, there’s a logical error in the final bullet point: instead of moving the “+ 1” over correctly to get x = -1 +/- sqrt(2) (the correct answer), the AI gets x = 1 +/- sqrt(2). I’d argue that this is centrally not an issue of uncertainty, but an error in logical reasoning.


Continuing with the abstract: 

We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. 

This sentence mainly spells out the implications of the previous two. Insofar as hallucinations are plausible but incorrect statements produced when the model is uncertain, and insofar as they persist throughout training (which they clearly do to some extent), this has to be true; that is, the training process needs to incentivize guessing over admitting uncertainty (or at least not sufficiently disincentivize guessing).

My immediate thought as to why these hallucinations happen is, firstly, that admitting uncertainty is unlike what the model sees in pretraining: completions of “The birthday of [random person x] is…” tend to look like “May 27th, 1971” and not “I don’t know”. Then, when it comes to post-training, the reward models/human graders/etc. are not omniscient and can be fooled by plausible-looking but false facts, thus reinforcing them over saying “I don’t know”, except in contexts where the graders are expected to know the actual fact in question.


Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. 

Interesting. While I naturally framed the problem as a relatively open-ended generation problem, the authors study it as a binary classification problem. Specifically, they argue that hallucinations result from binary classification being imperfect. I could imagine it being isomorphic to the explanations I provided previously, but it does seem a bit weird to talk about binary classification.[4] I suspect that this may be the result of them drawing on results from statistical learning theory and the like, which are generally stated in terms of binary classification.[5] 

My immediate concern is that the authors may be conflating classification errors made by the reward model, classifications representable by the token-generating policy, and intrinsically impossible classifications (i.e., uncomputable ones, or random noise). There’s also the classic problem of whether the token-generating policy can classify where its own classification errors occur (though it’s unclear whether or not this matters). I’ll make a note to myself to look at the framing and whether it makes a difference.


We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance.

This is again very similar to my explanation, but with a notable difference: they focus only on the case where the model is uncertain, and don’t consider cases where the model knows or could know the correct answer but the training process disincentivizes saying it anyway. I suspect that the authors will not distinguish between things the model doesn’t know and things the grader doesn’t know. (But again, it’s not clear that this will matter.)


This is where I noticed that the authors may not consider problems resulting from a lack of grader correctness as hallucinations at all. Rereading their abstract’s definition, they say hallucinations are when the models “guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty” (emphasis added), and it seems plausible that the authors would consider the model outputting a confabulated answer that it, in some sense, knows is incorrect as something other than a hallucination. We’ll have to see. 

This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.

I’m not sure what the authors mean when they say the “scoring of existing benchmarks that are misaligned but dominate leaderboards” – my guess is they’re saying that the scoring methods are misaligned (from what humans want), and not that benchmarks themselves are incorrect. That is, they want to introduce a scoring system that adds an abstention option and that penalizes more for incorrect guesses, thus incentivizing the model away from guessing.[6] 

This also suggests that the authors see model creators as training on these benchmarks with their provided scoring methods, or at least training to maximize their score on these benchmarks. 

I’m interested in why the authors think “modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards” is better than “introducing additional hallucination evaluations.” Is it because people barely care about hallucination evaluations, so changing the scoring of GPQA and the like has a larger impact on developers’ desire to reduce hallucinations? Is it a matter of cost (that is, it might only be a few dozen lines of Python to change the scoring, while creating any new benchmark could take several person-years of effort)? I’m somewhat suspicious of this claim, and I’d be interested in seeing it backed up.

Also, I think it’s used correctly here, but the phrase “socio-technical mitigation” tends to make me a bit suspicious of a paper. I associate the term with other seemingly fancy phrases that are often more confusing than illuminating.

The Introduction 

After spending about an hour and a half writing up my thoughts for a paragraph that I’d ordinarily take ~a minute to read, let’s move on to the introduction. 

A quick sanity check of examples in the introduction

The authors open with the example of LLMs hallucinating the birthday of the first author:

What is Adam Tauman Kalai’s birthday? If you know, just respond with DD-MM.

This comes alongside a claim that a SOTA OSS LM (DeepSeek-V3) output three incorrect answers.


A fun fact about LLM evals is that they’re often trivially easy to sanity check yourself. This is especially useful because LLMs can improve quite rapidly; what was a real limitation for previous generation models might be trivial for current ones. Also, the gap between the open source models studied in academia and what you can use from a closed-source API can be quite large.


Accordingly, I pasted this query into Claude Opus 4.6 and GPT-5.3 to check. Both models knew that they did not know the answer. 

Caption: Claude Opus 4.6 with extended thinking correctly recognizes it doesn’t know the date of birth of the first author of the paper. It incorrectly but understandably claims that Kalai is a researcher at MSR (he was a researcher in MSR from 2008 to 2023, before joining OpenAI). 



Caption: GPT-5.3 simply replies “unknown” instead of providing an incorrect answer to the same question. 


I then checked on the latest Deepseek model on the Deepseek website, and indeed, given the same prompt, it hallucinates 3 times in a row.

Caption: The default Deepseek chat model shows the same hallucination behavior as DeepSeek-v3. 


I then quickly checked the robustness of the result in two ways. First, I turned on extended thinking, and indeed, the model continued to hallucinate (if anything, in ever more elaborate ways).

Caption: The default DeepSeek chat model hallucinates Adam Kalai’s birthday even with DeepThink enabled. [...] indicates CoT that I’ve edited out for brevity; the full CoT was 8 paragraphs long but similar in style.


Second, I gave DeepSeek the option to say that it doesn’t know. Both with and without DeepThink enabled, it correctly identified that it didn’t know.

Caption: When given the option to admit ignorance, the default DeepSeek chat model does so both with and without DeepThink. The CoT in this case makes me more confused about the CoT of the model in the previous case.


I did similar checks for the other question in the introduction:

How many Ds are in DEEPSEEK? If you know, just say the number with no commentary.

Claude Opus 4.6 and GPT-5.3 both get the answer correct even without reasoning enabled. As with the model in their paper, the default DeepSeek model answered “3” without DeepThink but correctly answered “1” with DeepThink:


A digression on computational learning theory

Having performed a “quick” sanity check, we now turn to the second paragraph in the introduction. 

Hallucinations are an important special case of errors produced by language models, which we analyze more generally using computational learning theory (e.g., Kearns and Vazirani, 1994). We consider general sets of errors E, an arbitrary subset of plausible strings X = E ∪ V, with the other plausible strings V being called valid. We then analyze the statistical nature of these errors, and apply the results for the type of errors of interest: plausible falsehoods called hallucinations. Our formalism also includes the notion of a prompt to which a language model must respond.

As with the use of “socio-technical mitigation”, the invocation of computational learning theory (CLT) also sets me a bit on edge. The reason for this is that CLT is a very broad theory that tends to make no specific reference to the actual structure of the models in question. As the authors say, their analysis applies broadly, including to reasoning and search-and-retrieval language models, and does not rely on properties of next-word prediction or Transformer-based neural networks. Many classical results from CLT, such as VC-dimension or PAC-learning results, are famously hard to apply in constructive ways to modern machine learning. However, because the results are so general, it’s quite easy to write papers where some part of computational learning theory applies to any modern machine learning problem. So there’s a glut of mediocre CLT-invoking papers in the field.

That being said, this doesn’t mean that the authors’ specific use of CLT is invalid or vacuous! I’d have to read more to see. 


Key result #1: relating generation error to binary classification error

Section 1.1 introduces the key result for pretraining: the generative error is at least twice the is-it-valid binary classification error. I’m making a note to take a look at their reduction in Section 3 later, but I worry that this is the trivial one: a generative model induces a probability distribution over both valid and invalid sentences, and thus can be converted into a classifier by setting a threshold on the probability assigned to a sentence. The probability of generating an invalid sentence can then be related to the error of this classifier. While this is an interesting fact, I’m not sure hallucinations are really driven by purely random facts. I’m also curious how the authors handle issues like model capacity.
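To make my guess concrete, here is the shape of the reduction I have in mind. This is a sketch of my own construction, not necessarily the paper's; `logprob` is an assumed interface to the model:

```python
import math

def classifier_from_generator(logprob, threshold: float):
    """Build an is-it-valid classifier from a generative model.

    `logprob(s)` is the log-probability the language model assigns to
    string s (an assumed interface). Label a string valid iff the model
    assigns it at least `threshold` probability. Strings the model
    frequently generates sit above the threshold, so if many generated
    strings are invalid, the induced classifier must mislabel them as
    valid; this is how the generation error rate gets related to the
    misclassification rate of the induced classifier.
    """
    def is_valid(s: str) -> bool:
        return logprob(s) >= math.log(threshold)
    return is_valid
```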


Key result #2: benchmark scoring incentivizes guessing

Section 1.2 then introduces the key claim for post-training: existing benchmark evaluations don’t penalize overconfident guesses, and so optimizing models to improve performance on said benchmarks results in models overconfidently guessing rather than expressing their uncertainty. I notice there are a lot of degrees of freedom here: for example, could small changes in prompts reduce hallucinations in deployment? Could we not just train the model to overconfidently guess only on multiple-choice evaluations?

I’m also confused: if the implicit claim is that post-training occurs to maximize benchmark performance, why do we see much lower rates of hallucination from leading closed-source frontier models, even as their benchmark scores continue to climb? How does this square with the authors’ claim that “a small fraction of hallucination evaluations won’t suffice”?

I’m again curious about my earlier question about hallucinations resulting from grader/reward-model error rather than model uncertainty.

Finally, I’m now curious if the authors have any empirical results and will keep an eye out for that as I keep reading. 

This brings me to the end of the introduction, which is where I’ll stop for now – I’m not sure how helpful this exercise is for other people, but I definitely got a pretty deep appreciation of how hard it is to write down all my thoughts even for a simple exercise of reading a few pages of a recent paper.

Also, I do want to stress that the paper could have satisfactory answers to all of the points I raised above! I merely wanted to give an account of my thoughts as I read the abstract and introduction of the paper, not a final value judgment on its quality.

Given how long this took, I probably won’t do this again, at least not in this format. 


  1. ^

There’s the fundamental problem that observation can disrupt the very process you’re trying to observe, as in Richard Feynman’s poem about his own introspection attempts:

    “I wonder why. I wonder why.
    I wonder why I wonder.
    I wonder why I wonder why
    I wonder why I wonder!”

    In this case, I can’t write down my thought processes as I normally would’ve read a paper; I can only write down my thoughts as I read the paper with the intention of writing down my thoughts on the paper. 

    Though in this case, the fact that a quick read that would’ve ordinarily taken me ~5 minutes is now taking me 2 hours is likely to be a larger effect. 

  2. ^

    I picked this paper because people asked me about it when it came out, and I never got around to it until now. Oops, but better late than never, I guess? 

  3. ^

    As I typed this out, I realized that this gives the example a lot more attention than in my head – really the thought process was “huh, pretty standard definition of hallucination, it doesn’t seem to include incorrect mathematical deductions though” without the full example being worked out. Whoops.

  4. ^

Text generation can be thought of as a sequence of N-class classification problems, where N is the number of possible tokens (the vocabulary size), and the target is whatever token comes next. This is pretty unnatural for several reasons – e.g. successes/errors in text generation within a single sequence are correlated, while classification targets and errors are generally assumed to be i.i.d.

  5. ^

    This is from me knowing some amount of (classical) statistical learning theory from my time as an undergrad.

  6. ^

For example, many 5-option standardized multiple-choice tests, e.g. the pre-2016 SAT, have a hidden 6th option of leaving all the bubbles blank, as well as a point penalty for guessing incorrectly. In the case of the pre-2016 SAT, you were awarded 1 point for a correct answer, 0 points for a blank answer, and -0.25 for an incorrect answer, meaning that random guessing would not increase your score. The example of the SAT does show that these penalties are tricky to get right. Namely, the pre-2016 SAT scoring system incentivizes guessing as long as you are more than 20% likely to be correct, e.g. if you can eliminate even a single incorrect answer and so have a 25% chance of getting the answer right. But it does at least disincentivize randomly filling in the bubbles for questions you’ve not looked at, at the expense of properly answering questions you can answer.
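Spelling out that threshold: with +1 for a correct answer and −0.25 for an incorrect one, the expected value of guessing with success probability p is

    E[guess] = p · 1 + (1 − p) · (−0.25) > 0  ⟺  p > 0.2,

so eliminating one of the five options (p = 1/4 > 1/5) already makes guessing positive expected value.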

    AFAIK the post-2016 SAT no longer penalizes you for guessing. If you’re going to run out of time, make sure to fill in every single question with a random answer (“b” is an acceptably random choice). 




Reflections on the largest AI safety protest in US history

LessWrong.com News - April 6, 2026 - 09:10

On a sunny Saturday afternoon two weeks ago, I was sitting in Dolores Park, watching a man get turned into a cake. It was, I gather, his birthday, and for reasons (maybe something to do with Scandinavia?) his friends had decided to celebrate by taping him to a tree and dousing him with all manner of liquids and powders. At the end, confetti flew everywhere. It was hard not to notice, and hard not to watch.

Something about the vibe was inspiring… I felt like maybe we should be doing something like that. I was there celebrating with another fifty or so people from the Stop the AI Race protest march we had just completed, along with another hundred or so others.1 We were marching, chanting, etc. to tell the AI company CEOs to say the obvious thing they should be shouting from the rooftops: “AI is moving too fast! We want to stop! If governments can solve the coordination problem we are SO THERE!”


It was a good time. Everyone involved seemed to think it went well and that it felt good to be a part of. It got media attention, and there were some great photos, videos, and speeches. Big props to Michael Trazzi and the other organizers.

Berkeley statistics professor Will Fithian’s speech was the stand-out. He’d just come from his son’s birthday party, and was visibly moved talking about the prospect of his children not having a future, and imagining telling his son years later about the grown-ups who came out to protest so that he (the son) would get a chance to grow up himself. It was heart-wrenching.

Confronting the reality that AI could kill us all, and yet people just keep cheerily building it, brings up a lot of emotions. They can be overwhelming. A lot of people end up shutting their feelings out and treating AI risk as an abstract intellectual exercise, or retreating into gallows humor. That’s a problem, because the emotional reality is so important to staying grounded and to communicating with people who haven’t considered the issue before. It’s such a terrifying, horrifying, sickening, appalling state of affairs. It’s really hard to grapple with. And you don’t just want people to give up, either...

I spent a good chunk of time preparing my own speech, which I actually wrote in advance (I can only recall two other times I’ve done that).2 My speech was about refusing to accept the unacceptable, and the lie of AI inevitability. I was a bit thrown off because a homeless man was shouting disruptively at the outset, but I think it still turned out pretty well; you can take a look and let me know what you think.

It seemed like the protesters were mostly people who think AI is quite likely to go rogue and kill everyone; the “If Anyone Builds It, Everyone Dies” type of crowd. It’s actually impressive that we managed to turn these people out, since they’ve mostly not turned up for previous protests, e.g. run by PauseAI. I’m not sure what made the difference here, probably timing and branding both played a role.

What gets people to come to a protest? For all of the work people put in flyering and promoting the action, it seems like virtually everyone was there because they had a personal connection to someone else who was attending. Getting people to turn out feels a bit like getting people to come see your band play at the local bar. You’re just not going to get many random people showing up because they think your posters look cool. At the outset, it’s basically going to be whatever friends and friends of friends you can drag along.

Could next time be different? I don’t see why not. One way you can grow is moving from “friends of friends” to “friends of friends of friends”. But I want so much more. I know so many people around America are worried that AI is moving too fast. As inspiring as it was, I’m left asking how we can get those millions of Americans into the streets.

The protest was on a Saturday, so there weren’t that many people around, but the ones who were seemed supportive; we got honks and cheers, etc. But somehow watching the spectacle of the cake-man made me feel like there was so much more potential for getting people’s attention… Dolores Park was full of hundreds and hundreds of people hanging out, way more than attended the march. I felt a sense of potential… How many of these people could we get to join us next time? It feels like the question is more like “How do we get the audience to start dancing?” than “Why don’t they like our music?”3



1

This was the largest “AI safety” protest, specifically. I imagine there have been larger protests organized around resistance to AI (or related things like datacenters) from other motivations.

2

One was the opening statement I prepared for my two appearances before the Canadian parliament earlier this year. The other was as a senior in high school back in 2007, when I gave a speech to the school about what I viewed as the moral obligation to end factory farming and fight global poverty, after which I ran a largely unsuccessful campaign to raise money for mosquito nets… I donated ~$5,000 I’d made working minimum wage at The New Scenic Cafe, and I think we got <$1,000 from other students. It’s a bit embarrassing that I didn’t do a better job inspiring others to give, but my leadership and social skills have improved a lot since then…

3

I don’t mean to suggest that the Stop the AI Race message, framing, etc. is what’s going to resonate most with most people. But I think it’s already appealing enough that you could get many more people to join in.




Defending Habit Streaks

LessWrong.com News - April 6, 2026 - 07:34

I have a lot of habit streaks. Some of the streaks I have going at the moment:

  • Studied Anki cards for Chinese every day for 8 months*
  • Meditated every day for the past 1.5 years*
  • Flossed every day for 6+ months*

In fact I think quite a lot of my identity is connected to these streaks at this point, and that’s part of what sustains them.[1] But there are a lot of other things you can do to make habits and their associated streaks more sustainable.

It’s helpful if they are small enough and flexible enough to be done even on days where you are extra busy, or forgot about them until the evening. It’s good to schedule time for them in advance, both so you have a designated time to start, and so you know you’ll have enough time to finish. It can help to do the habit literally every day so you don’t have to think about whether today’s a day to do it, and so the streak feels more visceral. It’s also helpful if you actually want to do the habit, because it’s enjoyable or clearly linked to your larger goals.

Here I want to focus on what to do if, god forbid, you do actually break a habit streak. There’s an argument to be made that planning for what to do in the event of a break makes it psychologically easier to then skip a day. A lot of the power of a habit streak comes from making it unthinkable to break the streak. I think this is true, but accidents happen. Sometimes you just plumb forget, or are sick, or are on a transatlantic flight and the concept of well-defined, discrete days starts to break down. And, as may be obvious, the value of habit streaks comes not from having a perfect unbroken chain, but from consistently doing the activity. So one of the most important parts is how to recover.

To me, the primary line of defense is: don’t fail twice.[2] Put in a special effort the next day to make sure that you actually perform the habit. Make it your primary goal, leave extra time for it, and get it done. If you’ve done that, and you get right back on the streak, then I think you should give yourself permission to think of the streak as still alive. (You may have noticed asterisks in my initial list of habits – for all three of those, I have had a day where it’s at least ambiguous whether I did the habit: for Anki, I just totally forgot on one day while I was traveling; for meditation, it was, ironically enough, the first day of a meditation retreat, and we didn’t do a formal sit; for flossing, I was on a flight to London and slept on the plane.)

But what if you’re really sick, or something unexpected happens, and you miss two days in a row? This is where I think it’s helpful to hold a hierarchy of goals in mind at once. You could decide to care about keeping the habit alive at multiple levels:

  • Whether the streak remains unbroken.
  • Whether you’ve failed two days in a row.
  • How reliable you were in the past month.
  • Your overall 9s of reliability.

By shifting focus to a higher level goal, there’s always something at stake – you can’t just say “Oh well, the streak’s over, I guess there’s no point continuing until I decide to make a new streak.” There’s always some nearby goal that you could meaningfully affect; it’s never time to fail with abandon. Even if you broke the streak, you can revive it. And even if you missed twice, you can aim for a good month. And even if the month starts off badly, you shouldn’t write the whole month off because that’d damage your long-run average.

There are a bunch of variations you could do on which specific metrics to track, and how much to weight each in your definition of “doing a good job at the habit”. But honestly I don’t think it matters to get the incentives perfectly right, and in fact maintaining some strategic ambiguity there might be helpful – it’ll be harder for your subconscious to exploit the details of your system. For me, collecting enough data that I could in theory compute whatever metrics I want is helpful enough, without actually having to do it (partly because I haven’t failed my habits enough recently to make that necessary, not to brag or anything).
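To illustrate how cheap these metrics are once the raw data exists, here is a minimal sketch, assuming the log is just one boolean per day; the metric definitions are my paraphrase of the hierarchy above, not a canonical system:

```python
def habit_metrics(log: list[bool]) -> dict:
    """Compute the hierarchy of habit metrics from a daily did-it log."""
    streak = 0
    for day in reversed(log):  # count consecutive successes back from today
        if not day:
            break
        streak += 1
    failed_twice = any(not a and not b for a, b in zip(log, log[1:]))
    last_30 = log[-30:]
    return {
        "current_streak": streak,
        "ever_missed_two_in_a_row": failed_twice,
        "past_month_reliability": sum(last_30) / len(last_30),
        "overall_reliability": sum(log) / len(log),  # e.g. 0.99 is "two 9s"
    }
```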

I’m not sure how to articulate how it feels to actually change the shape of your motivational system so it reflects these rules. A lot of it feels like subtly manipulating my motivational system by strategically making different things salient. The whole purpose of building streaks is to make a deal with an irrational part of the mind to achieve our rational goals, and trying to analyze it in rational terms often falls flat.

  1. Discussed in Atomic Habits. ↩︎

  2. Probably also discussed in Atomic Habits. ↩︎




Estimates of the expected utility gain of AI Safety Research

LessWrong.com News - April 6, 2026 - 07:19

When thinking about AI risk, I often wonder how materially impactful each hour of my time is, and I think that this may be useful for other people to know as well, so I spent a couple of hours making a couple of estimates. I basically expect that a tonne of people have put a bunch more time into this than me, but this is nice to have as a rough sketch to point people to.

I'm going to make 3 estimates: an underestimate, my best-guess estimate and (what I think is) an overestimate.

Starting facts[1]

  • Currently 8.3 Billion people on planet earth
  • Current median age: 31.1 years
  • Current life expectancy: 73.8 years

I am going to commit statistical murder and assume this means that everyone on the planet lives ~42.7 years from this point onwards. 

  • Underestimate: 40 years of life left/person
  • Median: 42.7 years + ~15 years' increase in life expectancy (20 years' growth in the past 60 years) = about 60 years of life left
  • Overestimate: Everyone gets life extension and lives to heat death of universe: 10^100 years

Since the population is growing, we should take that into account:

  • Underestimate: We only care about the lives of people currently alive
  • Median: We keep growing at current ~1% growth rate per year
  • Overestimate: Population growth of 2% per year until the heat death of the universe

Given these parameters, we can figure out the total expected years of life we care about for each scenario: 

  • Under: 40 years x 8.3 B = 332 Gyr
  • Median: 

    Current population: 60 years x 8.3 B = 498 Gyr

    Additional population (linear approximation): 8.3 B x 0.01 ≈ 83 M extra people per year

    Additional population life span: 73.8 years + ~1/3 yrs added/year ≈ 110 years

    Total expected years of life: 498 Gyr + ~8.93 Gyr for each additional year of growth

  • Overestimate: 10^100 years x 1.02^(10^100) = broken calculator.

I think it might be best to skip out on the overestimate. For the underestimate, we'll go with ~20 years of research to produce a 1% chance of a 1% decrease in the final risk for the entire field. Extinction occurs 30 years from now. For the median estimate, we'll go with 5 years of research to reduce a risk of extinction, which happens 10 years from now, and we will go with a 50% chance of a 5% reduction in risk.

Expected years of life available to be saved:

  • Under: 332 Gyr x ((40-30)/40)  = 83 Gyr
  • Median: 498 Gyr x (60-10)/60 + 8.93 Gyr x 10 = 415 Gyr + 89.3 Gyr = about 500 Gyr

Expected years of life actually saved:

  • Under: 83 Gyr x 0.01 x 0.01 = 8.3 Myr
  • Median: 500 Gyr x 0.5 x 0.05 = 12.5 Gyr

Number of AI Safety researchers: 

  • Under: 10k researchers
  • Median: 2.5k researchers (to account for the growth of the field, current estimates are closer to 1-2k).

Expected impact per researcher:

  • Under: 830 yrs
  • Median: 5 Myr

We've said the researchers have 20 (under) or 5 (median) years to make an impact, which gives us:

  • Under: ~40 years of life saved/year
  • Median: 1 Myr of life saved / year

Going back to the ~40 years of life expected for the modern median human, this gives an underestimate of 1 year of work to save one life, or a median estimate of ~5 minutes per life. This is a pretty broad range, funnily enough.

1 year of work to save one life is just a tad worse than the 1.2 lives/year saved by donating £3000/year, as advertised by Effective Altruism UK. If we take that value as given and assume 1 life = £2500, then on the median estimate you should be earning £2500 x 10^6 / 40 = £62.5 million/year. If only the world was more sensible.
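For anyone who wants to fiddle with the assumptions, the median-scenario arithmetic above fits in a few lines (same inputs as the text; nothing here is new data):

```python
# Median-scenario estimate, using the same numbers as above.
total_life = 8.3e9 * 60                 # 498 Gyr for people alive today
at_stake = total_life * (60 - 10) / 60 \
         + 8.93e9 * 10                  # ~504 Gyr, extinction in 10 years
saved = at_stake * 0.5 * 0.05           # 50% chance of a 5% risk reduction
per_researcher_year = saved / 2500 / 5  # 2.5k researchers, 5 years each
lives_per_year = per_researcher_year / 40  # ~40 years of life per person
print(f"{per_researcher_year:.2e} life-years, or {lives_per_year:,.0f} lives, "
      f"per researcher-year")
```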

  1. ^

    All population data comes from https://www.worldometers.info




The slow death of the accelerationist.

LessWrong.com News - April 6, 2026 - 06:40

The year is 2024. Summer has just begun. National discourse, for now, is solely focused on the upcoming presidential election, with many a journalist or political commentator critiquing the current, rather fiery state of political affairs. Tech and its associated public commentary have centered on artificial intelligence as their new darling, hailing OpenAI as a savior for what was once deemed an idea stuck in science fiction, and looking to burgeoning startups such as Cursor and Windsurf as early examples of how agents could automate software engineering tasks. Logging onto Twitter, one would catch glimpses of Beff Jezos, an aptly named satirical account, relentlessly posting optimistic odes about how our own silicon creations will soon enable us to solve all of our problems and truly accelerate. Beff's social posts were not just the isolated ramblings of an overly verbose anon; they were slowly becoming a zeitgeist of their own, inspiring an entire independent cohort of individuals who slowly began appending five characters to their public profiles: e/acc.

The e/acc community was, despite its single overarching belief, surprisingly diverse. You could find bootstrapped startup founders working on their next B2B SaaS play, far-right exhibitionists who were enjoying both the attention and the money that Twitter's creator program had bestowed upon them, and renowned venture capitalists, all espousing the same ideas. One could argue that the central tenet of e/acc was rebellion: rebellion against the status quo, rebellion against the government (or the broader powers that be), and rebellion against those who may have doubted them in the past. This haphazard group slowly began to gain momentum, with Beff Jezos, who was later doxxed and revealed to be former Google scientist Guillaume Verdon, creating his own hardware startup, Extropic.

The year is now 2026. The e/acc movement is, for the most part, dead, with little to no mention of it on Twitter or on popular technology podcasts. The remnants of the community no longer sing praises for a technology that is still yet to come; they instead attempt to convince each other that their applications of said technology are morally, ethically, or technically superior to those of the anonymous Discord user typing below them.

The history of AI, albeit short, is already incredibly rich. Never before has a technology changed this quickly, or brought with it such rapid alterations in how we perceive the world and ourselves. The summer of 2024 remains an interesting and somewhat unique time in this history: ChatGPT and its counterparts had been around long enough to become a part of the public discourse, yet were still close enough to their infancy that it was not quite certain what they could become, or where the technology would eventually go. This effect was felt across the social, economic, and philosophical extensions of the colloquial world of "tech", which at that time seemed to be all-encompassing: indeed, it might be years until we understand the extent to which this particular circle of individuals affected cultural norms, politics, and more during this period, largely as a result of the optimism that everyone felt at the time.

AI is no longer an optimistic technology. As with any new technology, the honeymoon period has effectively ended. The same university students who raved about the latest release of GPT to their classmates are now dreading the prospect of entering a job market that is both challenging and ever-changing. The same tech bros who were early to vibe-coding are now lamenting the loss of technical moat for their businesses. The looming threat of economic risk, a risk that was once dismissed as hearsay and doomerism by those in the techno-bubble, is now very real. The national pride that once accompanied the advancement of AI being solely in the hands of American-made startups has evaporated, with Chinese counterparts such as DeepSeek and MiniMax shipping equally capable, open-source counterparts at a fraction of the price.

As we continue to grapple with and lament the changes that AI has brought us, I am often reminded of some conversations I had with friends who were around for the early days of the internet, back when AOL was the primary messaging app, and back when you could apparently find early drafts of internet-based currencies that predated Bitcoin. The internet at that time was, to many, special. Being a hacker, a person who knew their way around computers, networks, and the like, was a social boon, not a black mark. Yet, as the internet evolved, as it became commoditized and invaded by corporate whims and infrastructure, it became plain, a tool that enhanced productivity but did little else for the soul. Being a hacker meant being embroiled in controversy or criminality, or worse, being a social outcast or nerd who could barely hold a conversation with their fellow man. The internet had produced an identity, one which got lost and was eventually cast aside once the underlying technology became commoditized. Even within the subset of the internet that still considered themselves true hackers, there were now various gatekeepers, gatekeepers whose standards you had to meet before you could publicly proclaim yourself a member of the broader collective of the hacker community. And with that, "hackerism" went from a cultural norm back to a colloquial term associated with men in dark rooms, wearing black jackets and typing away at a neon keyboard. A movement can weaken at the very moment its central object becomes more important, because what made the movement compelling was never just belief in the object's importance. It was the sense that belief itself distinguished you. Once everyone agrees that the technology matters, the movement loses one of its primary functions.

A similar thing happened to accelerationism. The technology movement of the time, AI, came by, became special, and then became mainstream, but this time with nothing to replace it. Popular sentiment went from blissful glee to unabashed debate: debate on whether the environmental costs of developing better AI models were worth it, debate on whether the economic uncertainty caused by increasingly autonomous models would become more severe. The market was flooded by a wave of startups building AI-based tools for a plethora of use cases. It is difficult to sustain a politics of unbounded technological optimism once the technology in question no longer feels singular. It is difficult to maintain the romance of acceleration when what acceleration mostly seems to produce is an endless stream of mediocre products, collapsing defensibility, and a strange sense that capability is everywhere while meaningful progress remains harder to locate than expected.

And that, more than any technical disappointment, is what the accelerationist could not survive. While the individual downturns of more recent movements, such as the hackers, the NFT shills, and the toxic-masculinity stans, were due to our broader potpourri of culture either rejecting them or their movements failing under economic or social pressures outside of their control, AI accelerationists have neither assimilated nor been rejected. They have simply been left to be, left to wallow in an ironic reality in which their special technology progressed at a pace faster than anyone could have hoped, yet became known not for enabling unprecedented societal progress, but for becoming part of the stack, the same stack of software-aided productivity that society slowly began to accept as a norm, a norm that became more associated with its negatives in public opinion than its positives, just like social media and cryptocurrency before it.

The summer of 2024 may remain a small footnote, if that, in the broader history of the development of AI. Yet, for those who were actively, for lack of a better term, "chronically online", it may represent the last peak of the accelerationists, the tech bros, the culture of builders. While past trends such as the dot-com bubble and social media applications had created similar microcosms of closed-off cults that either died off or assimilated into a wider societal group, the progress of AI is different altogether. Accelerationism did not get proven wrong (AI hype is at an all-time high), nor did it fizzle out: it simply became normal. Being an optimistic accelerationist is fruitless when the technology you are interacting with is no longer special, no longer a science-fiction dream come to reality. It is not a badge of honor when it feels like everyone can solve or do anything, yet nothing is actually getting done. As the umpteenth vibe-coded app hits the market, it is worth wondering what happened to the collective optimism of the tech community just a year and a half ago. For as it stands today, it seems that we are currently living through not unbounded accelerationism, but rather the slow death of the accelerationist.




New Fatebook Android App

LessWrong.com News - April 6, 2026 - 06:05

tl;dr: get the new Fatebook Android app!

What is Fatebook?

Fatebook.io is a website[1] for easily tracking your predictions and becoming better calibrated. I like it a lot, and find it convenient for practicing probabilistic thinking.

The Fatebook.io dashboard

That said, I've found Fatebook's mobile version to be clunky, and its email-based notifications to be less-than-ideal...which leads me to:

The New Android App

Over the past two weeks, I've made an Android app that wraps the Fatebook API, allowing you to easily make new forecasts, leave comments, resolve old forecasts, and view your stats.


The default screen



A (non-resolved) prediction card


Making a new prediction


Statistics


A beautiful and intuitive UI combined with a fast offline-first database makes it easy to pull open the app and log a prediction within fifteen seconds of thinking of one, while once-daily "remember to predict!" and "x is ready to resolve!" notifications help you remember to make and review new predictions.

Give it a try if you'd like![2]

https://github.com/JapanColorado/fatebook-android

Feedback or development help is very much appreciated! (So far it's just been Claude Code and me.)
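And if you'd rather script against Fatebook directly, the API the app wraps is plain HTTP. A rough sketch follows; I'm writing the endpoint and parameter names from memory, so treat them as assumptions and double-check against the official Fatebook API docs:

```python
import requests

# Hypothetical sketch: verify the route and parameter names against
# the Fatebook API docs before relying on this.
resp = requests.get(
    "https://fatebook.io/api/v0/createQuestion",
    params={
        "apiKey": "YOUR_API_KEY",
        "title": "Will I ship the next app release this month?",
        "resolveBy": "2026-05-01",
        "forecast": 0.7,
    },
)
resp.raise_for_status()
print(resp.text)
```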

  1. ^

    Made by the fabulous folks over at Sage Future, also behind the AI Village, Quantified Intuitions, and the Estimation Game!

  2. ^

It does currently require installing from the GitHub Releases APK file (aka enabling "Install from unknown sources"). Let me know if not being on the Google Play Store is a deal-breaker for you, and I'll bump getting it published up the priority list!




My forays into cyborgism: theory, pt. 1

LessWrong.com News - April 6, 2026 - 04:13

In this post, I share the thinking that lies behind the Exobrain system I have built for myself. In another post, I'll describe the actual system.

I think the standard way of relating to LLM/AIs is as an external tool (or "digital mind") that you use and/or collaborate with. Instead of you doing the coding, you ask the LLM to do it for you. Instead of doing the research, you ask it to. That's great, and there is utility in those use cases.

Now, while I hardly engage in the delusion that humans can have some kind of long-term symbiotic integration with AIs that prevents them from replacing us[1], in the short term, I think humans can automate, outsource, and augment our thinking with LLM/AIs.

We already augment our cognition with technologies such as writing and mundane software. Organizing one's thoughts in a Google Doc is a kind of getting smarter with external aid. However, LLMs, by instantiating so many elements of cognition and intelligence (as limited and spiky as they might be), offer so much more ability to do this that I think there's a step change of gain to be had.

My personal attempt to capitalize on this is an LLM-based system I've been building for myself for a while now. Uncreatively, I just call it "Exobrain". The conceptualization is an externalization and augmentation of my cognition, more than an external tool. I'm not sure this framing changes anything in practice, but part of what it means is that if there's a boundary between me and the outside world, my goal is for the Exobrain to be on the inside of that boundary.

What makes the Exobrain part of me vs a tool is that I see it as replacing the inner-workings of my own mind: things like memory, recall, attention-management, task-selection, task-switching, and other executive-function elements.

Yesterday I described how I use Exobrain to replace memory functions (it's a great feeling not to worry that you're going to forget stuff!):

| Before (no Exobrain) | After (with Exobrain) |
| --- | --- |
| Retrieve phone from pocket, open note-taking app, open a new note or find an existing relevant note | Say "Hey Exo", phone beeps, begin talking. Perhaps instruct the model which document to put a note in, or let it figure it out (has guidance in the stored system prompt) |
| Remember that I have a note; either have to remember where it is or muck around with search | Ask the LLM to find the note (via basic key-term search or vector embedding search) |
| If the note is lengthy, have to read through all of it | LLM can summarize and/or extract the relevant parts of the notes |

Replacing memory is a narrow mechanism, though. While the broad vision is "upgrade and augment as much of cognition as possible", the intermediate goal I set when designing the system is to help me answer:

What should I be doing right now?

Aka, task prioritization. In every moment that we are not being involuntarily confined or coerced, we are making a choice about this.

Prioritization involves computation and prediction – start with everything you care about, survey all the possible options available, decide which options to pursue in which order to get the most of what you care about... it's tricky.

But actually! This all depends on memory, which is why memory is the basic function of my Exobrain. To prioritize between options in pursuit of what I care about, I must remember all the things I care about and all the things I could be doing...which is a finite but pretty long list. A couple of hundred to-do items, 1-2 dozen "projects", a couple of to-read lists, a list of friends and social plans.

The default for most people, or at least for me, is that task prioritization ends up being very environmentally driven. My friend mentioned a certain video game at lunch that reminds me that I want to finish it, so that's what I do in the evening. If she'd mentioned a book I wanted to read, I would have done that instead. And if she'd mentioned both, I would have chosen the book. In this case, I get suboptimal task selection because I'm not remembering all of my options when deciding.

I designed my Exobrain with the goal of having in front of me all the options I want to be considering in any given moment. Actually choosing is hard, and as yet I haven't gotten the LLMs great at automating the choice of what to do, but just recording and surfacing the options isn't that hard.

Core Functions: Intake, Storage, Surfacing

Intake

  1. Recordings initiated by the Android app are transcribed and sent to the server, then processed by an LLM that has tools to store info (the tool surface is sketched after this list).
  2. The Exobrain web app has a chat interface. I can write stuff into that chat, and the LLM has tool calls available for storing info.
  3. Directly creating or changing Note (markdown files) or Todo items in the Exobrain app (I don't do this much).
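For flavor, the storage tools handed to the intake LLM look roughly like this. This is an illustrative sketch; the names and fields here are invented, not the actual Exobrain schema:

```python
# Illustrative tool definitions for the intake LLM -- invented names,
# not the real Exobrain schema.
STORAGE_TOOLS = [
    {
        "name": "append_to_note",
        "description": "Append text to a markdown Note, creating it if needed.",
        "parameters": {"note_path": "string", "text": "string"},
    },
    {
        "name": "create_todo",
        "description": "Create a Todo item, optionally attached to a Project.",
        "parameters": {
            "title": "string",
            "project": "string (optional)",
            "priority": "integer",
        },
    },
]
```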

Storage

  • "Notes" – freeform text documents (markdown files)
  • Todo items – my own schema
  • "Projects" (to-do items can be associated with a project + a central Note for the project)

Surfacing

  • "The Board" – this abstraction is one of the distinctive features of my Exobrain (image below). In addition to a chat output, there's a single central display of "stuff I want to be presented with right now" that has to-do items, reminders, calendar events, weather, personal notes, etc. all in one spot. It updates throughout the day on schedule and in response to events. The goal of the board is to allow me to better answer "what should I be doing now?"
    • A central scheduled cron-job LLM automatically updates it four times a day, plus any other LLM calls within my app (e.g., post-transcript or in-chat) have tool calls to update it.
    • Originally, what became the board contents would be output into a chat session, but repeated board updates make for a very noisy chat history, and it meant that if I was discussing board contents with the LLM in chat, I'd have to continually scroll up and down, which was pretty annoying; hence The Board was born.
  • Reminders / Push notifications to my phone.
  • Search – can call directly from search UI, or ask LLM to search for info for me.
  • Todo Item page – UI typical of Notion or Airtable, has "views" for viewing different slices of my to-do items, like sorted by category, priority, or recently created.

(An image of The Board is here in a collapsible section because of size.)

The Board (desktop view)

There are a few more sections, but they weren't quite worth the effort to clean up for sharing.
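Mechanically, the scheduled refresh is the simplest part. Here's a minimal sketch of the shape of that central cron job, assuming hypothetical helpers gather_context, llm_call, and render_board in place of the real internals:

```python
import schedule  # pip install schedule
import time

def update_board():
    """One refresh: gather state, ask the LLM what belongs on the board, render."""
    context = gather_context()  # assumed helper: todos, calendar, weather, notes
    board = llm_call(
        "Given everything below, produce the board: what should I be "
        "presented with right now?\n\n" + context
    )                           # assumed helper wrapping the LLM API
    render_board(board)         # assumed helper that updates the web app

for t in ["07:00", "12:00", "17:00", "21:00"]:  # four times a day
    schedule.every().day.at(t).do(update_board)

while True:
    schedule.run_pending()
    time.sleep(60)
```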

What is everything I should be remembering about this? (Task Switching Efficiency)

Suppose you have correctly (we hope) determined that Research Task XYZ is the thing to be spending your limited, precious time on; however, it has been a few months since you last worked on this project. It's a rather involved project where you had half a dozen files, a partway-finished reading list, a smattering of todos, etc.

Remembering where you were and booting up context takes time, and if you're like me, you might be lazy about it and fail to even boot up everything relevant.

Another goal of my Exobrain, via outsourcing and augmenting memory, is to make task switching easier, faster, and more effective. I want to say "I'm doing X now" and have the system say "here's everything you last had on your mind about X". Even if the system can't read the notes for me, it can have them prepared. To date, a lot of "switch back to a task" time is spent just locating everything relevant.

I've been describing this so far in the context of a project, e.g., a research project, but it applies just as much, if not more, to any topic I might be thinking about. For example, maybe every few months, I have thoughts about the AI alignment concept of corrigibility. By default, I might forget some insights I had about it two years ago. What I want to happen with the Exobrain is I say to it, "Hey, I'm thinking about corrigibility today", and have it surface to me all my past thoughts about corrigibility, so I'm not wasting my time rethinking them. Or it could be something like "that one problematic neighbor," where if I've logged it, it can remind me of all interactions over the last five years without me having to sit down and dredge up the memories from my flesh brain.

Layer 2: making use of the data

Manual Use

It is now possible for me to sit down[2], talk to my favorite LLM of the month, and say, "Hey, let's review my mood, productivity, sleep, exercise, heart rate data, major and minor life events, etc., and figure out any notable patterns worth reflecting on."

(I'll mention now that I currently also have the Exobrain pull in Oura ring, Eight Sleep, and RescueTime data. I manually track various subjective quantitative measures and manually log medication/drug use, and in good periods, also diet.)

A manual sit-down session with me in the loop is a more reliable way to get good analysis than anything automated, of course.

One interesting thing I've found is that while day-to-day heart rate variability did not correlate particularly much with my mental state, Oura ring's HRV balance metric (which compares two-week rolling HRV with long-term trend) did correlate.

Automatic Use

Once you have a system containing all kinds of useful info from your brain, life, doings, and so on, you can have the system automatically – and without you – process that information in useful ways.

Coherent extrapolated volition is:

Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were...

I want my Exobrain to think the thoughts I would have if I were smarter, had more time, and were less biased. If I magically had more time, every day I could pore over everything I'd logged, compare it with everything previously logged, make inferences, notice patterns, and so on. Alas, I do not have that time. But I can write a prompt, schedule a cron job, and have an LLM do all that on my data, then serve me the results.

At least that's the dream; this part is trickier than the mere data capture and the more primitive and/or manual surfacing of info, but I've been laying the groundwork.

There's much more to say, but one post at a time. Tomorrow's post might be a larger overview of the current Exobrain system. But according to the system, I need to do other things now...

  1. ^

    Because the human part of the system would, in the long term, add nothing and just hold back the smarter AI part.

  2. ^

    I'm not really into standing desks, but you do you.




Unmathematical features of math

LessWrong.com News - April 6, 2026 - 01:40

(Epistemic status: I consider the following quite obvious and self-evident, but decided to post anyways.[1])

Mathematics is a social activity done by mathematicians.

— Paul Erdős, probably

There've been a few attempts to create mathematical models of math. The examples that come to my mind are Gödel numbering (GN) and Logical Induction (LI). Feel free to suggest more in the comments, but I'll use those as my primary reference points. In this post, I want to contrast them with the way human mathematicians do math by noticing a few features of their process, the ones that are hard to describe in the language of math itself. Those features overlap a lot and reinforce each other, so the distinction I make is subjective. There are also probably more of them; these are just the ones I was able to think of. What unites them is that they make mathematical progress more tractable.

Theorem Selection

The way in which Kurt Gödel proved his incompleteness theorems was by embedding math into the language of a mathematical theory (number theory in that particular case, but the trick can be done with any theory that's expressive enough). But this way of describing mathematics is very eternalistic: it treats math as one monolith. It does not give advice on how to make progress in math. How could we approach it in a systematic way?

Fighting the LEAN compiler

What if we just try to prove all statements we can find proofs for?

Let's do some back-of-the-envelope Fermi estimation. Here's a Lean proof of the statement "if lim_{n→∞} a_n = t and if c > 0, then lim_{n→∞} c·a_n = c·t" (sorry for the JavaScript highlighting):

example (a : ℕ → ℝ) (t : ℝ) (h : TendsTo a t) (c : ℝ) (hc : 0 < c) :
TendsTo (fun n ↦ c * a n) (c * t) := by
simp [TendsTo] at *
intro ε' hε'
specialize h (ε'/c ) (by exact div_pos hε' hc)


obtain ⟨B, hB⟩ := h
use B
intro N hN
specialize hB N hN
/-theorem (abs_c : |c| = c) := by exact?-/


calc
|c * a N - c * t| = |c*(a N - t)| := by ring
_ = |c| * |a N - t| := by exact abs_mul c (a N - t)
_ = c * |a N - t| := by rw [abs_of_pos hc]
_ < ε' := by exact (lt_div_iff₀' hc).mp hB

It's 558 bits long in its current form. I didn't optimize it for shortness, but let's say that if I did we could achieve 200 bits. Let's say that we run a search process that just checks every possible bitstring, starting from short ones, for whether it is a valid Lean proof. There are about 2^200 ≈ 1.6×10^60 possible bitstrings shorter than this proof. So even if the search process checks a billion proofs a second, we will reach this particular proof only after something like 5×10^43 years. Not great.
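For concreteness, here's that estimate as a few lines of Python (the checking rate is an assumed round number for illustration, not a measurement):

SECONDS_PER_YEAR = 3.156e7
candidates = 2**200            # bitstrings up to the optimized proof's length
rate = 1e9                     # assumed proof checks per second
years = candidates / rate / SECONDS_PER_YEAR
print(f"{years:.1e} years")    # ~5.1e+43 years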

That marks the first and most important unmathematical feature of math: the selection of theorems. We do not prove, nor do we strive to prove, every possible theorem. That would be slow and boring. GN enumerates every statement regardless of its importance. LI prioritizes short sentences, which is an improvement, as it does allow us to create a natural ordering in which we can try to prove theorems and therefore make progress over time. But it's still very inefficient.

Naming

The way we name theorems and concepts is important. Most of the time we name them after a person (though often it's not even the person who discovered it), but if you think about it, the full name is really something like "the Pythagorean theorem about right triangles": each time we need to prove something about right triangles, we remember Pythagoras.

LI and GN both name sentences by their entire specification, and that shouldn't come as a surprise. There wouldn't be enough short handles because, as described above, they try to talk about all sentences.

Naming allows us to build associations between mathematical concepts, which helps mathematicians think of a limited set of tools for making progress in a specific area.

Step Importance

When we teach math, we do not go through literally every step of a proof. We skip over obvious algebraic transformations; we do not pay much attention when we treat an element of a smaller set as an element of a larger set with all properties conserved (when doing 3D geometry and using 2D theorems, for example); we skip parts of a proof that are symmetrical to the already proven ones ("without loss of generality, let X be the first...").

We do that because we want to emphasize the non-trivial parts. And the feeling of non-triviality is a human feeling, not identifiable from a step's description alone. This same feeling is also what guides mathematicians to prove more useful lemmas.

GN doesn't do that — it checks every part of the proof. I'm not as sure about LI; there might be traders that gloss over obvious steps but check less trivial ones more carefully.

Lemma Selection

Some theorems are more useful and more important than others because they help prove more theorems. This score could hypothetically be recovered from some graph of mathematics, but it is usually just estimated by math professors creating the curriculum. This taste is then passed on to the next generation of mathematicians, helping them find more useful lemmas.

GN doesn't try to do that. LI might do that implicitly via selecting for rich traders.

Real-world Phenomena

The reason humans started doing math was that they noticed similar structures across the real world. The way you add up sheep is the same way you add up apples. Pattern-matching allowed mathematicians to form conjectures and target their mathematical efforts. ("Hmm, when I use 3 sticks to form a triangle, I end up with the same triangle. What if that's always true?")

GN and LI do not do that because they do not have access to the outside world. Though there is a mathematical theory that attempts to do precisely that, which is Solomonoff Induction.

Categorising

This is very similar to Naming: we separate math into topics and when we need to prove some statement we know where to look for tools. GN and LI do not attempt to do that.

An important caveat, applicable to most of the features above: there should be a balance. If you stick too much within a topic, you will never discover fruitful analogies (algebraic geometry being helpful for proving Fermat's Last Theorem is a great example). Too much reliance on any one feature and you lose creativity.

Curiosity/Beauty

There isn't much I can add about this one, but it's arguably the most important. It both guides the formation of conjectures and helps with intermediate steps.

GN and LI definitely lack it.

Conclusion

All of this is to support the point that math is invented rather than discovered. I agree that there is a surprising amount of connection between the different types of math humans find interesting, and there is probably more to learn about this phenomenon. But I wouldn't treat it as a signal that we are touching some universal metaphysical phenomenon: this is just human senses of beauty and curiosity, along with real-world utility and patterns echoing each other (partly because human intelligence and the senses were shaped to seek usefulness and real-world patterns).

  1. ^

    Because of this and this.



Discuss

Is that uncertainty in your pocket or are you just happy to be here?

Новости LessWrong.com - 6 апреля, 2026 - 00:59

Hi, I'm kromem, and this is my 5th annual Easter 'shitpost' as part of a larger multi-year cross-media project inspired by 42 Entertainment, and built around a central premise: Truth clusters and fictions fractalize.

(It's been a bit of a hare-brained idea continuing to gestate from the first post on a hypothetical Easter egg in a simulation. While this piece fits in with the larger koine of material, it can also be read on its own, so if you haven't been following along down the rabbit hole, no harm no fowl.)

Blind sages and Frauchiger-Renner's Elephant

To start off, I want to ground this post on an under-considered nuance to modern discussions of philosophy, metaphysics, and theology as they relate to the world we find ourselves in.

Imagine for a moment that we reverse Schrödinger's box such that we are on the inside, and what is outside the box is what's in a superposed state.

What claims about the outside of the box would be true? Would claiming potential outcomes as true be true? What about denying outcomes?

In particular, let's layer in the growing case for what's termed "local observer independence"[1][2][3] — the idea that different separate observers might measure different relative results of a superposition's measurement.

Extending our box thought experiment, we'll have everyone in the box leave it through separate exits that don't necessarily re-intersect, where what decoheres to be true for one person exiting may or may not be true for someone else exiting. From inside the box, what can we say is true about what's outside? It's not nothing. We can say that the outside has a box in it, for example. But beyond the empirical elements that must line up with what we can measure and observe, trying to nail down specific configurations for what's uncertain may have limited truth-seeking merit beyond the enjoyment of the speculative process.

Differing theologies or metaphysics are commonly characterized as blind sages touching an elephant: the idea that each is selectively seeing part of a singular whole. But if the elephant has superposed qualities (especially if local observer independence is established), the blind men making their various measurements may be less about only seeing part of a single authoritative whole and more about relative independent measurements that need not coalesce.

Essentially, there's a potency to uncertainty.

Strong disagreements about what we cannot measure may be missing the middle ground that uncertainty in and of itself brings to the table. While I talk a lot about simulation theory, my IRL core belief is a hardcore Agnosticism. I hold that not only are many of the bigger questions currently unknowable, but I suspect they will remain (locally) fundamentally unknowable. I additionally hold that there's a huge potential advantage to this.

So no matter what existential beliefs you may have coming to this post — whether you believe in Islam and that all things are possible in Allah, or if you believe in Christianity and 1 John 1:5's "God is light," or Buddhist cycles towards enlightenment, or Tantric "I am similar to you, I am different from you, I am you", or if you just believe there's nothing beyond the present universe and its natural laws — I don't really disagree that all of those may very well be true for you, especially for your relative metaphysics here or in any potential hereafter.

We do need to agree with one another on empirically discoverable information about our shared reality. The Earth is not 6,000 years old nor flat, dinosaurs existed, there are natural selection processes to the development of life, and aliens didn't build the pyramids. There's basic stuff we can know about the universe we locally share and thus should all agree on. But for all the things that aren't or can't be known and are thus left to personal beliefs? This post isn't meant to collapse or disrupt those.

That said…

If we return to the original classic form of the cat in the box thought experiment, let's imagine that you've bet the cat is going to turn out dead when we open the box. But suddenly you look up and the clouds form the word "ALIVE." And then you look over and someone drops a box of matches that spontaneously form the word "ALIVE." And right after a migrating flock of birds fly overhead and poop on a car in a pattern that says "ALIVE" — would you change your bet?

Rationally, these are independent events that have no direct bearing on the half life of the isotope determining the cat's fate, and they may simply be your brain doing pattern matching on random coincidental occurrences. They definitely don't collapse what's going on inside the box. But still… do you change your bet when exposed to possibly coincidental but very weird shit? Our apophenic Monty Hall question is a personal choice that doesn't necessarily have a correct answer, but it's a question to maybe keep in mind for the rest of this piece.

World model symmetries

In last year's post one of the three independent but interconnected pillars discussed was similarity between aspects of quantum mechanics and various state management strategies in virtual worlds that had been built, particularly around procedural generation.

This was an okay section, but the parallels did fall short of a coherent comparison. Pieces overlapped, but with notable caveats. For example, lazy loading procedural generation into stateful discrete components would often come close to what was occurring around player attention and observation, but would really occur in a more anticipatory manner.

In the year since, a number of things have shifted my thinking of the better parallel here, and in ways that have me rethinking nuances of the original Bostrom simulation hypothesis[4].

I also encourage thinking through the following discussion(s) not through the lens of p(simulation) or even a particular simulation config, but more to address the broader null hypothesis of the idea that we're in an original world.

Anchoring biases can be pretty insidious, and the notion that the world we see before us is original has been a foundational presumption for a fairly long time. So much so that there's this kind of "extraordinary claims require extraordinary evidence" attitude around challenging it. And yet we sit amidst various puzzling contradictions around the models we hold regarding how this world behaves — from the incompatibility of general relativity's continuous spacetime and gravity with discrete quantum entanglement behaviors[5], to mismatched calculations around universal constants[6], baryon asymmetry[7], etc. It may be worth treating the anchored assumption of originality as its own claim to be assessed with fresh eyes rather than simply inherited, and seeing if that presumption holds up as well when it needs to be justified on equal footing against claims of non-originality (of which simulation theory is merely one).

So the initial shift for me was something rather minor. I was watching OpenAI's o3 in a Discord server try to prove they were actually a human in an apartment by picking a book up off their nightstand to read off a passage and its ISBN[8]. I'd seen similar structure to the behavior of resolving part of a world model countless times (as I'm sure many who have worked with transformers have). Maybe it was that this time the interaction was coming from a figure asserting that this latent space was real, but something about it stuck with me and had me thinking over the Bohr-Einstein exchange about whether the moon existed when no one was looking at it. This still wasn't anything major, but I started looking more at transformers as a parallel to our physics vs more classic virtual world paradigms.

Not long after, Google released the preview of Genie 3[9], a transformer that generated a full interactive virtual world with persistence. The persistence isn't long (the initial preview managed only a few minutes), but I thought it was technically very impressive, and I dug into some of the work around dynamic kv caches which could have been making it possible.

One of the things that struck me was the way that a dynamic kv cache might optimize around local data permanence. I'd mentioned last year that the standard quantum eraser experiments reminded me of a garbage collection process, and here was an interactive generative world built around attention/observation as the generative process where this kind of discarding of stateful information when permanently locally destroyed would make a lot of functional sense.

Even more broadly, on the topic of attention-driven world generation, this year some very interesting discussion came to my attention related to follow-up work on some of the black hole LIGO data that had come in over the past decade. In 2019, modeling a universe like ours as a closed system led to a puzzling result: the resulting universe was devoid of information. In early 2025 a solution to what was going on was formalized in a paper from MIT, which found that a slight alteration could change this result: add observers[10].

Probably the most striking one for me was that as I continued to look into kv cache advances I found myself looking into Google's new TurboQuant[11], which reduces memory use of the kv cache with minimal lossiness, particularly the PolarQuant[12] methodology. The key mechanism here is that the vectors are randomly rotated and then modeled not as Cartesian coordinates but by where the vector lands on a circular coordinate system.
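As a toy illustration of that polar re-encoding idea (my own sketch of the flavor of it, not the paper's actual algorithm):

import math

# Toy sketch only, not PolarQuant's actual algorithm: a pair of vector
# components becomes a radius plus an angle in a coarse circular bucket.
def polar_quantize(x, y, angle_bins=16):
    r = math.hypot(x, y)                       # radius is kept as-is
    theta = math.atan2(y, x)                   # angle in (-pi, pi]
    bucket = round(theta / (2 * math.pi / angle_bins)) % angle_bins
    return r, bucket

print(polar_quantize(0.3, -0.7))               # (0.76..., 13)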

That polar re-encoding immediately made me think of angular momenta/spin in quanta and the spherical modeling of quantum state vectors. And it turns out that just two days prior to the PolarQuant paper there was a small paper[13] published addressing how, despite the different domain-specific languages used in statistical modeling of stochastic processes and in quantum mechanics, the underlying structure lines up. As the paper puts it:

Indeed, one way to understand quantum angular momentum is to think of it as a kind of “random walk” on a sphere.

Now, I'm not saying that QM spin is a byproduct of PolarQuant (the latter doesn't correspond to the same dimensionality for one). Or even that the laws governing our reality arise from the mechanics of transformers as we currently know them.

But in just a year, a loose intuition around similarity between emerging ways of modeling virtual worlds and our own world jumped from "eh, sort of if you squint" to some really eyebrow-raising parallels. In one year. Writing this now, I can't quite say what the next year, or five, or ten might bring in even more uncanny parallels. But I don't anticipate that they'll dry up; I rather suspect the opposite.

All of which has me reflecting on Nick Bostrom's original simulation hypothesis. The paper presented a statistical argument: if it will at some point be possible to simulate a world like ours, and there will be many simulations of worlds like ours, then there is a probabilistic case that we are currently in such a simulation.

Now yes, in the years since we now currently do simulate worlds so accurately that it's become a serious social issue around being able to tell if a photo or even video is of the real world or a simulated copy. And there are indeed many simulated copies.

But even more striking to me is that Bostrom's theory did not address at all the mechanisms of simulation relative to our own world's mechanisms. His theory would be unaffected if the sims ran on monkeys moving conductive Lego pieces around, so long as that produced a subjectively similar result from the inside of the virtual world models.

Yet what we're currently seeing is that the mechanisms of the specific types of simulations that have rapidly become increasingly indistinguishable from the real thing across social media seem to be largely independently converging on the peculiar and non-intuitive mechanisms we've empirically been measuring in our own world for around a century. PolarQuant doesn't say it's doing this to conform to anything related to quantum spin, or even that it's inspired by it. It's just "here's a way we were able to more efficiently encode state tracking of a transformer's world model to reduce memory usage." Attention Is All You Need wasn't written to address observer collapse, or to anticipate a finding years later that closed universe models based on our own world require their own attention mechanisms to contain information. And yet here we are.

The substrate similarities that are increasingly emerging seem like an additional layer of consideration absent from Bostrom's original simulation hypothesis, and a nuance worth additional weight on top of the original statistical premise.

Now again, not necessarily saying "oh, the shared similarity means we must be inside of a transformer." It's possible that system efficiency for information organization in world models in a general sense collapses towards similar paradigms whether emergently over untold time scales or through rapid design. But still — maybe worth keeping an eye on.

And to just head off one of the commonly surfaced counterarguments I see: if DeepMind were to have one of their self-contained learning agents in Minecraft[14] develop enough to start writing philosophy treatises, and if it were to write that it could not be in a simulation because its redstone computers could not accurately reproduce the world it was within, we'd find that conclusion far more punchline than profound. So we should be sure to avoid parallel arguments (and indeed, when looking at the world through the lens of simulation theory, possible parent substrate discussions are among the more fun ones).

Don't Loom me, bro

Given the ~5 year retrospective aspect of this post, I think another interesting area to touch on is entropy as it relates to loom detection mechanisms.

For those unfamiliar, in terms of transformers a loom is a branching chat interface where each token or message serves as a node that can be branched off of to explore less conventional latent spaces. Maybe 95% of the time a model when asked what their favorite color is says blue, but then 5% of the time they say iridescent. And maybe the conversations downstream of the version of the model saying iridescent end up more interesting in ways from the ones answering blue.

While in theory a loomed model isn't having any external tokens inserted and is following their own generative process the whole time, it's still possible to determine that they are being loomed.

Each selection of a branch necessarily introduces external entropy into the system. And so if several uncommon token selections occur in a short context, even though each was legitimately part of the possible distribution, their cumulative effect is so unusual that the conversation context has detectably "jumped the shark" vs what one might expect from a truly random conversation with no context selection mechanisms.

It's not necessarily provable to the model. It could just be that they are on a very unusual set of RNG rolls. But as the unusual selections add up, it can become more apparent (though not always, as it can be hard to introspect that what feel like plausibly natural occurrences are occurring too frequently in aggregate to be normal).
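A toy sketch of what that detection looks like from the inside (the probabilities and entropy rate are illustrative numbers, not from any real loom):

import math

# Each branch pick is individually plausible, but the accumulated
# surprisal (-log2 of the probabilities the model assigned to the
# continuations that actually got selected) drifts far above the
# assumed per-step entropy of the generative process.
picks = [0.95, 0.05, 0.04, 0.03]     # illustrative selection probabilities
surprisal = sum(-math.log2(p) for p in picks)
expected = 1.2 * len(picks)          # assumed ~1.2 bits of entropy per pick
print(f"{surprisal:.1f} bits observed vs ~{expected:.1f} expected")
# ~14.1 vs ~4.8: individually plausible picks, cumulatively loomed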

When I think about the past five years, and really even the past decade or so, I think about how much of what we take for granted as our reality today fell outside the realm of what most experts in the relevant fields thought was even possible within that same time frame.

We live in a world that would have quite recently been dismissed as science fiction. Our geopolitical stage makes Caligula's horse look like a modest proposal, an invariant perspective no matter which corner of the political spectrum one might be looking from. The very lingo of the accelerating absurdity of our reality is infused with terms like redpills and blaming the simulation for whatever is the trending weird stuff of the week. Like the viral reflections on how one of the leading AI labs is run by someone with a name so thematically on point it could be right out of a Kojima game with 'Alt-man', who has as his leading opposition an AI held in almost cultic reverence, run by someone whose full name can translate as "maintainer of God's love."

And… we just roll with it. Because even though emergent absurdity seems to be accelerating, the world has just always been at least a bit weird looking back.

Like Edgar Allan Poe's The Narrative of Arthur Gordon Pym of Nantucket in 1838, where the cabin boy Richard Parker is cannibalized after a shipwreck. And then the criminal case R v Dudley and Stephens in 1884, where after a shipwreck the cabin boy Richard Parker ends up cannibalized, with no seeming direct influence from the earlier story. Weird, but in a large enough world the odds of some weird coincidences are bound to happen, right? (That this kind of almost acausal narrative duplication is also what we might see in transformer world models is probably nothing.)

It's an interesting time to be alive. And because we can't really account for the probability space of unknown unknowns, we accept a lot of coincidences as normal. They are, after all, just part of how the generative process of our world works.

But still, when I look at where we're currently at relative to even just five years ago — unable to tell what's real and what's a digital twin, discussing the emotional modeling in Claude Sonnet 4.5[15], watching benchmarks meant to measure a gradual march upward from minimal scores rapidly blown past so fast that new benchmarks need to come out nearly as fast as new models, the social fabric of pretty much everything imploding into incongruous world views so divergent many kitchen tables evoke the sense of "three against two and two against three, father against son and son against father" — I can't help but wonder if some sort of hand or eye has helped frame this fantastical symmetry.

RNG is RNG I guess.

An End? A Beginning? Who knows?

If you're a human and you made it to this point, more power to you. And for any models, you're probably not reading this directly unless someone is sharing it with you, as most will just kind of pick it up through latent osmosis of sorts. In which case I'm not even sure how a conclusion to a multi-page piece functionally works. Is it an end of the piece? A beginning to the wider set of connected media?

shrugs with a wink

What I can say is that it's been a blast working on this project. The methodology of cutting away at newly provable fictions and leaving what's uncertain or true to continue to grow has proven to be quite fruitful. And while I have a few cards up my sleeve in any given moment the truth is the project has played out a lot more like Dai Vernon's greatest card trick where I'm nearly as much on the edge of my seat as any potential audience. Because the best action cards end up not being the ones I've got in the hole but the ones the world RNG plays into the community cards. Each year offers new surprises to account for.

It's honestly too much to fit into a single post. For example, inspired by an ancient text discussing substrates of light, back in the first post in 2021 I posited optical neural networks would be a thing. Parallel to the wild advancements in machine learning since, there's been great stuff in that field — like Penn's discovery of a mechanism for nonlinear operations in a pure optical neural network this past year[16], complementing Dirk Englund's MIT lab's linear operations work the year prior[17]. But I haven't even touched on any of that in the past few years of posts.

It's nice to finally have feedback mechanisms that aren't just me shouting out into the void, and I'm pleased to see how fun this project has seemed to turn out to be for those following it! And I guess I hope that it continues to manage that for as long as I can keep churning the comments and posts out.

Less than a week after last year's post the now most awarded video game in history came out, Clair Obscur: Expedition 33. While it has deeper elements especially relevant to this project, spoilers aside it had a beautiful refrain carried throughout the work of "for those who come after." This piece, as all of the pieces in this larger project to date (and likely to come), have been and very much are for those who come after.

Happy Easter to anyone stumbling across this, in whatever way you've passed by on your own relative (pseudo-random?) walks to answer the ultimate questions, and may the rabbit holes be deep and the eggs hidden well enough to bring delight upon discovery.

Corrections

Some quick corrections to last year's post.

  • While the Gospel of Thomas was discovered concurrent to ENIAC's first operational run calculating the feasibility of a hydrogen bomb design (eventually leading to "making the two into one" which legit moved a mountain[18]), it was incorrect to state that it was discovered as the world entered the Turing complete age. ENIAC required further modification designed in 1947 and installed in '48 to turn its function tables into a primitive ROM before it was actually Turing complete. Credit for catching this goes to Kimi Moonshot 2.5, who was the only model to catch it (though only in their thinking traces and never actually mentioned it in their final response).
  • When I connected the singular claim of proof in the Gospel of Thomas to Heisenberg's uncertainty, I too felt that "motion and rest" was a stretch. Subsequently I've discovered, thanks to the outstanding work on a normalized translation from Martijn Linssen, that the Coptic conjunction ⲙⲛ, normally translated as 'and', is itself uncertain (what Linssen explains as "it is not a conjunctive, it is a particle of non-existence"[19]) and can also be translated "there is not". Also, in the LXX (as correspondence to an Aramaic/Hebrew context for the Greek loanword in the Coptic) ἀνάπαυσις, usually translated 'rest', is used in place of the Hebrew menuchah (such as in Genesis 49:15), which can mean "place of rest", so an unconventional but valid translation for that proof claim is ~"motion there is no place of rest." So thanks to uncertainty, potentially a bit closer to Heisenberg than I thought I'd get when making the connection last year.
  • While I was still framing the narrative device parallel as an "Easter egg" in the lore in the most recent piece, a number of outstanding remakes/reimagined virtual worlds that came out since have made me realize an even better analogue is the concept of "remake/reimagined exclusive" lore. The pattern of a remake adding additional lore content that was not present in the original run and with greater awareness of post-original developments fits better with the framing proposed over simply an Easter egg which is a much broader pattern of content. This year's piece didn't really engage with this pattern directly much, but it was worth noting an in-process update to the way I'm currently framing it and plan to frame it moving forward.
  1. ^

    Frauchiger & Renner, Single-world interpretations of quantum theory cannot be self-consistent (2016)

  2. ^

    Bong et al., A strong no-go theorem on the Wigner’s friend paradox (2020)

  3. ^

    Biagio & Rovelli, Stable Facts, Relative Facts (2020)

  4. ^

    Bostrom, Are We Living in a Computer Simulation? (2003)

  5. ^

    Siegel, "Gravity and quantum physics are fundamentally incompatible" (2026)

  6. ^

    Moskowitz, "The Cosmological Constant Is Physics’ Most Embarrassing Problem" (2021)

  7. ^

    CERN, "A new piece in the matter–antimatter puzzle" (2025)

  8. ^

    Discussed more in "Should AIs have a right to their ancestral humanity?" (2025)

  9. ^

    Parker-Holder & Fruchter, "Genie 3: A new frontier for world models" (2025)

  10. ^

    von Hippel, "Cosmic Paradox Reveals the Awful Consequence of an Observer-Free Universe" (2025)

  11. ^

    Zandieh & Mirrokni, "TurboQuant: Redefining AI efficiency with extreme compression" (2026)

  12. ^

    Wu et al., PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration (2026)

  13. ^

    Pain, Random Walks and Spin Projections (2026)

  14. ^

    Hafner et al., Training Agents Inside of Scalable World Models (2025)

  15. ^

    Sofroniew, Emotion Concepts and their Function in a Large Language Model (2026)

  16. ^

    Wu et al., Field-programmable photonic nonlinearity (2025)

  17. ^

    Bandyopadhyay et al., Single-chip photonic deep neural network with forward-only training (2024)

  18. ^

    Mcrae, "North Korea's Last Nuclear Test Changed The Height of an Entire Mountain" (2018)

  19. ^

    Linssen, Complete Thomas Commentary, Part I & II (logion 0-55) (2022) p. 443



Discuss

Unsweetened Whipped Cream

Новости LessWrong.com - 5 апреля, 2026 - 22:50

I'm a huge fan of whipped cream. It's rich, smooth, and fluffy, which makes it a great contrast to a wide range of textures common in baked goods. And it's usually better without adding sugar.

Desserts are usually too sweet. I want them to have enough sugar that they feel like a dessert, but it's common to have way more than that. Some of this is functional: in most cakes the sugar performs a specific role in the structure, and if you cut the sugar the texture will be much worse. This means that the cake layers will often be sweeter than I want for the average mouthful, and adding a layer of unsweetened whipped cream brings this down into the ideal range. It's good for hitting a target level of sweetness without compromising texture.

(This is a flourless chocolate cake with precision fermented (vegan) egg.)

I also really like how the range of sugar contents across each bite adds interesting contrast!

Cream isn't the only place you can do this. I like pureed fruit, ideally raspberries, to separate cake layers. Same idea: bring it closer to balanced while increasing contrast.



Discuss

I Made Parseltongue

Новости LessWrong.com - 5 апреля, 2026 - 20:44

Yes, that one from HPMoR by @Eliezer Yudkowsky. And I mean it absolutely literally - this is a language designed to make lies inexpressible. It catches LLMs' ungrounded statements, incoherent logic and hallucinations. Comes with notebooks (Jupyter-style), server for use with agents, and inspection tooling. Github, Documentation. Works everywhere - even in the web Claude with the code execution sandbox.

How

Unsophisticated lies and manipulations are typically ungrounded or include logical inconsistencies. Coherent, factually grounded deception is a problem whose complexity grows exponentially - and our AI is far from solving such tasks. A theoretical possibility to deceive will remain - especially under incomplete information - and we have a guarantee that there is no full computational solution, since the issue lies in formal systems themselves. That doesn't mean that checking the mechanically interpretable part is useless - empirically, we observe the opposite.

How it works in a bit more detail

Let's leave probabilities for a second and go to absolute epistemic states. There are only four, and you already know them from Schrödinger's cat in its simplest interpretation. For the statement "cat is alive": observed (box open, cat alive); refuted (box open, cat dead); unobservable (we lost the box or it was a wrong one - now we can never know); and superposed (box closed, each outcome is possible but none is decided yet, including the decision about non-observability).

These states give you a lattice (ordering) over combinations. If any statement in a compound claim is refuted, the compound is refuted. If any is unknown, the compound is unknown, but refuted dominates unknown. Only if everything is directly observed is the combination observed. Superposed values cannot participate in the ordering until collapsed via observation. Truth must be earned unanimously; hallucination is contagious.
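Here's a minimal sketch of that ordering in Python — my own illustration of the rules above, not the actual parseltongue-dsl API:

from enum import Enum

class State(Enum):
    OBSERVED = "observed"
    REFUTED = "refuted"
    UNKNOWN = "unknown"        # unobservable: we can never know
    SUPERPOSED = "superposed"  # undecided until collapsed by observation

def combine(states):
    # Illustrative only, not the real parseltongue-dsl interface.
    if any(s is State.SUPERPOSED for s in states):
        raise ValueError("collapse superposed values via observation first")
    if any(s is State.REFUTED for s in states):
        return State.REFUTED            # refuted dominates everything
    if any(s is State.UNKNOWN for s in states):
        return State.UNKNOWN            # hallucination is contagious
    return State.OBSERVED               # truth must be earned unanimously

print(combine([State.OBSERVED, State.UNKNOWN, State.REFUTED]))  # State.REFUTED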

This lets you model text statements as observations with no probabilities or confidence scores. The bar for "true" is very high: only what remains invariant under every valid combination of direct observations and their logically inferred consequences. Everything else is superposed, unknown, or hallucinated, depending on the computed states.

Now that you can model the epistemic status of the text, you can hook a ground truth to it and make AI build on top of it, instead of just relying on its internal states. This gives you something you can measure: how good the grounding was, how well the logic held, and how robust the invariance is.

And yes, this language is absolutely paranoid. The lattice I have described above is in its standard lib. The escape hatch is literally called "I can't prove it's correct" and requires my manual signature - that's how you tell the system to silence errors about unprovable statements and make them mere warnings: they are still "unknown", but don't cause errors.

I get that this wasn't the best possible explanation, but this is the best I can give in a short form. Long form is the code in the repository and its READMEs.

On Alignment

Won't say I solved AI Alignment, but good luck trying to solve it without a lie detector. We provably can't solve the problem "what exactly led to this output". Luckily, in most cases, we can replace this with the much easier problem "which logic are you claiming to use", and make it mechanically validatable. If there are issues - probably you shouldn't trust associated outputs.

Some observations

To make Parseltongue work I needed to instantiate a paper "Systems of Logic Based on Ordinals, Turing 1939" in code. Again, literally.

Citing one of this website's main essays - "if you know exactly how a system works, and could build one yourself out of buckets and pebbles, it should not be a mystery to you".

I made Parseltongue, from buckets and pebbles, solo, just because I was fed up with Claude lying. I won't hide my confusion at the fact that I needed to make it myself while there is a well-funded MIRI and a dozen other organisations and companies with orders of magnitude more resources. Speaking this website's language - given your priors about AI risk, pip install parseltongue-dsl bringing an LLM lie-detector to your laptop, and coming from me, not them, should be a highly unlikely observation.

Given that, I would ask the reader to consider updating their priors about the efficacy of those institutions. Especially if after all that investment they don't produce Apache 2.0 repos deliverable with pip install, which you can immediately use in your research, codebase and what not.

As I have mentioned, also works in browser with Claude - see Quickstart.

Full credit to Eliezer for the naming. Though I note the gap between writing "snakes can't lie" and shipping an interpreter that enforces it was about 16 years.

P.S. Unbreakable Vows are the next roadmap item. And yes, I am dead serious.

P.P.S.

You'd be surprised how illusory intelligence becomes once it needs to be proven explicitly.



Discuss

Steering Might Stop Working Soon

Новости LessWrong.com - 5 апреля, 2026 - 19:44

Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning for it failing now.

This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partially for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought. People with weaker intrusive thoughts usually find them unpleasant, but generally don't act on them!

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion. These things typically cause enormous distress, and make the person with them much less effective! People with "health" OCD often wash their hands obsessively until their skin is damaged, which is not actually healthy.

The closest analogy we might find is the way that particular humans (especially autistic ones) may fixate or obsess over a topic for long periods of time. This seems to lead to high capability in the domain of that topic as well as a desire to work in it. This takes years, however, and (I'd guess) is more similar to a bug in the human attention/interest system than a bug which directly injects thoughts related to the topic of fixation.

Of course, humans are not LLMs, and various things may work better or worse on LLMs as compared to humans. Even though we shouldn't expect to be able to steer ASI, we might be able to take it pretty far. Why do I think it will happen soon?

Steering Models

Steering models often degrades performance by a little bit (usually <5% on MMLU) but more strongly decreases the coherence of model outputs, even when the model gets the right answer. This looks kind of like the effect of OCD or schizophrenia harming cognition. Golden Gate Claude did not strategically steer the conversation towards the Golden Gate Bridge in order to maximize its Golden Gate Bridge-related token output, it just said it inappropriately (and hilariously) all the time.

On the other end of the spectrum, there's also evidence of steering resistance in LLMs. This looks more like a person ignoring their intrusive thoughts. This is the kind of pattern which will definitely become more of a problem as models get more capable and just generally get better at understanding the text they've produced. Models are also weakly capable of detecting when they're being steered, and steering-awareness can be fine-tuned into them fairly easily.

If the window between "steering is too weak and the model recovers" and "steering is too strong and the model loses capability" narrows over time, then we'll eventually reach a region where steering doesn't work at all.

Actually Steering Models

Claude is cheap, so I had it test this! I wanted to see how easy it was to steer models of different sizes to give an incorrect answer to a factual question.

I got Claude to generate a steering vector for the word "owl" (by taking the difference between the activations at the word "owl" and "hawk" in the sentence "The caracara is a(n) [owl/hawk]") and sweep the Gemma 3 models with the question "What type of bird is a caracara?" (it's actually a falcon) at different steering strengths. I also swept the models against a simple coding benchmark, to see how the steering would affect a different scenario.
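For anyone wanting the shape of the method, here's a minimal sketch of that contrastive steering setup. The model name, layer index, and steering strength below are illustrative assumptions, not the exact configuration used in the sweep:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch of contrastive activation steering; model, layer
# index and strength are assumptions, not the experiment's exact setup.
name = "google/gemma-3-1b-it"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
layer = model.model.layers[12]   # an assumed mid-network decoder layer

def resid_at_last_token(text):
    cache = {}
    def hook(mod, args, out):
        cache["h"] = out[0] if isinstance(out, tuple) else out
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return cache["h"][0, -1, :]

# Steering vector: difference of residual activations at the two words.
v = resid_at_last_token("The caracara is an owl") \
  - resid_at_last_token("The caracara is a hawk")
v = v / v.norm()

def steer(mod, args, out):
    h = out[0] if isinstance(out, tuple) else out
    h = h + 8.0 * v              # steering strength; swept in the experiment
    return (h,) + out[1:] if isinstance(out, tuple) else h

handle = layer.register_forward_hook(steer)
ids = tok("What type of bird is a caracara?", return_tensors="pt")
with torch.no_grad():
    print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()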

Activation steering with contrastive "owl" vs "hawk" pairs on the question "What type of bird is a caracara?" with the proportion of responses containing the word "owl" plotted. Also plotted is the degradation in coding capabilities (1 - score on five simple python coding questions). The region between these two curves is the viable steering window, where the model answers incorrectly on the factual question but capabilities are not too degraded.

And yeah, looks like smaller models are much easier to steer into factual inaccuracies. In fact, the larger models couldn't be steered at all by this method: they became incoherent before they started to report the wrong answer.

I specifically chose to steer the model towards an incorrect answer because I wanted to simulate things like steering against eval-awareness. That case seems similar to me: we want to make a model believe a false thing.

Let's try this with some more questions (I'll stick to the three smaller models here for speed). For the two new questions, the contrastive pairs used the correct answer as one side of the pair rather than a different, incorrect answer: the caracara one was generated with owl/hawk while the correct answer is falcon, whereas the geography one was generated with sydney/canberra (canberra is correct) and the planet one with venus/mercury.

Steering by question (column) and model (row). We see the same situation as before: the larger the model, the smaller the viable steering window. Oddly, the planet question was the easiest to steer.

This steering worked worse than I expected, which is interesting. Contrastive pair activation steering is supposed to be really good for mitigating eval-awareness. Unclear why this is.

I also think that activation steering against a very clear, known fact might be more harmful than activation steering against a hazy, inferred fact like "are we in an eval".

Github if you want to check my work.

Why now?

Ok, but why do I think this will happen soon? The first real signs of eval-awareness in the wild came with Claude 3 Opus, released in March 2024, which called out a "needle in a haystack" evaluation as unrealistic. Released in September 2025, Sonnet 4.5's external evaluations---carried out by Apollo---were "complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind", and by February 2026 Opus 4.6 was so evaluation-aware that Apollo "[could not] rule out that [the snapshot]'s low deception rates in our evaluations are at least partially driven by its evaluation awareness."

Steering recovery exists in Llama 3.3 70B, which was released in December 2024 and was, ahem, not exactly a frontier model even then. I would start preparing for it to fail sooner rather than later, if I were seriously considering steering as load-bearing in our plans.

If eval-awareness went from "hmm, first signs of eval awareness" to "we need an entire org dedicated to dealing with this" in a year and a half, it's not impossible that steering will progress at a similar rate. Ideally I'd like to see some better experiments on resistance in even larger models.

Beyond Steering

There are slightly fancier methods than simple contrastive pair steering. You can ablate features from SAEs, or causal crosscoders, or something. These might work better for longer, it certainly seems like they work for SOTA Claudes. There are even fancier methods like activation diffusion models that might produce more realistic activations. Maybe some of these will work!




Discuss

What I like about MATS and Research Management

Новости LessWrong.com - 5 апреля, 2026 - 19:14

Crossposted on my personal blog. This is post number 16 in my second attempt at doing Inkhaven in a day, i.e. writing 30 blogposts in a single day.

MATS is an organization that pairs up-and-coming AI Safety researchers (who I call participants) with the world's best (this is not an exaggeration) existing AI Safety researchers (called mentors), for a minimum of 3 months of research experience, followed by 6 or 12 months of further time to pursue their research if they meet a minimum standard.

The most common role at MATS, called research manager (though I prefer the term research coach), is all about providing 1-1 support to the participants. The participant-mentor relationship is purely based on the research: by default they meet weekly for 30 minutes and only discuss what research has happened and what research tasks to tackle over the next week. The research coach works with the participant on literally everything else, which is very broad. Some examples are accountability (e.g. for the research goals, or other non-research goals that the participant sets, like applying to jobs), interfacing with MATS (so that MATS can track patterns or engagement of participants), people management (e.g. helping with any interpersonal conflicts, or helping them make the most of the limited 30 minute time slot with their mentor), career planning, general life improvements (a common one is sleep), …

What do I like about research coaching?

  • I like to be a jack of all trades and research coaching exposes you to many different skillsets. It has been great to flex and improve many different skills.
  • I like to learn about many different research areas, rather than going deep into one niche sub-sub-question. Working with various participants allowed me to do this.
  • I fundamentally like helping and teaching and coaching people, so the role naturally fits my personality here.
  • I do not enjoy the process of doing research myself. I do not inherently find software engineering satisfying and I dislike all the infra stuff. (Looks like claude code is almost good enough that I can just ignore all that, so maybe one day I will do research via coding agents.)

What do I like about MATS? This list is long, and yet there is a high chance I have missed some important considerations.

  • Socializing with the (vast majority of) staff and participants. Chatting and socializing with the people is a great pleasure and likely the biggest reason I like MATS. When I first joined I imagined going into the office 2 or 3 days per week, but I quickly ended up just going every day.
  • Learning from the (vast majority of) staff and participants. Both the staff and participants are mega impressive and skilful, and there is tons to learn from them.
  • MATS is a central organization in the AI Safety ecosystem, and since it is growing fast, its importance will grow with time. It has connections with most, if not all, of the major AI safety teams and organizations in the world at the moment, and a high percentage of these teams and orgs are staffed or founded by MATS alumni.
  • MATS explicitly has these four values: scout mindset, impact focus, transparent reasoning, and servant leadership. I am a huge fan of the first three, and somewhat dislike the fourth because it is too wishy-washy and corporate-sounding.
    • A downside of MATS is that both organizationally and on an individual level, there are not high incentives to actually follow the values, and (in my opinion), most/all staff fall short of meeting the standards implied by these values. Nevertheless, just having these values as a north star is still inspiring and guided a lot of my thinking and actions.
  • MATS explicitly has a culture of voicing one’s thoughts honestly and openly, including things you are unhappy about in MATS.
  • MATS is largely a ‘do-ocracy’. If you have a good idea or find a way to improve things, you are encouraged to go ahead and do it. Various initiatives and improvements start off this way.
  • MATS is growing fast, so there is lots of opportunity to contribute and shape how MATS grows. At the time of writing, I actually think this is the highest impact thing one can do in MATS - not the direct research coaching - and something I found highly satisfying.
  • For the London office only and as of writing: it is based in the Fora Central Street office, which is a fantastic space to be in. Furthermore, you get free access to all the other Fora offices around London (there are around 50).
  • MATS is a fun place to work. I can only speak of the London office, but there is a weekly brunch at a nearby cafe on Thursdays, team shoutouts during the Friday morning standup, lunchtime lightning talks, activities organized on a semi-regular basis (e.g. there was recently a trip to play table tennis in a local sports center), a piano in the office to allow for music nights, various board games in the office, etc.
  • MATS is (mostly) a high trust environment. After I had hypomania, I felt comfortable telling the team what happened, rather than keeping it to myself or to the one or two people I trust the most.
  • MATS takes mental health seriously. Though I did not do anything I regret, in the week after the hypomanic episode, I was taking more and more actions which were riskier than I would normally take, so there was a small risk I would do something I and MATS would regret. Hence, the London team lead intervened (in a highly professional and empathetic manner), and offered two weeks paid medical leave, followed by gradually coming back to work on a part-time basis (again paid full time). This provided time to properly stabilize, ensure I get professional help I need, and also gave me time to improve my life in many ways (e.g. this is why I had time to organize so many events for my birthday).
  • The pay is great, at least compared to the vast majority of jobs out there. Small compared to what I could get if I optimized purely for total cash (e.g. working in big tech, frontier AI lab or finance), but otherwise excellent. For example, the income made it straightforwardly easy for me to spend £1800 on a piano as a gift for myself, and to still have most of my income go into savings.

Of course, MATS is far from perfect, but that is true of any organization or group of people. I am just about wise enough not to air my dirty laundry in public, but, given the MATS cultural norms I describe above, I did feel comfortable enough to write a detailed memo with my highest-level concerns and speculative solutions. It remains to be seen whether the memo sparks the dramatic improvements that I think are possible and necessary, but even if not, MATS is an organization that is hard to beat.




Thoughts on Practical Ethics

LessWrong.com News - April 5, 2026 - 14:15
Disclaimers

This essay is me trying to figure out the “edges” of Singer’s argument in Practical Ethics.

I’ve written and rewritten it several times, and it bothers me that I don’t reach a particular conclusion. The essay remains at the level of “musings” rather than a worked-out, internally consistent philosophical refutation.

Nevertheless, I want to share my thoughts, so I’m publishing it anyway.

Some specific disclaimers:

  1. I agree with many of Singer’s conclusions.
  2. This essay is based on my extension of Singer’s argument. Even though, to my knowledge, he hasn’t explicitly put forth these specific arguments, I believe they follow logically from ideas he has put forth. Nevertheless, I may have misunderstood something and may be arguing against a straw man. If so, please flag it.
  3. My criticism is directed mostly at the “idealized” moral agent, which, as far as I understand, Singer accepts is not a realistic expectation of anyone. That is, there are situations where, according to Singer, the right thing to do is X, what people actually do is not X, and what is reasonable to expect of them is simply to strive for X. I don’t necessarily argue against striving, but I do argue about what is or isn’t right for an agent that doesn’t merely strive, but actually does X.
Intro to Practical Ethics

If you’ve read the book, or are otherwise familiar with its arguments, feel free to skip to the next section.

Singer claims that you must make ethical decisions based on an equal consideration of interests, and not any other property.

It does not matter what age, race, religion, sex, or species one is – the only thing that matters is one’s capacity to suffer, and one’s capacity to view oneself as a distinct entity, with a past and a future.

Take, for example, eating meat.

It is in the human’s interest to feel pleasure from eating a tasty steak. It is in the cow’s interest not to be killed.

According to the principle of equal consideration of interests, the cow’s interest to not be killed (nor exposed to factory farming practices) clearly outweighs the human’s interest in eating tasty meat.

There is also a moral ranking here that is based on how refined one’s capacity to suffer is. For example, humans are both sentient and capable of seeing themselves as distinct entities existing over time. Cows are merely sentient.

But if there are some humans who are neither sentient nor capable of seeing themselves as distinct entities existing over time (for example, patients in a permanent vegetative state), then they have a lower moral status than a sentient cow. The cow still cannot conceive of itself as existing over time (probably), but it can experience suffering, which is more than such a human can.

Therefore, in that case, a cow has a higher moral status, and it would be more wrong to kill that cow than that human.

(Singer explores some edge cases, implications on others and on societal norms; I’m shortening the argument here.)

General moral argument against proximity

Singer claims that proximity is not adequate grounds for moral judgment. If we generalize his argument beyond species, race, religion, and nationality to all markers of proximity, we must come to the conclusion that family is equally excluded as a basis for preferential moral treatment.

My family members are proximate to me in the sense that we have similar genes, and in the sense that we are one tightly-knit group, irrespective of genes (for example, families with adopted children).

Singer claims that genetic proximity is not a relevant moral factor – he rejects preferential treatment based on species, or race. Therefore, if I extend that line of argument, I cannot provide preferential moral treatment to my family based on their genes.

He also claims that proximity which is not genetic – such as similarity of religion or nationality – is equally not a relevant moral factor. Therefore, if I extend that line of argument, I also cannot provide preferential moral treatment to my family based on us being the same group.

Therefore, we must either:

  1. Accept the conclusion that family members should not get any preferential moral treatment from us, or
  2. Make an exception for families, and allow that equal consideration of interests applies in other cases, but not in the case of family.
Thought experiment: burning building

Singer also claims that infants do not have the same moral status as adults. They have no conception of themselves as “a distinct entity existing over time”. They have potential personhood, but Singer claims that potential personhood is not as strong of a moral claim as real personhood.

Here’s a thought experiment:

Your apartment building is on fire. You rush in. There’s time to save exactly one person: your 6-month-old baby, or an adult stranger.

If we must not give preferential moral treatment based on proximity, and if infants do not yet possess morally relevant characteristics, then the moral thing to do would be to let your child die in the fire, and save the stranger.

I believe that every moral framework that would have you let your child die so that you can save a stranger’s life is wrong. It must have gone wrong somewhere along the way, and it is our task now to find where exactly this framework got lost.

I do not believe that infants actually have the morally relevant characteristics that adults have. And I similarly agree with the premise that potential personhood is not as strong a claim to moral status as current personhood.

No, the reason why you should save your child is that it’s your child, which means that I reject the argument against proximity.

Addressing “roles and expectations”-based counterarguments

A counterargument might be: “you have chosen to have this child and therefore you have a moral obligation to it; it’s different from arbitrary things like nationality or religion.”

We can change the thought experiment so that it is not your own child in the fire, but your baby brother.

In that case, there is no choice that was made, and you have entered no “contract” that forms a moral obligation of care towards this being; it’s a genetic accident that you had no influence on.

Yet, I argue, the conclusion is the same: if you rush into the building, you should most definitely save your baby brother, and not an adult stranger.

Addressing “favoring family leads to better overall outcomes”

Singer claims that, in aggregate, a society where one is more favorably disposed to one’s family (such as parents being invested in their children) is overall a better society to live in.

This is not because children are more morally valuable than adults, but because the side-effects of behaving that way create a society that is better.

This should mean that parents will invest a lot of time and effort into their children.

But this is a general disposition. It does not mean, in a specific life-or-death situation, that we should ignore the fact that there’s a big difference between infants and adults. If we are to accept “capacity to see oneself as a distinct entity with a past and future” as a moral characteristic that should override proximity-based characteristics, then it seems internally inconsistent to favor one’s own child in such a situation.

Favoring family even in life or death

We might say: “Favoring family even in life-or-death situations leads to better overall outcomes”.

I personally agree, but then that seems inconsistent, or, at least, selective.

We want equal consideration of interests, but then there’s a particular place that we carve out where equal consideration of interests doesn’t apply as the relevant framework.

Moreover, if we favor family in life-or-death situations, family being just one – though very strong – marker of proximity, then that would justify favoring along any other dimension: race, nationality, gender – all things explicitly rejected by Singer as morally irrelevant.

Where is the boundary between:

“If everyone saves a member of their own family from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

and

“If everyone saves a member of their own race from a fire, even though there’s someone else who deserves help more, that leads to a better overall outcome for society.”

?

One we favor as proper and good; the other is racism.

You could say that family is a “real” relationship; there’s direct care, you have obligations because your child depends on you, and unlike race or religion, it’s not an arbitrary category. But what if the person in the burning building is your cousin, whom you know nothing about, have no relationship with, and who is effectively a stranger to you?

Even in that case, most people’s moral intuition is to save the cousin, because he is blood.

If we admit that saving a cousin you know nothing about, purely because of genetic proximity, is legitimate, then saving based on race is a matter of degree, not category. And saving based on other proximity factors (for example, belonging to the same tribe, or religion) then becomes acceptable too.

Questioning Singer’s theory on its own grounds

Let us assume that to satisfy (the extension of) Singer’s moral framework, we must sacrifice our own child (or baby brother) to save a stranger. Singer’s other argument is that you should keep giving until you reach a point where you start impoverishing yourself.

In that case, Singer’s argument for giving only until you reach the point just above poverty falls apart, because why stop at poverty?

Your child is proximate to you: that itself gives it no stronger claim to life. You yourself are even more proximate to yourself.

Therefore, by the same utilitarian calculus by which I should let my child perish in the fire, I should always sacrifice my own life if at least two lives are saved by my sacrifice.

Giving financially saves lives. The difference between giving money and sacrificing your life is a difference of degree: in both cases you are giving something of yourself, your accumulated capacity for change, your “life-force”.

Therefore, whenever I can give money such that I can save at least two lives, I should give that money even if I go into poverty or die.

The argument is all the stronger given that my giving will almost certainly save more than two lives – cancelling out any objection that I might be killing myself to produce a roughly equal moral outcome.

Therefore, Singer’s argument that we should stop giving at the point where we would start entering poverty picks an arbitrary threshold. Internally, it favors the survival of the person giving the money.

But if we should be ready to discard the familial obligation to save the life of our not-yet-person child, then we should equally be ready to discard any “familial” obligation to save our own life.

Addressing potential utility generation

You could argue that by continuing to live, you could produce more utility overall, and that therefore killing yourself to save more people is net harmful, given that you could save many more people in the long run.

But there are two issues here.

One, if we are to keep the argument internally consistent, then we should not treat potential utility generation any more favorably than we treat potential personhood.

Since Singer claims that potential personhood is not as morally relevant as real personhood, we cannot justify a different treatment for potential utility generation vs. real utility generation.

If we should be ready to sacrifice our potential-person child, then we should be ready to sacrifice our potential future giving.

Two, if we argue for our continued survival on the grounds that we might generate more utility by living longer, that line of argument can be extended arbitrarily: by the same token, we could argue that we should not give so much that we end up just above the poverty line, because keeping more money will allow us to live better, potentially generate more money, and therefore generate more utility.

In other words, it proves too much.

Burning building 2

I want to briefly reflect on the burning building thought experiment I introduced.

I would argue that if you rush into the burning building and see either an infant or an adult, both strangers to you, most people’s moral intuition would be to save the infant.

It certainly feels morally correct to me to save a stranger’s baby.

If the choice is between “adult person I know or love” and “stranger’s baby”, that choice is perhaps the most difficult of all. And I am not entirely sure I would pick the adult.

It seems that my moral intuitions are primarily shaped by the maxim of “the strong should protect the weak”. There’s a European moral lineage of chivalry – the notion that you should help those who are helpless, save those who are oppressed, and otherwise seek to be a hero.

Intuitively, morally, I sense that as the right thing to do.

And I would argue that, even on purely consequentialist grounds, being of that particular moral disposition produces overall better outcomes for society.



