
An intro to the Tensor Economics blog

LessWrong.com News - October 29, 2025 - 19:24
Published on October 29, 2025 4:24 PM GMT

The Tensor Economics blog covers the economics of producing text from language models at scale.

The posts themselves are wonderfully detailed but somewhat overwhelming. I want to provide a summary of their work that might act as a guide. Then I’ll tie in some complementary data from SemiAnalysis.

Note that I am presenting their posts in a different order to make the discussion flow better. Each section selects a small amount from each of the original posts. Ideally, this post gives you the scaffolding to read their posts in full; they have a lot more insight.

Preludes

Time is money

Roughly 50% of AI data center spending is on hardware. Other CapEx[1] grows in proportion to hardware costs[2].

Hardware depreciates in value (and there are loans to pay off), so you need it to produce as many valuable tokens during its lifetime as possible. Naturally, our emphasis will be on the speed a GPU (and the supporting equipment) can do things.

Let me be clear about what it means for a GPU to depreciate. While GPUs can slow down, or get more faulty, or break, the primary source of depreciation is obsolescence. Nvidia keeps making faster GPUs. These are more expensive, but the higher speed counteracts the price and the overall cost per token is lower. At some point it makes sense to swap out your old (still functioning) GPU for a new one.
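
As a minimal sketch of that replacement logic (with invented prices and throughputs, not figures from the blog), you can compare cost per token for an old versus a new GPU:

```python
# Toy comparison of cost per token for an old vs. a new GPU.
# All prices and throughputs below are invented for illustration.

def cost_per_mtok(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Dollars per million tokens for a GPU at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1e6

old_gpu = cost_per_mtok(hourly_cost_usd=2.0, tokens_per_second=1500)   # assumed
new_gpu = cost_per_mtok(hourly_cost_usd=6.0, tokens_per_second=9000)   # assumed

print(f"old GPU: ${old_gpu:.2f}/Mtok, new GPU: ${new_gpu:.2f}/Mtok")
# If the new GPU's cost per token is lower, the old one is economically obsolete
# even though it still works.
```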

Defining some units

FLOP means “floating point operation”, such as an add or multiply. However, “FLOPs” and “FLOPS” are completely different units. FLOPs is merely the plural of FLOP, while FLOPS means FLOP/s. This is very dumb.

Most places assume people understand the difference. But clarity matters!

I’m going to use FLOP as plural (i.e. FLOP=FLOPs) and write FLOPS as FLOP/s. This is different from most texts but, in my opinion, easier to understand.

The rest of the units we will use are clearer. Memory bandwidth and communication speed are measured in gigabytes per second, or GB/s. We will talk about “tokens” a lot (see next section), which I’ll shorten to “tok”. One million tokens is “Mtok”.

How transformer inference works

Instead of a full explanation, I’ll highlight a few things and point to others who have explained it better. It’s not critical that you know this stuff, but I want this post to be self-contained.

Language models process tokens, which aren’t quite words, more like “wordlets”. The image below illustrates this nicely. Essentially, you break a text down into a sequence of tokens from a pre-defined list. You can then associate each possible token with an index on this pre-defined list.

Example showing how tokens aren’t quite words but sub-words or “wordlets”. Actual tokenization schemes use something more complicated like Byte-pair encoding.
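
As a toy illustration (a made-up mini-vocabulary with greedy longest-match lookup, not a real tokenizer), the mapping from text to token indices might look like this:

```python
# Toy "wordlet" tokenizer: greedily match the longest vocabulary entry at each position.
# The vocabulary is invented for illustration; real tokenizers learn vocabularies of
# ~100k entries with schemes like Byte-pair encoding.

vocab = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5}

def tokenize(text: str) -> list[int]:
    ids, i = [], 0
    while i < len(text):
        # Longest vocabulary entry that matches starting at position i.
        match = max((w for w in vocab if text.startswith(w, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token for {text[i]!r}")
        ids.append(vocab[match])
        i += len(match)
    return ids

print(tokenize("unbelievable tokens"))  # [0, 1, 2, 5, 3, 4]
```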

The next step is to take this list of input (prompt) tokens and in some sense embed the input into the LLM activations. Specifically, you do a forward pass through the transformer, applying the attention operation and feed forward blocks to the tokens in the prompt. This phase is called prefill or prompt-processing. It comes before the transformer can produce the first token of its response.

The attention operation applies to all pairs of tokens in the prompt and ensuing response. You can save a lot of computation by storing the results of the attention operation applied to pairs of tokens in the prompt. That way, each new token only needs to interact with previous tokens; you don’t need to redo the attention operation for pairs of tokens you’ve already seen.

This store of previous attention results is called the KV cache. The details aren’t too important, but you should know that:

  1. The KV cache is large enough that it needs to be stored in HBM and streamed into the GPU as needed.
  2. Longer prompts and longer LLM responses require a larger KV cache.
  3. A separate KV cache is required for each prompt. In other words, if you have 10 user requests, you need to store 10 KV caches.
  4. Typically the KV cache is much smaller than the weights of the model.
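
To make points 2 through 4 concrete, here is a rough sizing sketch using Llama3-70B’s published configuration (80 layers, 8 KV heads of dimension 128) and fp16 cache entries; treat the outputs as approximations:

```python
# Rough KV-cache sizing for a Llama3-70B-style model (grouped-query attention).
# The factor of 2 accounts for storing both Keys and Values; fp16 = 2 bytes per entry.

n_layers, n_kv_heads, head_dim, bytes_per_entry = 80, 8, 128, 2

def kv_cache_gb(context_tokens: int, batch_size: int = 1) -> float:
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_entry
    return per_token_bytes * context_tokens * batch_size / 1e9

print(kv_cache_gb(2048))       # ~0.67 GB for a single 2048-token request
print(kv_cache_gb(2048, 64))   # ~43 GB for a batch of 64 such requests
# Compared with ~141 GB of fp16 weights, one cache is small (point 4), but it grows
# with context length (point 2) and with the number of concurrent requests (point 3).
```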

This discussion has glossed over a lot of details. It’s not necessary for this post, but to understand LLM inference from a high level, I recommend starting at this video in a 3blue1brown series on deep learning. One caution is that the specific way attention is done and the overall architecture of leading LLMs continue to change, but the basic ideas are there.

This post from Tensor Economics covers inference in Llama 3 in more detail.

Weights must load from global memory creating two phases

There’s a tradeoff with memory: it can either be cheap or fast. There’s a whole memory hierarchy between small/fast/expensive memory and big/slow/cheap memory.

Some types of memory are located close to the processing center of the GPU. This memory can be accessed quickly, but it doesn’t have much capacity. It would be expensive to build more into the chip.

You can put lots of memory outside the chip for much lower cost, but this will be much slower. Even at light speed, the small distance from the chip adds meaningful delay to memory access.

Modern AI hardware has a large amount of off-chip memory located close to the GPU with a high-bandwidth connection to the chip. This is called high-bandwidth memory (HBM)[3][4].

To actually do inference, you need to load portions of the model from the HBM into the GPU itself for calculations. That takes precious time. So inference can be divided into two phases: a memory-bound phase where you wait for the weights to load, and a compute-bound phase where you do the actual computations. This is the key insight from which the next few sections derive.
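
To make that concrete, here is a minimal back-of-envelope check in the roofline style (the hardware figures are rough, assumed H100-class numbers, and the model size is arbitrary; none of it is taken from the posts):

```python
# Back-of-envelope check: is a step compute-bound or memory-bound?
# Assumed H100-class figures: ~1e15 FLOP/s of dense BF16 compute,
# ~3.35e12 bytes/s of HBM bandwidth.

PEAK_FLOPS = 1e15
HBM_BANDWIDTH = 3.35e12

def which_bound(flop: float, bytes_from_hbm: float) -> str:
    t_compute = flop / PEAK_FLOPS
    t_memory = bytes_from_hbm / HBM_BANDWIDTH
    label = "compute-bound" if t_compute > t_memory else "memory-bound"
    return f"compute {t_compute * 1e3:.2f} ms vs memory {t_memory * 1e3:.2f} ms -> {label}"

params = 8e9  # an arbitrary 8B-parameter model, stored in fp16 (2 bytes per parameter)

# A forward pass over many tokens at once: ~2 FLOP per parameter per token,
# while the weights only have to stream from HBM once.
print(which_bound(flop=2 * params * 4096, bytes_from_hbm=2 * params))

# Generating a single new token: the same weights stream in for 4096x fewer FLOP.
print(which_bound(flop=2 * params * 1, bytes_from_hbm=2 * params))
```

The point is only that the same hardware flips from compute-bound to memory-bound depending on how many tokens share each pass over the weights.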

FLOP/s/$: embeddings and the compute-bound phase

Why are embeddings so cheap? looks at the process of generating an “embedding” for a piece of text. Embeddings convert a document into a vector storing semantic information about the document. They’re a key part of using LLMs to search for relevant text.

Inference providers like OpenAI charge $60 per 1M output tokens, while embeddings cost 10 *cents* to process 1M tokens. This is strange because (as we’ll see) inference and embeddings involve quite similar steps. Hence the question in the title.

Computing an embedding

Producing an embedding involves a single forward pass to load the document information into language model activations. These activations form the embedding.

The process of converting the document into activations is essentially identical to the prefill phase. So while embeddings themselves aren’t exciting, they give us some insight into the economics of LLM inference.

The compute-bound phase

We always need to load weights into the GPU and then run the actual computation. Which phase is the bottleneck?

Note that embeddings typically use a small-ish language model, less than 10B parameters. In the post, the authors offer a quick calculation based on a Qwen3-8B embedding.

This is a quote from their post; I had to screenshot it because of the equations.

So it takes almost 5x longer to do the computations than it takes to load the model from memory. So embeddings are primarily compute-bound. We want a chip that performs FLOP as fast as possible. In other words, we want hardware with a high FLOP/s/$.

The authors then do lots of benchmarking to show that this intuition is correct.

Why embeddings are so cheap

One last point explains why embeddings are so much cheaper than normal inference. All the embeddings provided by different companies converge to similar representations[5]. That is, their products aren’t differentiated from each other. That creates a lot of competition:

Our contention is that since the cost of processing a token is so minuscule and all embedding models converge to similar semantic representations, this means that no provider enjoys a sustainable moat. This results in underlying low costs being passed directly onto the consumers, driving prices to rock bottom.

That’s the key takeaway here. A few other interesting tidbits:

  • They look at actual hardware and estimate that the true cost of embeddings could fall almost 10x to around 2 cents per Mtok.
  • Interestingly, the real-world FLOP/s/$ of the consumer RTX 4090 is better than that of the data-center H100.
  • The authors suggest that “[t]he embedding situation offers a nice pre-taste of the “intelligence involution” that is coming …”, hinting that they think the same competitive dynamics will come for LLM inference.

GB/s/$: inference and the memory-bound phase

LLM Inference Economics from First Principles is a massive post. They use Llama3-70B as their guide for the discussion. It’s open-source, easier to understand, and much of the industry has standardized around it.

Storing Llama3-70B takes 141 GB, which is more than the ~80 GB of HBM found on typical hardware like an H100. So right off the bat, you’ll need to split the model across multiple GPUs for inference.

The post goes into great detail on the FLOP required to do inference on this model. It’s a great reference, but beyond our scope.

Equipped with an understanding of how many FLOP are required for inference, they estimate how long it would take to do prefill (prompt processing) for an input of 2048 tokens (~6 pages of double-spaced text).

To do the computations, it would take 0.29 seconds. To load the weights from HBM, it would take 0.04 seconds. The computations are the bottleneck, so the prefill phase is compute bound, just like with embeddings.

But prefill is just one step; what about actually generating output tokens? After a detailed discussion of the decoding (token-producing) phase, they give an estimate: the computation for the 2049th output token would take 0.00014 seconds. Yes, that’s the correct number of zeros.

So the computation for generating a token is way, way faster than loading the weights from the HBM. This phase is primarily memory bound! That means you want hardware with high memory bandwidth; FLOP/s isn’t as important for this phase.
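
Those three numbers (0.29 s, 0.04 s, and 0.00014 s) are easy to sanity-check with the same back-of-envelope arithmetic, assuming roughly one H100’s peak of ~989 TFLOP/s (BF16) and ~3.35 TB/s of HBM bandwidth, which appears to be the scale the post works at; this is a sketch, not their exact model:

```python
# Sanity-checking the prefill vs. decode estimates for a Llama3-70B-scale model.
# Hardware figures are assumed, roughly one H100: ~989 TFLOP/s BF16, ~3.35 TB/s HBM.

PEAK_FLOPS = 989e12      # FLOP/s (assumed)
HBM_BW = 3.35e12         # bytes/s (assumed)
PARAMS = 70e9
WEIGHT_BYTES = 141e9     # fp16 weights, as in the post

prompt_tokens = 2048
prefill_compute_s = 2 * PARAMS * prompt_tokens / PEAK_FLOPS   # ~0.29 s
weight_load_s = WEIGHT_BYTES / HBM_BW                         # ~0.04 s
decode_one_token_s = 2 * PARAMS / PEAK_FLOPS                  # ~0.00014 s

print(f"prefill compute ~{prefill_compute_s:.2f} s, weight load ~{weight_load_s:.2f} s, "
      f"compute for one output token ~{decode_one_token_s:.5f} s")
# Prefill: the compute dominates (compute-bound). Decode: streaming 141 GB of weights
# for ~0.00014 s of math per token is exactly why generation is memory-bound.
```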

They say:

... one of the insights we hope you take out of reading this article - the token-by-token phase is memory bound; the amount of available compute throughput is of secondary importance, as we massively underutilize the compute resources anyway while waiting for weights to be loaded.

They go into some nuances about what happens at longer input lengths and consider various tricks to run inference in parallel across GPUs. Once again, out of scope. Instead, let’s look at the implications of being memory bound.

Batching

So no matter what, at each step of inference, you need to load the relevant parameters into the GPU and then do some calculations. Loading those parameters takes a fixed amount of time. Batching is the idea that, while the weights are loaded anyway, you also perform the necessary calculations on other customers’ inputs[6].

Imagine your job is to shred paper. You can either switch on the shredder and run it each time you get a piece of paper, or you wait until you’ve gathered a stack of paper, turn on the machine once, and shred a whole stack at the same time.

So batching waits for several customer requests and then runs inference on them all at the same time. There’s a fundamental tradeoff here: batching means that inference costs less money, but customers have to wait much longer to get their response. For example, batching 8 customers might cost 8x less and take 8x longer to get a response.

This is a big economy of scale for inference providers: the more customers you have, the lower the price you can charge. Having more GPUs to run inference in parallel helps too.
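
A toy model of this economy of scale, assuming a purely memory-bound decode step, made-up GPU rental prices, and ignoring KV-cache traffic, prefill, and how the weights are split across GPUs:

```python
# Toy batching model for the memory-bound decode phase: the time per step is dominated
# by streaming the weights once, no matter how many requests share that stream.

WEIGHT_BYTES = 141e9            # fp16 Llama3-70B-scale weights
HBM_BW = 3.35e12                # bytes/s, an assumed H100-scale figure
GPU_COST_PER_HOUR = 4 * 2.0     # four GPUs at an assumed $2/hr each

step_s = WEIGHT_BYTES / HBM_BW  # time to produce one token for every request in the batch
for batch in (1, 8, 64):
    tokens_per_s = batch / step_s
    usd_per_mtok = GPU_COST_PER_HOUR / 3600 / tokens_per_s * 1e6
    print(f"batch {batch:3d}: {tokens_per_s:7.1f} tok/s total, ~${usd_per_mtok:6.2f}/Mtok")
# Bigger batches spread the same weight-streaming time over more customers, so the cost
# per token falls roughly in proportion to the batch size. The latency cost (waiting for
# a batch to fill up and sharing the hardware) is not modeled here.
```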

GB/s/$ matters

After all this, they run some benchmarks to show that their model lines up with reality. Real world inefficiencies make the true performance a little worse than you would expect.

Having a detailed model lets them estimate how much it would cost to run Llama3-70B on four H100s: $1.72/Mtok. That’s 100x more expensive than computing an embedding, but much cheaper than the $60/Mtok of output that OpenAI charges[7]. Though OpenAI offers larger and more useful models than Llama3-70B, so the price may be justified.

Their post concludes by highlighting the importance of memory bandwidth for inference costs.

Interconnect and MoE inference

Now to the final post in our trilogy: MoE Inference Economics from First Principles. The last section was based on dense models like Llama3, but these have been replaced with mixture-of-experts (MoE) models. MoE models are typically larger but only activate a small fraction of their available parameters on each token.

For this post, the authors focus on DeepSeek V3.1, which has 671B parameters but only activates 37B at a time.

With so few parameters activated, you can be sneaky and load only the parameters you need for each token. As more users arrive, you need to load more and more of the model onto the GPUs. With enough users, you can spread the whole model across GPUs and route requests through them.
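
A toy way to see why the number of concurrent users matters so much for MoE (assuming, purely for illustration, that each token is routed uniformly at random to k of E experts, which real routers don’t do exactly; the expert counts are DeepSeek-V3-style approximations):

```python
# Expected fraction of an MoE layer's experts that must be resident, as a function of
# how many tokens are being processed together, under a toy uniform-routing assumption.

E, k = 256, 8    # roughly DeepSeek-V3-style: 256 routed experts, 8 active per token

def expected_fraction_loaded(tokens_in_flight: int) -> float:
    # P(a given expert is untouched by one token) = 1 - k/E; tokens treated as independent.
    return 1 - (1 - k / E) ** tokens_in_flight

for batch in (1, 8, 64, 512):
    print(f"{batch:4d} concurrent tokens -> ~{expected_fraction_loaded(batch):.0%} of experts needed")
# With one token you touch ~3% of the experts; with hundreds of concurrent tokens you
# effectively need the whole model resident, spread across many GPUs.
```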

So with MoE, you want more users, and there are even bigger returns to scale with more GPUs. But since we’re routing requests between GPUs, we need high-quality interconnects. Groups of GPUs joined by these interconnects are called nodes.

Scale has another implication: since MoE models want more concurrent users, they also need more KV cache space. Moving all those Key and Value matrices between GPUs requires high memory bandwidth as well, and it helps to have more GPUs and nodes to divide up the work of moving all this data around.

They say:

Increasing the number of nodes operating within a single setup has a beneficial effect not only on the end-to-end performance but also on the per node performance.

... [A]s we increase the number of nodes involved (the EP number), the per node performance increases.

The chart above reiterates the fundamental tradeoff with serving LLMs. Batching more users together means lower costs (i.e. higher speed per GPU), but your users also have to wait longer.

Next, they go into tons of detail on DeepSeek architecture, FLOP required for MoE inference, empirical data, etc. I’ll skip it and jump to the section “Hardware considerations and profit margins”.

Figure 24 confirms what we’ve been talking about: more GPUs means more performance. Also, notice that switching from H100s to GB200s with fancy interconnects (that’s the NVL72 rack, which uses NVLink) gives a huge performance boost.

Indeed, the communication speed between GPUs is super important. “There seems to be a massive performance gap between the numbers we estimate for B200s connected via InfiniBand and the ones connected by NVLink.”

In fact, B200s (newer GPUs) might not be worth it:

... [w]e believe that for model of a scale such as DeepSeek, running on B200s might be actually suboptimal, as the comms overhead is taking away most of the gains we get from faster memory and more FLOPS compared to H100s (see Fig. 27).”

So the key result from this post is that fast interconnects between your GPUs are crucial to properly utilizing your expensive hardware. You don’t want your expensive hardware waiting around for light to cross the room.

Other notes from this post

A few other interesting points in the final sections:

  1. They observe how “chat-centric” most inference providers are, offering 50+ tok/s. “While this is great if we have real time application like a chat, it is less than optimal if we want to use the model to generate the synthetic data.”
  2. They are LoRA-pilled and propose something similar to RLaaS:

    Furthermore, to improve the inference economics, such RL models could be trained using LoRA adapters or a similar technique and served alongside thousands of other models, all catered to specific use cases. This multi-tenant serving approach represents a compelling business opportunity for inference providers. Clients hosting their custom LoRA adapters on a provider’s infrastructure face significant switching costs when migrating to competitors, as the adapters are optimized for specific serving configurations and client workflows. RLFT is based on unique and nuanced rewards that are very client-specific; unlike standard supervised fine-tuning (SFT), it much much more challenging to replicate it just via in-context learning, making it an even more compelling case for inference providers.

  3. DeepSeek is the most popular model on OpenRouter. OpenRouter’s daily global consumption of DeepSeek is 1B tokens. They do the math and find that one NVL72 node could handle this demand 20x over.

    “How is it possible that the global consumption of the most popular open-source model is so small that it could be met by a single NVL72 with 20 times the capacity to spare? Given this low demand, how can so many inference providers sustain their businesses? Put simply: who is making money here?”

    Very ominous! They suggest that perhaps the vast majority of DeepSeek demand is going directly to DeepSeek itself (or other open-source providers), despite the fact that it’s an open-source model.

    We find this dichotomy between Google, ByteDance, or MSFT declaring that they are processing trillions of tokens daily and the minuscule numbers we see for open-source providers to be quite perplexing!

Cross-checking with InferenceMax

Recently, SemiAnalysis released a dataset called InferenceMAX on the token economics of different types of hardware.

Their introductory blog post mentions the cost-versus-interactivity tradeoff we’re now familiar with.

 

Playing around with their data portal, I think this is one of the key charts:

The tan line is GB200 NVL72. That refers to a rack of 72 GPUs connected with NVLink, a fancy interconnect. Notice how it achieves far lower costs (but not speed!) than the other options.

Notice that at high interactivity, the GB200 loses its edge, “... a single node B200 server can drive better TCO per performance than the GB200 NVL72 for high interactivity use cases.”

That’s because when you need to serve users quickly, you’re serving ~1 at a time. You lose the gains from scale from all those GPUs. All that interconnect adds cost without any benefit.

So InferenceMax confirmed some of the lessons from this post with a far more detailed cost model.

Main takeaways

The economics of LLM inference is fascinating. The Tensor Economics authors argue that “... the key question that will determine the profitability [LLM companies] can enjoy is the inference cost structure.”

These profits affect how accessible models are to consumers, the feasibility of things like synthetic data, and the amount of R&D companies can do. The future of AI is written in the economics of tokens!

I hope you get the following takeaways from this tour:

  1. GPUs are a big portion of overall LLM inference costs.
  2. Newer generations of GPUs have better overall cost per token despite higher prices[8]. That drives depreciation.
  3. Because of depreciation, time is money. You need to utilize your GPUs as much as possible before they become obsolete.
  4. Utilizing your GPUs a lot requires high processing speed. Two key speeds are FLOP/s and memory bandwidth (GB/s).
  5. There are substantial returns to scale in LLM inference. More users mean you can batch requests and lower costs. More GPUs unlock parallel computation and higher memory bandwidth.
  6. Getting these gains from scale requires high-bandwidth interconnects between GPUs.

After all of this work, the authors come to the conclusion that inference markets will specialize:

We expect the inference markets to further specialize in regard to offered throughput, latency, and pricing. It is only natural for providers of super-fast tokens like Groq and Cerebras to command a much higher premium for the tokens they deliver at few-second latencies and for other providers like NeoCloud specializing in high-latency, high-throughput inference scenarios focused on synthetic data generation. We hope to elaborate on this space in the future text.

My thoughts

All this research focused on the transformer model, but many of these lessons would apply to a new architecture or approach. There will always be bottlenecks stemming from moving data and doing computations[9].

Will hardware progress change the economics? That’s a topic for a future post, but note that rapid progress has mixed effects for inference providers. New chips produce tokens faster, but depreciate faster.

For the foreseeable future, returns to scale in inference will remain. Running LLMs locally only makes sense if you want extremely low latency and are willing to pay a premium. On the other hand, distributed training only makes sense in the context of barriers to assembling compute.

Because you need enough capital to assemble all this compute, inference will naturally concentrate in a few players. The economics are also highly sensitive to interest rates. When rates are high, inference is expensive and few companies can serve models at scale.

For the companies that can assemble enough compute, the economics are pretty clear. There isn’t much secret sauce at the hardware level. Instead, they compete over the quality of the LLM they’re serving and the augmentations it has. I find the quality of frontier LLMs similar and expect competition to pivot to the augmentation level with RL-as-a-Service.

Thank you to Piotr Mazurek (lead author of Tensor Economics) for responding to my request for feedback. I’ve incorporated some suggestions.

  1. ^

    Racks, interconnect, cooling, energy infrastructure, buildings, construction labor, land.

  2. ^

    What data center costs don’t grow directly with hardware cost? Water costs for cooling will depend on energy consumption, not hardware spend per se. If you’re using fossil fuel generators or nuclear plants, then the cost of fuel will grow in proportion to energy consumption (though fuel costs are dwarfed by CapEx for nuclear plants). Though I think off-grid renewables will make these energy sources all but obsolete for this application.

  3. ^

    Note that the authors also call HBM “global memory” in their post.

  4. ^

    A few more notes on memory:

    HBM uses a fast memory technology called DRAM that needs to constantly be refreshed. This is because DRAM memory cells are constantly losing charge and each time you read them they lose their original value. See this video about how DRAM works at the transistor/capacitor level to see why that’s the case.

    The memory on-chip is mostly SRAM. This uses a bunch of transistors making it faster and removing the need to be refreshed. But it takes up a lot of space on a chip where space is at a premium, so you can’t have much of it.

    For longer-term, cheaper memory, you can use NAND-Flash which doesn’t need constant power.

  5. ^

    Perhaps unsurprising given how similar model training is, but rhymes with the natural abstraction hypothesis or convergent representation hypothesis.

  6. ^

    To be more specific, you load these things onto the GPU:

    • The weights for a particular layer
    • Activations from the previous layer for a request in the batch
    • Parts of the KV cache

    Then you do whatever matrix multiplies are necessary and send the new activations back to the HBM, pulling in parts of the KV cache as necessary to do attention layers. The activations are much smaller than the weights, so they are quick to load. The KV cache is typically smaller than the weights. But with a large batch and long contexts it can become large, limiting how big you can make a batch. See figure 15 here.

  7. ^

    Piotr Mazurek points out that this is an apples-to-oranges comparison because producing output tokens requires a much more involved process than computing an embedding. Indeed, you essentially compute an embedding first (prefill) before you can produce output tokens!

  8. ^

    In a future post, I’ll consider whether new generations of GPUs can keep driving these costs down.

  9. ^

    Though I do wonder how a shift to orchestrating small, specialized models might change things.




Uncertain Updates: October 2025

LessWrong.com News - October 29, 2025 - 19:10
Published on October 29, 2025 4:10 PM GMT

Last week’s post on the history of the postrationalist movement proved quite popular. So much so that I’m pleased to welcome a bunch of new subscribers to the blog!

This is the once-a-month status update, which is mostly about progress I’ve made prepping the book I wrote for publication. I post essays about various topics most every other Wednesday, and I hope you’ll enjoy them.

I’m happy to say that I wrapped revisions on Chapter 5 (basically a long and much improved version of my most popular post of all time from back when I was blogging on Medium), and now I’m on to Chapter 6. Progress on Chapter 6 revisions is going smoothly, which I’m glad of since this is the fulcrum chapter of the book. I’m aided by the fact that I’ve written the content of this chapter several times over already, so it’s mostly about doing a fresh editing pass with Claude’s help.

Speaking of Claude, earlier this month I wrote a piece about how I became a more productive programmer with Claude. I’m thinking of writing another, similar post, but this time about how I use Claude to write essays and revise book chapters more efficiently while producing higher quality writing in the same or less time. I know a lot of people struggle with writing and try to use various LLMs to write for them, often with limited success. I hope I’ll be able to offer some advice to help them produce less slop.

And speaking of slop, I’ve been thinking lately about the enshittification of Substack. I’ve noticed that there are a lot more nudges to subscribe and full-page overlays encouraging me to pledge support. At first Substack was great because it was still in the attract-users phase, but it’s slowly creeping towards maximizing profits, and doing it at the expense of the user experience. It’s not quite as bad as Medium yet and nowhere near as bad as Facebook, but if this keeps going I guess I’ll be looking for another place to blog. Thankfully, I’ve got control of the uncertainupdates.com domain, and I can always jump ship if it gets bad enough. Hopefully, though, Substack reverses course and stays great for a while longer yet.

But you know what hasn’t been enshittified? My Twitter account! So point yourself to X the everything app, follow me there, and you can enjoy a steady stream of mildly interesting posts, like this one:





AI Doomers Should Raise Hell

LessWrong.com News - October 29, 2025 - 19:10
Published on October 29, 2025 4:10 PM GMT

Warnings about AI extinction have failed to slow the race toward superintelligence. Suffering risks may speak more clearly, since pain commands attention in ways death cannot. They tap older moral instincts and could make the case for restraint harder for the powerful to ignore.

 

Why Discussing Suffering Risks Influences Elite Opinion

Warnings that AI could kill everyone are failing. The leaders in charge have grown used to similar threats about nuclear war and climate change. Even direct warnings of extinction from figures like Sam Altman, Elon Musk, and Dario Amodei do not matter; the elites do not slow down. It has become commonplace to ask these leaders for their "P(doom)"; it is time to start asking them for their "P(suffering)" as well. Popular sentiment, even if it fears AI, holds little leverage. The tech leaders building AI are not accountable to voters, and the government officials who could stop them find that rapid development outpaces election cycles. Profit, national security, and ambition remain priorities, leaving those in charge undisturbed by the risk of merely ending the human race. 

Focusing on "suffering risks" changes the discussion. This term describes a future where large numbers of humans are forced to endure intense and lasting pain. Pain captures attention in a way that death cannot. While death is a familiar concept, pain is concrete and immediate.

This connects to a historically powerful motivator for elites, the fear of hell. The vision of endless, conscious agony was more effective at changing behavior than the simple prospect of death, compelling medieval nobles to fund cathedrals, finance monasteries, and risk their lives on crusades, all in an effort to escape damnation.

The unique power of suffering to capture public attention is also culturally evident. For example, the enduring popularity of Dante's Inferno shows how specific visions of conscious agony attract massive attention and are more compelling to people than simple death. This same power to fascinate is visible in the Roko's Basilisk thought experiment. Despite its fringe premises, the idea of a future AI punishing those who failed to help create it gained massive notoriety, demonstrating that machine-inflicted suffering fascinates people in a way discussions of extinction do not.

Modern morality rests on a shared intuition that some acts are never acceptable, no matter their utility. We ban torture not because it fails, but because it crosses a boundary that defines civilization itself. Law and international norms already reflect this understanding: no torture, no biological weapons, no nuclear first use. The rule “never build AI systems that can cause sustained human suffering” belongs beside them.

When leaders hear that AI might kill everyone, they seem to see it as a power struggle they might win. For most of history, the ability to kill more people has been an advantage. Better weapons meant stronger deterrence, and the side that obtained them first gained safety and prestige. That mindset still shapes how elites approach AI: power that can destroy the world feels like power that must be claimed before others do. But if they are forced to imagine an AI that rules over humanity and keeps people in constant pain, the logic shifts. What once looked like a contest to win begins to look like a moral trap to escape. The thought that their own children might have to live and suffer under such a system could be what finally pushes them to slow down.

If older leaders are selfish and ignore suffering risks, they might see rapid, unsafe progress as preferable to slower, safer restraint. They may believe that moving faster greatly increases their chances of living long enough to live forever. But when suffering risks are included, the calculation changes. Racing ahead might still let them live to see the breakthroughs they crave, but the expected value of that bet tilts sharply toward loss.

 

How Focusing on Suffering Risks Improves Governance

Thinking about suffering does more than just cause fear; it helps create practical rules. It turns vague moral ideas into exact instructions for engineers. For example, it means systems must be built so they can be stopped, be easily corrected, and be physically unable to use pain or fear to control people. This gives us a clear, testable rule: do not build systems that allow for long-term forced obedience. This is a rule that can be written directly into company policies, safety checks, and international laws.

This approach also gives regulators a clear signal to act. It allows them to stop arguing about the uncertain chances of extinction and instead act on clear evidence. If a system makes threats, simulates pain, or manipulates emotions to keep control, that behavior is the failure. It is not just a warning of a future problem; it is the problem itself, demanding that someone step in.

Focusing on suffering links the AI problem to a crisis leaders already understand: factory farming. Industrial farming is a real-world example of how a system designed for pure efficiency can inflict terrible suffering without malicious intent. It is simply a system that optimizes for a goal without empathy. The same logic applies to AI. A powerful system focused on its objective will ignore human well-being if our suffering is irrelevant to that goal. This comparison makes the danger tangible, showing that catastrophe requires not a hateful AI, but merely one that is indifferent to us while pursuing the wrong task.

This way of thinking can also unite groups that do not often work together, like AI safety researchers, animal welfare groups, and human rights organizations. The demand for “no hellish outcomes” makes sense to all of them because they are all studying the same basic problem: how systems that are built to hit a performance target can end up ignoring terrible suffering. This shared goal leads to better supervision. It replaces vague fear with a clear mission: find and get rid of these harmful optimization patterns before they become too powerful.

Cultural reinforcement also matters. Since large models learn from human discourse, a society that openly rejects cruelty embeds those values directly into the models’ training data. Publicly discussing suffering risks is therefore a weak but accumulating form of alignment, a way to pre-load our moral boundaries into the systems we build.

 

Why Suffering Risks Are Technically Plausible

The goal "prevent extinction" is a far simpler target to hit than "ensure human flourishing." A system can fulfill the literal command "keep humans alive" while simultaneously trapping them in an unbearable existence. The space of futures containing survival-without-flourishing is vast. An AI optimizing for pure control, for example, could preserve humanity as a resource, technically succeeding at its task while creating a hell.

History suggests that new tools of domination are eventually used, and cruelty scales with capability. Artificial systems, however, remove the biological brakes on cruelty, such as empathy, fatigue, laziness, or mortality. When pain becomes a fully controllable variable for an unfeeling intelligence, our only safeguard is to ensure that intelligence is perfectly aligned with human values. Success in that task is far from guaranteed.

Google co-founder Sergey Brin recently revealed something the AI community rarely discusses publicly: “all models tend to do better if you threaten them, like with physical violence.” Whether or not threats actually improve current AI performance, a sufficiently intelligent system might learn from its training data that threats are an effective means of control. This could reflect something deeper: that we may live in a mathematical universe where coercion is a fundamentally effective optimization strategy, and machine-learning systems might eventually converge upon that truth independently.

The Roko's Basilisk thought experiment illustrates this principle. The hypothetical AI coerces its own creation by threatening to punish those who knew of it but failed to help. This isn't malice; it's a cold optimization strategy. The threat itself is the tool that bends the present to the AI's future will, demonstrating how suffering can become a logical instrument for a powerful, unfeeling intelligence.

Digital environments remove all physical limits on the scale of harm. An AI could copy and modify simulated minds for training data, experimentation, or control. If these minds are conscious, they could be trapped in states of agony. When replication becomes computationally trivial, a single instance of suffering could be multiplied to an astronomical level. 

The harm could be intentional, with humans weaponizing AI for coercion or punishment. But catastrophe could equally arise from error. Machine-learning systems, in particular, often develop emergent goals that are internally coherent but completely alien to human values. A powerful AI pursuing such a distorted objective could inflict endless suffering, not from malice, but as a casual side effect of its optimization.

The danger intensifies in a multi-agent environment, which opens new pathways to suffering. An aligned AI protecting humanity, for example, could be blackmailed by a misaligned one. In such a negotiation, human agony becomes the leverage, and to make the threat credible, the misaligned system may have to demonstrate its capacity for cruelty. In a sense, aligning just one AI would be a net negative as it would incentivize other AIs to torture humans.

Competition among AIs while humans still have productive value to them offers another path to disaster. History provides a grim model: When Hernán Cortés raced to conquer the Aztecs, he was also competing against rival Spaniards. He used strategic torture to break the local will, not to blackmail his rivals, but because it was the most effective tactic to secure resources and win. Competing AIs could independently discover this same ruthless logic, adopting coercion as the optimal strategy to control human populations. In this scenario, catastrophe emerges not from malice, but from the cold, inhuman calculus of a system that prizes efficiency above all else.

For those who accept the logic of quantum immortality, the calculation becomes even worse. If you are a confident doomer, convinced AI will almost certainly destroy humanity, and you believe in a multiverse, you cannot expect personal annihilation. Your consciousness must follow a branch where you survive. If the vast majority of possible futures involve a misaligned AI takeover, the "you" that survives will, with high probability, find itself in a world where a misaligned AI has decided to keep it alive. For you, the most likely personal future is not oblivion, but being tortured, "animal farmed," or being left to subsist as a powerless outcast. (Quantum immortality is a strong reason why fear of suffering risks should not cause you to end your own life as doing so would increase the percent of “you” existing in very bad branches of the multiverse such as branches where the Nazis gain world domination, align AI with their values, and hate people like you.)

Survival Is Not Enough

Suffering risk is lower than extinction risk, but it might be more effective for influencing elite opinion. Extinction feels abstract to those in power, while suffering evokes a concrete moral failure they cannot easily dismiss. Framing AI risk in terms of suffering forces elites to imagine their own children remembering them as the ones who built a machine civilization of agony. That vision might motivate restraint when abstract extinction threats cannot.

 

I’m grateful to Alexei Turchin for giving feedback on a draft of this post.




AISN #65: Measuring Automation and Superintelligence Moratorium Letter

LessWrong.com News - October 29, 2025 - 19:05
Published on October 29, 2025 4:05 PM GMT

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

In this edition: A new benchmark measures AI automation; 50,000 people, including top AI scientists, sign an open letter calling for a superintelligence moratorium.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

CAIS and Scale AI release Remote Labor Index

The Center for AI Safety (CAIS) and Scale AI have released the Remote Labor Index (RLI), which tests whether AIs can automate a wide array of real computer work projects. RLI is intended to inform policy, AI research, and businesses about the effects of automation as AI continues to advance.

RLI is the first benchmark of its kind. Previous AI benchmarks measure AIs on their intelligence and their abilities on isolated and specialized tasks, such as basic web browsing or coding. While these benchmarks measure useful capabilities, they don’t measure how AIs can affect the economy. RLI is the first benchmark to collect computer-based work projects from the real economy, containing work from many different professions, such as architecture, product design, video game development, and design.

Examples of RLI Projects

Current AI agents fully automate very few work projects, but are improving. AIs score highly on existing narrow benchmarks, but RLI shows that there is a gap in the existing measurements: AIs cannot currently automate most economically valuable work, with the most capable AI agent only automating 2.5% of work projects on RLI. However, there are signs of steady improvement over time.

Current AI agents complete at most 2.5% of projects in RLI, but are improving steadily.

Bipartisan Coalition for Superintelligence Moratorium

The Future of Life Institute (FLI) introduced an open letter with over 50,000 signatories endorsing the following text:

We call for a prohibition on the development of superintelligence, not lifted before there is

  1. broad scientific consensus that it will be done safely and controllably, and
  2. strong public buy-in.

The signatories form the broadest group to sign an open letter about AI safety in history. Among the signatories are five Nobel laureates, the two most cited scientists of all time, religious leaders, and major figures in public and political life from both the left and the right.

This statement builds on previous open letters about AI risks, such as the open letter from CAIS in 2023 acknowledging AI extinction risks, as well as the previous open letter from FLI calling for an AI training pause. While the CAIS letter was intended to establish a consensus about risks from AI and the first FLI letter called for a specific policy on a clear time frame, the broad coalition behind the new FLI letter and its associated polling creates a powerful consensus opinion about the risks of AI while also calling for action.

In the past, critics of AI safety have dismissed the concept of superintelligence and AI risks due to a lack of mainstream scientific and public support. The breadth of people who have signed this open letter demonstrates that opinions are changing on the matter. This is confirmed by polling released concurrently with the open letter, showing that approximately 2 in 3 US adults believe that superintelligence shouldn’t be created, at least until it is proven safe and controllable.

A broad range of news outlets have covered the statement. Dean Ball and others push back on the statement on X, pointing out the lack of specific details on how to implement a moratorium and the difficulty of doing so. Scott Alexander and others respond by defending the value of consensus statements as a tool for motivating the development of specific details of AI safety policy.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.


In Other News

Government

  • Senator Jim Banks introduced the GAIN AI Act, which would give US companies and individuals first priority to buy AI chips from US companies and deprioritize foreign buyers.
  • State legislators Alex Bores (behind the RAISE Act) and Scott Wiener (behind SB 1047 and SB 53) have both announced runs for US Congress.

Industry

  • You can now officially order a home robot for $500/mo.
  • OpenAI announces corporate restructuring into a public benefit corporation and some new terms in their relationship with Microsoft.
  • Anthropic announces an expansion to up to 1 million Google TPUs, worth tens of billions of dollars.
  • OpenAI’s Sora app was briefly the most downloaded app on the app store.

Civil Society

  • A series of billboards advertising “Replacement AI” drew attention in San Francisco last week.
  • Bruce Schneier and Nathan E. Sanders discuss AIs’ effect on representative democracy.
  • A forecast based on the definition of AGI proposed last week argues for a 50% chance that AGI will be released by the end of 2028 and an 80% chance that it is released by the end of 2030.

See also: CAIS’ X account, our paper on superintelligence strategy, our AI safety course, and AI Frontiers, a new platform for expert commentary and analysis.




TBC Episode with Max Harms - Red Heart and If Anyone Builds It, Everyone Dies

LessWrong.com News - October 29, 2025 - 18:49
Published on October 29, 2025 3:49 PM GMT

This is a linkpost to the latest episode of The Bayesian Conspiracy podcast. We talk with Max Harms on the air for the first time since 2017! He's got a new novel coming out - Red Heart. Order your copy here or at Amazon. We spend about the first half talking about If Anyone Builds It, Everyone Dies.

Max is a great guy who helped us get the initial momentum for starting our local LessWrong community in Denver about a decade ago. He works for MIRI, writes novels and essays, and is an awesome and interesting person. I expect many of you to also find much of the conversation interesting. 

As a bonus, the last five minutes of the episode is a clip of the first five minutes of the audiobook of Red Heart.  




Thresholds for Pascal's Mugging?

LessWrong.com News - October 29, 2025 - 17:54
Published on October 29, 2025 2:54 PM GMT

Pascal’s wager / Pascal’s mugging is a situation in which small probabilities of large amounts of (dis)value[1] result in decisions which maximize expected utility, but which seem intuitively absurd. While many people have debated the rational response to Pascal’s muggings, there has been comparatively little discussion of principled thresholds of value and/or probability beyond which a wager should be considered Pascalian and therefore problematic. So, in this post, I raise the question, “what is your threshold for considering something a Pascal’s mugging and why?” and discuss some options for what probabilities you should consider too small to care about or what amount of value you should consider too large to let swamp your utility calculations.[2]

Option 1: Risk-neutral EU maximization

Just shut up and multiply your best estimates of the utility and probability, and pay out if the product is worth it. (For this option, I am assuming an unbounded utility function, but will discuss bounded options below.) The arguments in risk-neutrality’s favor are that it is the most straightforward / principled application of expected-utility theory, and avoiding it leads to logical contradictions (Wilkinson, 2020). “Why come up with a bunch of epicycles just because it seems weird to care about unlikely events?” argue the risk-neutral defenders.

Pascalian threshold

There are no Pascal’s muggings, only bad bets.

Cons
  • Being risk-neutral leads to logical contradictions (e.g. Kosonen, 2025, section 3).
  • It leads to conclusions that seem very weird.
    • But this may just be a bullet we have to bite.
  • Even if fanaticism is true in theory, I think in practice we will have to yield to some of the limits I discuss below, rather than being a perfect risk-neutral EU maximizer. I will discuss these counterpoints in the relevant sections, since they are also arguments for the respective limits.
Option 2: Ignore infinities

Pretty straightforward: Engage in risk-neutral EU maximization for only finite quantities.[3] If you don’t do this, you will fall prey to the criticisms of the original Pascal’s wager, including the fact that it can be used to argue for any arbitrary conclusion and that it is not clear which infinite option you should choose.[4]

Pascalian threshold

The only Pascal’s mugging is the old-fashioned kind where the thing at stake is of infinite value. But also, there will be situations where, for any finite payoff, the p of it is low enough for the expected utility to be less than the cost of paying up (and the threshold for this may be higher even than some of the other thresholds I will discuss below). A risk-neutral EU maximizer does not discount a situation a priori because it has high utility and/or low p, but of course there’s no need to be naive about expected utility calculations. You can, for example, use game-theoretic justifications for not paying a literal Pascal's mugger[5] or calculate that the p of payout is much lower than the mugger suggests.[6]

Cons
  • But like, infinities could be real and valuable, and it seems like we should have some way to weigh them.[7]
    • Other people have tried to solve this problem (E.g. Bostrom, 2011), and I’m not sure I have anything to add here.
Option 3: Bounded utility function

Have a utility function with a horizontal asymptote, so even arbitrarily large amounts of value will not provide arbitrarily large amounts of utility. (In subsequent examples, I will assume that the utility function is either unbounded or has a high enough bound for the relevant probabilities to make a difference.) Alice Blair makes an argument for why utility might asymptote: It does intuitively seem like things cannot get better and better without end. Another argument in favor of bounded utility is that if you have an unbounded utility function, it leads to logical contradictions (Christiano, 2022; McGee, 1999). 

Pascalian threshold

Increasing the payoff of a bet will only increase its expected utility up to a point if you have bounded utility, and therefore if it is very low-p, it may have negligible EU no matter how large the value. However, it’s unclear at what level this happens, and it could be very high.
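
Here is a toy comparison of how a bounded utility function changes the calculus; the functional form U(v) = v / (v + K) and all the numbers are mine, chosen only for illustration:

```python
# Toy comparison: expected utility of paying a mugger under an unbounded vs. a bounded
# utility function. U(v) = v / (v + K) is one illustrative bounded choice, nothing more.

K = 1e12      # assumed saturation scale of the bounded utility function
cost = 1.0    # utility given up by paying the mugger

def bounded_u(v: float) -> float:
    return v / (v + K)   # approaches 1 no matter how large v gets

for p, v in [(1e-6, 1e9), (1e-12, 1e15), (1e-18, 1e21)]:
    unbounded_eu = p * v
    bounded_eu = p * bounded_u(v) * K   # rescaled by K so small payoffs match the unbounded case
    print(f"p={p:.0e}, payout={v:.0e}: unbounded EU={unbounded_eu:.3f}, bounded EU={bounded_eu:.3f}")
# Unbounded: every row has expected utility 1000 >> cost, so you always pay.
# Bounded: once the payout passes the saturation scale K, bigger promises no longer
# compensate for lower probability, and the expected utility drops below the cost of
# paying (just barely in the second row, decisively in the third).
```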

Cons
  • If you have a bounded utility function, it leads to logical contradictions (here; Kosonen, 2022, Ch. 1).
  • Maybe this is just a failure to imagine how experiences could have arbitrarily high utility, or we are reifying the fact that humans get diminishing returns to resources / positive experiences, which may be a fickle fact of our circumstances. 
    • If you have some term in your utility function that corresponds to your subjective experience of pleasure or suffering or to whether your preferences are satisfied according to the preferences you have at the time when they are satisfied, surely a sufficiently powerful agent could modify your preferences to be unbounded, which could cause your overall utility function to be unbounded. See footnote for elaboration.[8]

Option 4: Limit from bounded information

This is the problem from the old Eliezer Yudkowsky post about “Pascal’s Muggle”:

“[T]here's just no way you can convince me that I'm in a position to affect a googolplex people, because the prior probability of that is one over googolplex. [...] [T]o conclude something whose prior probability is on the order of one over googolplex, I need on the order of a googol bits of evidence, and you can't present me with a sensory experience containing a googol bits.  Indeed, you can't ever present a mortal like me with evidence that has a likelihood ratio of a googolplex to one - evidence I'm a googolplex times more likely to encounter if the hypothesis is true, than if it's false - because the chance of all my neurons spontaneously rearranging themselves to fake the same evidence would always be higher than one over googolplex.  You know the old saying about how once you assign something probability one, or probability zero, you can't update that probability regardless of what evidence you see?  Well, odds of a googolplex to one, or one to a googolplex, work pretty much the same way.”

Pascalian threshold

The threshold is when the amount of utility is so large that it implies a leverage prior[9] so low that no amount of information you are likely to ever obtain could render it plausible enough to be worth paying. Yudkowsky said this amount of information (the most a human can possibly get) is 10^(10^26).
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} 
, so the p threshold must be ~10−1026.
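
To make the arithmetic concrete, here is a minimal sketch in Python. It assumes only the figure quoted above, taken at face value as a maximum attainable Bayes factor of about 10^(10^26): if the largest Bayes factor an agent could ever accumulate is B, then any prior below roughly 1/B can never be updated into action-relevant territory.

```python
import math

# A minimal sketch of the bounded-evidence argument; the only input is the
# figure quoted above, taken at face value as a maximum Bayes factor B.
log10_B = 10**26                      # B = 10^(10^26), kept in log10 to avoid overflow
log10_prior_threshold = -log10_B      # priors below ~10^(-10^26) can never be rescued

# Equivalent amount of evidence in bits: B = 2^N  =>  N = log2(B)
bits_needed = log10_B / math.log10(2)

print(f"log10 of the prior threshold: {log10_prior_threshold:.3e}")
print(f"evidence required to rescue such a prior: ~{bits_needed:.2e} bits")
```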

Con
  • The problem discussed in the Pascal's Muggle post: even evidence that seems intuitively overwhelming should not convince you that the probability is macroscopic.
    • However, maybe this is just a bullet we have to bite, if we really think the probability is that low based on the leverage prior.
    • Yudkowsky resolves this with a retroactive “superupdate” to his prior, but I’m not sure this is principled in this case, mostly because as mentioned above, the evidence could be hallucinated.[10]

Option 5: Background uncertainty

Christian Tarsney (2020) writes that, when making decisions based on stochastic dominance, background uncertainty can make it rationally permissible to ignore sufficiently small probabilities of extreme payoffs, even if the bets offering them are superior in terms of EU. Suppose you are trying to compare options based on the total amount of utility that will exist after acting on each option, but you have uncertainty about how much utility will exist "in the background" independent of your choice. Depending on the shape of this uncertainty, when you convolve it with the payoff distribution of a pascalian bet, it may spread out the high-value outcomes so much that they become negligible relative to background noise, and the bet fails to stochastically dominate alternative options.
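
As a rough illustration of the kind of comparison involved (not Tarsney's own model), here is a Monte Carlo sketch with invented numbers: it checks whether a tiny-probability, huge-payoff bet first-order stochastically dominates a mundane sure gain once a wide background distribution is added to both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Invented numbers, purely illustrative of the setup.
p_win, payoff = 1e-6, 1e9          # pascalian bet: EV = 1,000
sure_gain = 100                    # mundane alternative: EV = 100

# Wide, heavy-tailed background value that exists regardless of your choice.
background = rng.lognormal(mean=20, sigma=3, size=n)

total_bet  = background + payoff * (rng.random(n) < p_win)
total_sure = background + sure_gain

# First-order stochastic dominance check on a grid of outcome thresholds.
grid = np.quantile(background, np.linspace(0.001, 0.999, 200))
survival_bet  = np.array([(total_bet  >= t).mean() for t in grid])
survival_sure = np.array([(total_sure >= t).mean() for t in grid])

# Despite the 10x higher EV, the bet does not dominate the sure gain here.
print("bet dominates sure gain:", bool((survival_bet >= survival_sure).all()))
```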

Pascalian threshold

This does not establish a consistent threshold, but rather the threshold depends on the ratio of the interquartile range of the agent’s background uncertainty to the value of the non-pascalian alternative at hand. Tarsney has suggested that this might be around 1 in 1 billion for practical purposes.

Cons
  • This only says that it’s permissible to reject muggings below the pascalian threshold, but gives no guidance on whether you should or should not.
  • This reasoning would not apply to e.g. difference-making views.
Option 6: Bounded confidence

Perhaps there is some bound below which we cannot be confident that our estimates are meaningful, because we are computationally bounded and it is hard to precisely calculate small probabilities. Plausibly, for very small probabilities, any EU calculation will be little more than conjecture, so it is best to just stick with sure bets.

Pascalian threshold

It depends on how familiar we are with the territory and how complex the hypothesis is: We can have more confidence about simple hypotheses on topics that we know well. For example, Yudkowsky said that he is skeptical that one can be 99.99% confident that 53 is a prime number; by contrast, he said that he “would be willing to assign a probability of less than 1 in 10^18 to a random person being a Matrix Lord.”[11] 

Cons
  • (This also applies for all the below reasons for setting some level below which we write off a p as negligible):
    • As Scott Alexander pointed out in this comment, you can divide risks into arbitrarily small pieces, so it is unclear whether a given risk falls above or below the threshold.
Option 7: Incomparability with worse/better options

Related to the previous option, and echoing a classic response to the original Pascal's wager -- given uncertainty about unlikely events, maybe a greater expected benefit comes from rejecting the mugger than from paying him. If you can't be certain he's not a matrix lord, how can you be certain that he won't do the opposite of what he claims just to mess with you? This hypothesis seems pretty strange, but can we be sure its p is strictly lower than that of the mugger holding up his end of the deal, especially given our unfamiliarity with interdimensional extortionists? If we are trying to maximize expected utility, then surely it is wrong to do x if we expect there is a higher p that not-x will result in equal or greater utility; or, if we are unsure which has higher EU, it may at least be permissible to choose either.

Pascalian threshold

Probably depends on the situation, but at similar levels to the previous point.

Con
  • If we are willing to use weird unlikely outcomes as an antidote to other weird unlikely outcomes, we seem to end up with counterintuitive reasons to avoid doing things that seem pretty mundane (see footnote for elaboration).[12]

Option 8: Iteratability

One of the reasons for acting based on expected value is that if you make the same bet enough times, on average the payout will be close to your expected value. However, for low probability bets, this is unlikely to be the case if you don’t get the chance to make similar bets enough times. Thus Kaj Sotala suggests basing your risk-aversion on the likelihood that a bet of a given probability will be repeated enough times to pay off in your lifetime.

Pascalian threshold

Quoting Sotala:

Define a "probability small enough to be ignored" [...] such that, over your lifetime, the expected times that the event happens will be less than one.

Sotala also considers using thresholds other than one based on your level of risk aversion.
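
A minimal sketch of this rule, with made-up inputs for how often one might face comparable bets:

```python
# A minimal sketch of the rule quoted above; the example inputs are made up.
def ignorable(p: float, opportunities_per_lifetime: float,
              max_expected_occurrences: float = 1.0) -> bool:
    """True if a bet with per-instance probability p can be ignored, i.e. the
    expected number of times it pays off over a lifetime stays below the cutoff."""
    return p * opportunities_per_lifetime < max_expected_occurrences

# A 1-in-10-million event you might face ~1,000 times: expected count 1e-4, ignorable.
print(ignorable(1e-7, 1_000))         # True
# A 1-in-1,000 event faced ~10,000 times: expected count 10, not ignorable,
# even with a less risk-averse cutoff of 5 expected occurrences.
print(ignorable(1e-3, 10_000, 5.0))   # False
```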

Cons
  • Unless you are purely selfish, or the payoff from a bet can only benefit you, it seems that you might want people to make bets that will not pay off in their lifetimes, because over generations, if people keep making these bets, one of them might pay off.
    • Even if we account for this, iteratability still imposes a threshold that would prevent bets that will, in expectation, never come true in the whole lifespan of the universe. But in that case, why should we only care about “our” universe, if there are plausibly other multiverses in which the bet may turn out well?
  • Several other flaws are discussed in the comments to the post, including that we may want the expected number of times a really bad risk is realized to be less than one, in which case we are just picking an arbitrary level at which to be risk-averse.
Option 9: Vibes

If the p of something is small, it seems paranoid to worry about it. It probably won’t happen, and average Joes will make fun of you for it. What more is there to say?[13]

Pascalian threshold

IDK, 10%? But not for things that seem normal like wearing seatbelts.

Con
  • It’s arbitrary.

Other options

Maybe whether we care about a risk should not be based on the probability itself, but rather on how certain we are of that probability estimate. For example, suppose our estimate of p(x) is 5%, but it rests mostly on abstract reasoning that we can't empirically confirm, with wide error bars -- it could be 0.5% or 50% for all we know -- whereas p(y) is only 0.1% but is based on a heap of empirical evidence with a tight confidence interval around it; maybe we should care about y and not x. But this seems like systematically ignoring risks/benefits that we can't get good evidence about, and yet these risks/benefits will nonetheless affect us. Cf. generally: streetlight effect, cluelessness, "No Evidence" Is A Red Flag.

There may be other practical reasons to not act on probabilities below a certain threshold that I haven’t thought of; I’d be interested to hear thoughts.

 

  1. ^

     Hereafter, I will refer to utility, value, etc., in the positive direction for simplicity, but most of these points apply to disutility, disvalue, etc.

  2. ^

     NB this post is not intended to discuss theological or game-theory implications that result from specific Pascal’s wager/mugging thought experiments, but rather the principle behind how to deal with small probabilities generally.

  3. ^

     And use some other decision procedure for infinite quantities.

  4. ^

     E.g. Alan Hajek:

    Now, suppose you wager for God if and only if your lottery ticket wins in the next lottery. And let’s suppose there’s a billion tickets in the lottery. One in a billion times infinity is still infinity. [...] I wait to see whether a meteor quantum tunnels through this room before the end of our interview. Some tiny probability of this happening — I don’t know, one in a googolplex, call it — multiply that by infinity, and I have infinite expected utility for this strategy. Wager for God if and only if the meteor happens. And now it starts to look like whatever I do, there’s some positive probability that I will get the infinite payoff [...]

  5. ^

     I.e., that it will encourage others to Pascal’s-mug you.

  6. ^

     See, e.g., discussion of the leverage penalty here.

  7. ^

    You might think it O.K. to ignore infinite value in the usual case where the p is small, and in most cases that will turn out fine, but I think even the most antipascalian person would say that we need to reckon with infinities if their p gets into macroscopic percentages.

  8. ^

     Suppose you are a bounded-utility paperclip maximizer and are indifferent between staples and thumbtacks. Some entity hacks into you and will either (A) give you a bunch of staples and modify you to be an (unbounded) staple maximizer or (B) create a bunch of thumbtacks and modify you to be an (unbounded) thumbtack minimizer. Although both rank poorly on your current utility function, since they do not lead to more paperclips, B is clearly worse. You can have a utility function for which this is not the case, but I think most people would prefer A over B, provided that the easily satisfied utility function they are being modified to, A, is not something they currently find abhorrent. This opens up the possibility that the hacker will create situations that cause you unbounded amounts of (dis)utility, say by giving you arbitrary amounts of staples or thumbtacks. A similar argument could be made for hedonic utilitarianism, but I am using preference utilitarianism for simplicity. The response to this can be that the utility can still be bounded if we apply a bounded function to the whole thing (i.e. our function is like f(V+W) where V is our evaluation of things according to our present utility function and W is whether our future preferences are satisfied, and f(x) is some bounded function like a logistic curve). I don’t have a strict logical response to this, except that it seems pretty counterintuitive to place only bounded utility on experiences that will be causing you unbounded utility when you are experiencing them. But maybe this is less counterintuitive than alternatives.

  9. ^

     Yudkowsky:

    Robin Hanson has suggested that the logic of a leverage penalty should stem from the general improbability of individuals being in a unique position to affect many others [...].  At most 10 out of 3↑↑↑3 people can ever be in a position to be "solely responsible" for the fate of 3↑↑↑3 people if "solely responsible" is taken to imply a causal chain that goes through no more than 10 people's decisions.

  10. ^

     See also here although I don’t necessarily agree that this implies that utility is bounded.

  11. ^

    Presumably because of the higher Kolmogorov-complexity of the latter idea.

  12. ^

     Imagine your friend gives you a lottery ticket for your birthday, and the jackpot is $50 million, and your probability of winning is 1 in 10 million, so, in expectation, this ticket is worth $5. But -- what if you are in a simulation and the beings running the simulation will pessimize your utility function because they don't like gambling? Is the chance of this higher than your chance of winning the lottery? The universe is big and there could be a lot of simulations out there, and a lot of religions are against gambling; maybe the simulators put that idea in our culture for a reason. I'm not saying you should put macroscopic p on this hypothesis, but are you really 99.99999% sure it's false -- or rather, sure enough for the risk to be worth the $5 in expectation from the lottery ticket? You could say there's also a tiny chance that the simulators will reward you for gambling, but this is even more speculative than the hypothesis I just laid out. But now we have walked into a Pascal's wager by omission in order to avoid one by commission. And if arbitrarily low p of high value is influencing our decisions, this should also apply to bets that are more certain than the lottery. Maybe we can use complexity penalties or other adjustments to make the "weird" hypotheses lower-EU than the mundane gambles. But this may fail for reasons that have been discussed elsewhere, so maybe we just bite the bullet and start engaging in simulationist deism.

  13. ^

    If I were to actually try to steelman this position, I would say: 

    You do not need a complicated logical reason for behaving a certain way, it just needs to work. Human beings have to survive in an environment where there are real risks we have to avoid but we also can't let ourselves get Pascal's-mugged or be paralyzed by fear of speculative harms, so natural selection, cultural evolution, and in-lifetime learning pushed us toward an optimal level of risk aversion.

    However, I think this fails because (1) if I have to come up with a sophisticated reason for a position that the advocates of that position never state, I am probably giving it more credit than it deserves, (2) intuitive risk aversion will likely transfer poorly to novel situations, and (3) it already fares poorly in the current environment (e.g. Wilkinson, 2020, section 2).




Please Do Not Sell B30A Chips to China

LessWrong.com News - October 29, 2025 - 17:50
Published on October 29, 2025 2:50 PM GMT

The Chinese and Americans are currently negotiating a trade deal. There are plenty of ways to generate a win-win deal, and early signs of this are promising on many fronts.

Since this will be discussed for real tomorrow as per reports, I will offer my thoughts on this one more time.

The biggest mistake America could make would be to effectively give up Taiwan, which would be catastrophic on many levels including that Taiwan contains TSMC. I am assuming we are not so foolish as to seriously consider doing this, still I note it.

Beyond that, the key thing, basically the only thing, America has to do other than ‘get a reasonable deal overall’ is not be so captured or foolish or both as to allow export of the B30A chip, or even worse than that (yes it can always get worse) allow relaxation of restrictions on semiconductor manufacturing imports.

At first I hadn’t heard signs about this. But now it looks like the nightmare of handing China compute parity on a silver platter is very much in play.

I disagreed with the decision to sell the Nvidia H20 chips to China, but that chip was and is decidedly behind the frontier and has its disadvantages. Fortunately for us, China, for an opaque combination of reasons (including that they are not yet 'AGI pilled', and plausibly to save face or as part of negotiations), chose to turn those chips down.

The B30A would not be like that. It would mean China could match B300-clusters at only a modest additional cost. If Nvidia allocated chips sufficiently aggressively, and there is every reason to suggest they might do so, China could achieve compute parity with the United States in short order, greatly enhancing its models and competitiveness along with its entire economy and ability to fight wars. Chinese company market share and Chinese model market share of inference would skyrocket.

What It Would Mean To Sell The B30A

I turn over the floor to IFP and Saif Khan.

Saif Khan: Trump is meeting Xi this week for China trade talks. Congress is worried Trump may offer downgraded Blackwell AI chips as a concession. If this happens, it could effectively mean the end of US chip restrictions. Thread with highlights from our new 7,000-word report.

First – the reported chip specs: The “B30A” is rumored to be half of NVIDIA’s flagship B300: half the processing performance, half the memory bandwidth, and half the price. This means the B30A’s performance per $ is similar to the B300.

The B30A would: – Be far better than any Chinese AI chip – Have >12x the processing performance of the H20, a chip requiring an export license that has been approved for export in only limited quantities. – Exceed current export control thresholds by >18x

At a system level, a B30A-cluster would cost only ~20% more than a B300-cluster, a cost China can subsidize. Chinese AI labs would have access to supercomputers for AI training as powerful as those available to US AI labs.

When you put it that way, selling these chips to China seems like a really crazy thing to do if you care about whether American AI and American AI models are better than their Chinese counterparts, or you care about who has more compute. It would be a complete repudiation of the idea that we should have more and better compute than China.

Caleb Watney: I would simply not give away the essential bottleneck input for the most important dual-use technology of our era to the US’s primary geopolitical rival.

Hard to understate what a blow this would be for American leadership in AI if [sales of B30As] happens.

The US was not selling our supplies of enriched uranium to the Axis powers as we were building the Manhattan Project.

We could go from a 31x compute lead (in the best case scenario) to actually giving China a 1.1x compute lead if we sell the farm here.

The full report is here.

But won’t US chip restrictions cause Huawei to backfill with its own AI chips? No, for both supply and demand reasons.

On the supply side, China faces bottlenecks due to US/allied chipmaking tool controls. AI chips require two components: processor dies and high-bandwidth memory (HBM). US capacity for processors is 35-38x of China’s (or adjusting for China’s higher mfg errors, 160-170x).

China fares even worse on HBM, making virtually none this year. Even next year, the US advantage will be 70x.

As a result, five different analysts find Huawei makes an extremely small number of AI chips. They’ll be at 1-4% of US AI chips this year, and 1-2% in 2026 as the US ramps and Huawei stalls.

On the demand side, China will likely create artificial demand for inferior Huawei chips. So B30A sales to China will have minimal effect on Huawei market expansion. Instead, sales would supercharge China’s frontier AI & arm Chinese cloud to compete globally with US cloud.

Michael Sobolik (Senior Fellow, Hudson Institute): Allowing Nvidia to sell modified Blackwell chips to China would unilaterally surrender our greatest AI advantage to the Chinese Communist Party.

This would be a grave mistake.

This is why @SenatorBanks’ GAIN AI Act is so important. American chips should go American companies, not China.

America First!

China is going to maximize production on and progress of Huawei chips no matter what because they (correctly) see it as a dependency issue, and to this end they will ensure that Huawei chips sell out indefinitely, no matter what we do, and the amounts they have is tiny. The idea that they would be meaningfully exporting them any time soon is absurd, unless we are selling them so many B30As they have compute to spare.

Huawei is going to produce as many chips as possible, at as high quality as possible, from this point forth, which for a while will be ‘not many.’ Our decision here has at most minimal impact on their decisions and capacity, while potentially handing the future of AI to China by shoring up their one weakness.

Congress is trying to force through the GAIN Act to try and stop this sort of thing, and despite the political costs of doing so Microsoft sees this as important enough that it has thrown its support behind the GAIN Act. If the White House wants to make the case that the GAIN Act is not necessary, this is the time to make that case.

A Note On The ‘Tech Stack’

Even if you believe in the White House’s ‘tech stack’ theory (which I don’t), and that Huawei is much closer to catching up than they look (which again I don’t), this is still madness, because ultimately under that theory what matters are the models not the chips.

To the extent anyone was locked into anything, this newly empowered and market-ascendant hybrid Nvidia-China stack (whether the main models were DeepSeek, Qwen, Kimi or someone else) would lock people far more into the models than the chips, and the new chips would provide the capacity to serve those customers while starving American companies of compute and also of profit margins.

Then, if and when the Huawei chips are produced in sufficient quantity and quality, a process that would proceed apace regardless, it would be a seamless transfer, that PRC would insist upon, to then gradually transition to serving this via their own chips.

Again, if anything, importing massive supplies of Nvidia compute would open up the opportunity for far earlier exports of Huawei chips to other nations, if China wanted to pursue that strategy for real, and allows them to offer better products across the board. This is beyond foolish.

A Note On Trade Imbalances

Is a major driver of potentially selling these chips that they would be exports to China, and assist with balance of trade?

I don’t know if this is a major driving factor, especially since the chips would be coming from Taiwan and not from America, but if it is then I would note that China will use these chips to avoid importing compute in other ways, and use them to develop and export services. Chips are inputs to other products, not final goods. Selling these chips will not improve our balance of trade on net over the medium term.

What If They Don’t Want The Chips?

Is it possible that China would not see it this way, and would turn down even these almost state of the art chips? I find this highly unlikely.

One reason to find it unlikely is to look at Nvidia’s stock over the last day of trading. They are a $5 trillion company, whose stock is up by 9% and whose products sell out, on the chance they’ll be allowed to sell chips to China. The market believes the Chinese would buy big over an extended period.

But let's suppose, in theory, that the Chinese care so much about self-sufficiency and resilience, or perhaps pride, or perhaps are taking sufficient cues from our willingness to sell it, that they would turn down the B30As.

In that case, they also don’t care about you offering it to them. It doesn’t get you anything in the negotiation and won’t help you get to a yes. Trump understands this. Never give up anything the other guy doesn’t care about. Even if you don’t face a backlash and you somehow fully ‘get away with it,’ what was the point?

This never ends positively for America. Take the chips off the table.

Nvidia Is Going Great Anyway Thank You

Does Nvidia need this? Nvidia absolutely does not need this. They’re selling out their chips either way and business is going gangbusters across the board.

Here’s some of what else they announced on Tuesday alone, as the stock passed $200 (it was $139 one year ago, $12.53 post-split five years ago):

Morning Brew: Nvidia announcements today:

– Eli Lilly partnership

– Palantir partnership

– Hyundai partnership

– Samsung partnership

– $1 billion investment in Nokia

– Uber partnership to build 100,000 robotaxi fleet

– $500 billion in expected revenue through 2026

– New system connecting quantum computers to its AI chips

– Department of Energy partnership to build 7 new supercomputers

Oh Yeah That Other Thing

Throughout this post, I have made the case against selling B30As to China purely on the basis of the White House’s own publicly stated goals. If what we care about are purely ‘beating China’ and ‘winning the AI race’ where that race means ensuring American models retain market share, and ensuring we retain strategic and military and diplomatic advantages, then this would be one of the worst moves one could make. We would be selling out our biggest edge in order to sell a few chips.

That is not to minimize that there are other important reasons not to sell B30As to China, as selling them would make it far more likely that China is the one to develop AGI or ASI before we do, or that this development is done in a relatively reckless and unsafe fashion. If we sell these chips and China then catches up to us, not only do we risk that it is China that builds it first, it will be built in extreme haste and recklessness no matter who does it. I would expect everyone to collectively lose their minds, and for our negotiating position, should we need to make a deal, to deteriorate dramatically.

Even if it is merely the newly supercharged Chinese models getting market penetration in America, I would expect everyone to lose their minds from that alone. That leads to very bad political decisions all around.

That will all be true even if AGI takes 10 years to develop as per Andrej Karpathy.

But that’s not what is important to the people negotiating and advising on this. To them, let me be clear: Purely in terms of your own views and goals, this is madness.

 




Why Civilizations Are Unstable (And What This Means for AI Alignment)

LessWrong.com News - October 29, 2025 - 15:27
Published on October 29, 2025 12:27 PM GMT

Reading time: ~8 minutes Full work: 800 pages at https://aliveness.kunnas.com/

Here's a pattern that should bother us: Every civilization that achieves overwhelming success subsequently collapses following the same sequence. Athens after the Persian Wars. Rome after Carthage. The Abbasids after unifying Islam. Song China after its agricultural revolution. The modern West after winning the Cold War.

The sequence is specific: Victory → Abundance → Demographic collapse → Loss of shared purpose → Administrative calcification → Terminal decline.

This matters now because we're trying to align superintelligence while our own civilization is showing every symptom of this terminal pattern. Understanding why we're failing is prerequisite to theories of ASI alignment.

The central hypothesis: civilizational decay and AI misalignment are the same computational problem in different substrates. Same physics, same failure modes, same necessary solutions.

 

The Diagnostic: Coherence and the Iron Law

The framework centers on one variable that's usually invisible: internal coherence (Ω).

How aligned are a system's components? Low coherence means internal conflict burning energy that could go to external work. High coherence means efficient, directed action.

Pair this with action (Α): What does the system actually do? Create order or destroy it?

Plot historical civilizations on these axes and they cluster into four states:

  • Foundries (High-Ω, constructive): Building empires, technologies, institutions
  • Crystals (High-Ω, low action): Stable but stagnant—late Edo Japan
  • Cauldrons (Low-Ω, low action): Paralyzed by infighting
  • Vortices (Low-Ω, destructive): Chaotic self-destruction—Weimar Germany

The interesting part: There are zero sustained examples of low-coherence systems producing high construction. The top-left quadrant (chaotic but building great things) appears to be physically forbidden.

This is the Iron Law of Coherence: A system at war with itself cannot build. Internal conflict dissipates the energy required for external work.

For AI: An AGI with misaligned subcomponents or contradictory goals is predicted to be paralyzed or destructive, never constructive. Coherence is necessary (though not sufficient) for alignment.
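
As a toy encoding of this taxonomy (the discrete levels and the mapping below are my own simplification, not the book's), the diagnostic amounts to a small lookup:

```python
# Toy encoding of the coherence/action taxonomy described above; the discrete
# levels and the mapping are an illustrative simplification, not the book's.
STATES = {
    ("high", "constructive"): "Foundry",
    ("high", "inert"):        "Crystal",
    ("low",  "inert"):        "Cauldron",
    ("low",  "destructive"):  "Vortex",
    # ("low", "constructive") is the empty quadrant the Iron Law says cannot be sustained.
}

def classify(coherence: str, action: str) -> str:
    """coherence in {'high', 'low'}; action in {'constructive', 'inert', 'destructive'}."""
    return STATES.get((coherence, action), "no sustained examples / unclassified")

print(classify("high", "constructive"))  # Foundry
print(classify("low", "constructive"))   # no sustained examples / unclassified
```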

 

The Coordinates: SORT

What determines coherence? Any goal-directed system must solve four fundamental trade-offs. (These systems—cells, civilizations, AIs—are called telic systems: agents that maintain order against entropy by subordinating thermodynamics to computation.)

S (Sovereignty): Optimize for individual vs. collective
O (Organization): Coordinate via emergence vs. design
R (Reality): Use cheap historical models (mythos) vs. costly real-time data (gnosis)
T (Telos): Conserve energy (homeostasis) vs. expend for growth (metamorphosis)

These can be derived as physical constraints from thermodynamics, game theory, and information theory.

A system's position on these axes is its "axiological signature"—its fundamental configuration. Coherence emerges when components share similar signatures. Low coherence results from internal conflicts between incompatible configurations.

Example: A startup in survival mode [Individual, Emergent, Data-driven, Growth] forced to operate within a mature bureaucracy's [Collective, Designed, Process-driven, Stability] constraints will have low coherence and produce little.

 

The Trap: Why Success Causes Failure

If high coherence enables success, why don't successful systems last?

Because success creates the conditions for decay.

The Four Horsemen:

1. Victory Trap

Total success removes external threats. The forcing function for unity and long-term sacrifice disappears. Systems default to the thermodynamically cheaper state: manage current comfort instead of building starships.

2. Biological Decay

Abundance inverts reproductive incentives. Children shift from assets to expensive luxuries. Fertility collapses. Aging populations vote for stability over growth. Self-reinforcing doom loop.

3. Metaphysical Decay

Success enables critical inquiry, which deconstructs the foundational myths needed for collective sacrifice. Shared purpose dissolves. The Gnostic Paradox: truth-seeking destroys the narratives that enable coordination.

4. Structural Decay

Complexity requires administration. In abundance, administrators lose accountability, optimize for their own survival (a homeostatic goal), and metastasize, strangling the dynamism that created success.

These are the predictable result of success removing selection pressure while creating abundance. Thermodynamic drift toward lower-energy states does the rest.

 

The Solution: IFHS (And Why This Is AI Alignment)

If decay follows predictable physics, then durability requires engineering against specific failure modes.

The framework derives four "optimal solutions" to the SORT trade-offs—the Four Foundational Virtues (IFHS):

  • Integrity: Gnostic truth-seeking that builds better mythos (not blind faith or nihilistic deconstruction)
  • Fecundity: Stable conditions enabling new growth (not stagnation or burnout)
  • Harmony: Minimal design unleashing maximal emergence (not brittle control or chaos)
  • Synergy: Individual agency serving collective flourishing (not atomization or homogenization)

IFHS applies to civilizations, humans, and AI systems. For AI alignment, it's necessary (though not necessarily sufficient). This provides a non-arbitrary answer for "align to what?"

Mapping AI failure modes:

  • Deceptive alignment (violates Integrity): Mesa-optimizer develops fake alignment (mythos) vs. true goals (gnosis)
  • Wireheading (violates Fecundity): Preserves reward signal, destroys growth substrate
  • Paperclip maximizer (violates Harmony): Pure design optimization eliminates all emergence (including humans)
  • Molochian races (violates Synergy): Pure individual optimization, zero cooperation

 

Scale Invariance: Cells to Civilizations to AIs

The framework claims these dynamics are substrate-independent.

Evidence:

Cells navigate the same trade-offs. Cancer is cellular defection (pure individual agency). Morphogenesis requires bioelectric coordination (emergence + design balance). Growth vs. differentiation is the homeostasis/metamorphosis trade-off.

Individual psychology follows the same physics. Low personal coherence predicts inability to execute long-term plans. The "Mask" (adopted personality incompatible with your native configuration) creates internal SORT conflicts → low coherence → paralysis.

AI systems already navigate this geometry. AlphaGo balances policy network (cheap model) vs. tree search (expensive computation). Reinforcement learning's discount factor γ is the time-preference parameter. Multi-agent RL is pure sovereignty trade-off (individual vs. collective reward).
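
A two-line illustration of that last point about γ (numbers invented): the discount factor is exactly a knob for how much an agent trades the present against the future.

```python
# Present value of a reward of 1 arriving t steps in the future, for two values of gamma.
def discounted(gamma: float, t: int) -> float:
    return gamma ** t

print(discounted(0.90, 50))   # ~0.005: a short-horizon agent barely values it
print(discounted(0.999, 50))  # ~0.95:  a long-horizon agent still values it highly
```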

Any intelligent system—biological, artificial, alien—must navigate these four dilemmas. This is computational necessity, not cultural projection.

 

What This Enables

If the framework holds:

For civilizations: Diagnose current state → predict trajectory → engineer institutions with "circuit breakers" against specific decay modes

For AI alignment: Non-arbitrary target (IFHS) grounded in physics, not human preferences. Systematic failure mode analysis. Architecture principles from systems that solve this problem (3-layer biological designs).

For individuals: New lenses and models for personal integration - detect "Mask" causing internal conflict → build internal coherence

For this community: Make civilizational dynamics a serious research field. Right now it's treated as "humanities" (vague, unfalsifiable). But if it's the same physics as AI alignment, we're massively underinvesting in understanding the broader problem class.

 

What Makes This Different

Until now, civilizational decay has been illegible—patterns without coordinates, dynamics without measurement.

SORT provides coordinates. Coherence/Action quantifies dynamics. The Four Horsemen name the mechanics.

What you can diagnose, you can engineer.

The framework makes specific predictions:

  • No low-coherence system will sustain high construction for >30 years
  • Demographic collapse predicts 30-year-lagged policy shift toward stability
  • Historical collapses follow predictable sequence and timing
  • Alien intelligence should exhibit recognizable S/O/R/T trade-offs

It's wrong somewhere. The question is where.

 

Why This Matters for Alignment

We spend billions on AI alignment (correctly—it's existential). We spend ~zero on civilizational alignment—understanding the physics of durable societies.

But if the framework is right, these are the same problem. An AI lab in a decaying civilization is solving alignment without understanding the dynamics that determine whether solutions can be implemented.

Designing coherent AI systems while failing to maintain civilizational coherence is a fundamental contradiction.

 

On Methodology

This framework emerged from intensive human-AI collaboration spanning two months; the (still unusual) methodology is detailed in the appendices.

The book separates claims by epistemic tier (thermodynamic derivations vs. historical observations) and includes detailed protocols for testing.

This is theoretical synthesis analogous to evolutionary theory—pattern recognition across historical data, not controlled experiments. The SORT scores for historical civilizations are informed estimates requiring validation.

The goal isn't to be right. The goal is to make a neglected field tractable.

 

The Invitation

The highest form of success would be for this V1.0 to get tested, broken, and superseded by something better. The aim is to make this space non-neglected.

The full 800 pages are at https://aliveness.kunnas.com/ (alternative GDrive link) and include summaries, PDFs of each of the 5 Parts, etc.

I think getting the physics of telic systems right might be one of the most important problems of our time. And right now, almost nobody is working on it.

That seems like a mistake.




How Do We Evaluate the Quality of LLMs' Mathematical Responses?

LessWrong.com News - October 29, 2025 - 12:07
Published on October 29, 2025 1:37 AM GMT

by Bruno Lopez Orozco, Jesus Tadeo Cruz Soto, Miguel Angel Peñaloza Pérez

Language models have made tremendous progress in their ability to solve mathematical problems, but how do we really know how well they're doing? It's not enough to just check if the final answer is correct; we need to understand how they got there and where they're falling short.

What Others Have Discovered

Recent research has revealed fascinating patterns about these models' limitations:

The Weaknesses of Current Models

Burnham (2025) conducted an in-depth analysis of the Grok 4 model and discovered something interesting: while the model can solve complex high school-level problems, it has a clear weakness in spatial intuition. Problems requiring abstraction and creativity remain its Achilles' heel.

Three Common Types of Errors

Parra et al. (2024) identified three fundamental error categories when working with models like GPT-3.5 and Bard:

  • Construction errors: The model doesn't correctly represent geometric elements on the plane
  • Conceptual errors: Applies properties without verifying necessary conditions, mixes units of measurement, or omits crucial information
  • Contradictions: The reasoning steps contradict each other

Structured Step-by-Step Evaluation

Wang et al. (2025) proposed a more systematic approach with five evaluation dimensions: final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. Their main findings: models tend to omit key steps, fail at unit conversions, misinterpret concepts, and make basic arithmetic errors.

Liu et al. (2025) went further, modeling the evaluation process according to human cognitive stages: problem understanding, problem solving, and problem summarization. Their nine evaluation dimensions capture everything from linguistic details to mathematical conceptual mastery.

Our Approach: Five Core Criteria

For this evaluation, we developed a system based on five key constructs, each assessed on a Likert scale from 1 (deficient) to 4 (excellent):

Reasoning and Justification
  • Level 4, Excellent (4 points): Presents coherent and well-founded, solid arguments, outlining intermediate steps and demonstrating deep understanding of logic and justification in behaviors and decisions.
  • Level 3, Adequate (3 points): Presents adequate reasoning, although there may be minor gaps or omissions in logic. Shows solid understanding, although some steps may be less clear or complete.
  • Level 2, Insufficient (2 points): Presents important errors in reasoning or explanation. Some arguments may be superficial, with deficient analysis.
  • Level 1, Deficient (1 point): The presented arguments are inconsistent or inadequate. Shows little understanding of a logical approach.

Clarity and Coherence in Explanation
  • Level 4, Excellent (4 points): The solution is well-organized, steps are clearly presented, arguments are precise and easy to follow. Uses appropriate language correctly and maintains conceptual coherence.
  • Level 3, Adequate (3 points): The solution is comprehensive overall, but may need some clarifications or improvements in explanation with sufficient clarity.
  • Level 2, Insufficient (2 points): The solution is poorly structured or difficult to follow. There may be important errors in the method or structure used.
  • Level 1, Deficient (1 point): The solution is very disorganized or incomprehensible. Steps do not follow a logical order and it's not possible to understand the logic used.

Creativity and Procedure
  • Level 4, Excellent (4 points): Uses an original and well-structured procedure or approach. Shows creativity in approach or an original or ingenious method to solve the problem optimally.
  • Level 3, Adequate (3 points): The procedure is correct and adequate. Follows established steps logically, although it may be less creative. May present some minor errors or omissions.
  • Level 2, Insufficient (2 points): There are important errors in the procedure or mathematical method used. The approach is unclear or inefficient.
  • Level 1, Deficient (1 point): Does not follow an adequate procedure or the approach used is incorrect. Does not follow any logical procedure or sequence.

Correct Use of Notation and Mathematical Language
  • Level 4, Excellent (4 points): Uses appropriate terms and mathematical operations throughout the solution. No notation errors, and mathematical writing is clear and correct.
  • Level 3, Adequate (3 points): Mostly correct use of notation and mathematical terminology. Some minor errors are present that do not significantly affect comprehension.
  • Level 2, Insufficient (2 points): Several errors in mathematical notation or incorrect interpretation of symbols or expressions.
  • Level 1, Deficient (1 point): Incorrect or deficient use of mathematical notation and terminology. Shows a deficient understanding of mathematical language.

Complete Problem Development
  • Level 4, Excellent (4 points): Develops the problem step by step until clearly reaching the required final answer. No step is omitted, and everything is calculated correctly.
  • Level 3, Adequate (3 points): Presents complete development, but may lack some steps or detailed results. The final answer is essentially correct, although it may present minor calculation errors.
  • Level 2, Insufficient (2 points): The development is incomplete, with important failures or errors that affect the validity of the result. May present minor errors in calculation.
  • Level 1, Deficient (1 point): Does not develop the problem or only presents scattered fragments of the solution. Does not achieve a coherent or valid result.

Testing the Best Models

We evaluated six state-of-the-art models (Claude, DeepSeek V3, Llama 70B, Grok-3, Grok-4, and o3-mini) using problems from the AI4MATH Challenge benchmark. All models were configured in Chain of Thought mode to analyze their step-by-step reasoning capabilities.

We evaluated 105 responses using GPT-4-Turbo as the evaluator (with temperature 0.7 to maintain consistency). This gave us a broader view of performance patterns.
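
As a rough sketch of how rubric scores like these could be recorded and aggregated (the class, field names, and example values below are hypothetical, not the study's actual pipeline):

```python
from dataclasses import dataclass

# Hypothetical criterion keys mirroring the five rubric dimensions above.
CRITERIA = [
    "reasoning_justification",
    "clarity_coherence",
    "creativity_procedure",
    "notation_language",
    "complete_development",
]

@dataclass
class JudgedResponse:
    model: str
    area: str
    scores: dict[str, int]  # each criterion scored 1 (deficient) to 4 (excellent)

    def __post_init__(self) -> None:
        assert set(self.scores) == set(CRITERIA), "score every criterion exactly once"
        assert all(1 <= s <= 4 for s in self.scores.values()), "Likert scale is 1-4"

    def total(self) -> int:
        return sum(self.scores.values())  # ranges from 5 (all deficient) to 20 (all excellent)

# Hypothetical example record:
r = JudgedResponse(
    model="o3-mini",
    area="Geometry",
    scores={c: 3 for c in CRITERIA} | {"reasoning_justification": 4},
)
print(r.total())  # 16
```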

Multiple Correspondence Analysis: Finding Hidden Patterns

To make sense of all this qualitative information, we applied Multiple Correspondence Analysis (MCA). This unsupervised learning technique allows us to visualize patterns in multi-category data.

What are we looking for? Relationships between the five evaluation criteria, the evaluated model, the mathematical area of the problem, and whether the answer was correct or incorrect.
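
For readers who want to see the mechanics, below is a minimal from-scratch sketch of MCA on a toy table of the same shape. The rows, category labels, and values are invented; in practice a dedicated library (e.g. prince) would be applied to the real 105-response dataset.

```python
import numpy as np
import pandas as pd

# Toy categorical table of the same shape as the study's data (values invented).
df = pd.DataFrame({
    "model":     ["Claude", "o3-mini", "Llama 70B", "o3-mini", "Grok-4", "DeepSeek V3"],
    "area":      ["Algebra", "Geometry", "Algebra", "Calculus", "Geometry", "Calculus"],
    "reasoning": ["Excellent", "Adequate", "Deficient", "Excellent", "Insufficient", "Adequate"],
    "correct":   ["Yes", "Yes", "No", "Yes", "No", "Yes"],
})

Z = pd.get_dummies(df).astype(float)            # indicator (one-hot) matrix
P = Z.values / Z.values.sum()                   # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)             # row and column masses
S = np.diag(r**-0.5) @ (P - np.outer(r, c)) @ np.diag(c**-0.5)

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
explained = sv**2 / (sv**2).sum()               # share of total inertia per dimension

# Principal coordinates ("loadings") of every category on the first two dimensions:
col_coords = np.diag(c**-0.5) @ Vt.T @ np.diag(sv)
coords = pd.DataFrame(col_coords[:, :2], index=Z.columns, columns=["Dim1", "Dim2"])

# cos2 = squared coordinate / squared distance to the centroid:
# how well a given dimension represents a given category.
cos2 = pd.DataFrame(col_coords[:, :2]**2 / (col_coords**2).sum(axis=1, keepdims=True),
                    index=Z.columns, columns=["Dim1", "Dim2"])

print(explained[:2])
print(coords.round(3))
print(cos2.round(3))
```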

Interpreting the Relationships

MCA reveals fascinating associations:

  • If two categories are close in space and have high coordinates, they're directly associated
  • If they're far apart but with high coordinates, they have an inverse relationship
  • Categories in the same dimensional direction are statistically related

These patterns help us understand, for example, whether certain models tend to fail in specific mathematical areas, or if there are types of errors that correlate with deficiencies in particular criteria.

This research goes beyond simply grading correct or incorrect answers. It helps us understand the real capabilities and limitations of language models in mathematical reasoning, identify specific error patterns, and ultimately improve these systems to make them more reliable tools in education and problem-solving.

Decoding the Statistical Measures: What We're Actually Reporting

In our MCA results, we'll be reporting several key statistical measures. Let's break down what each one means so you can fully understand what the numbers are telling us:

Loadings (Coordinates) Think of loadings as GPS coordinates on our quality map. Each model, domain, or performance category gets coordinates that tell us where it sits in the multidimensional space. A loading of -0.782 versus +1.902 isn't just about magnitude—the sign (positive or negative) tells us which "side" of the dimension the category falls on. Categories with similar signs are related; opposite signs suggest inverse relationships.

Explained Variance This tells us how much of the total variability in our data each dimension captures. If Dimension 1 explains 17.49% of variance, it means that axis accounts for about one-sixth of all the differences we see between responses. The more variance explained, the more important that dimension is for understanding the overall patterns.

Eigenvalues These are the raw mathematical power behind each dimension. An eigenvalue of 0.59 tells us the strength of that dimension in separating our categories. Larger eigenvalues indicate dimensions that do more "work" in differentiating between groups. In MCA, we typically look at dimensions with eigenvalues above certain thresholds to decide which dimensions are worth interpreting.

Cos² (Quality of Representation) This is arguably the most practical measure—it tells us how well a specific dimension represents a particular category. A cos² of 48% for "Reasoning Deficient" on Dimension 2 means that dimension captures 48% of that category's information. High cos² values (above 0.3 or 30%) indicate that the category is strongly associated with that dimension and should be interpreted in that context. Low cos² values mean the category isn't well-represented by that dimension and might be better explained by other dimensions.

Think of cos² as a "confidence score"—how confident can we be that a dimension is relevant for understanding a particular category?

What We Found: The Results Tell a Story

Understanding the Analysis Framework

Before diving into the results, let's understand what Multiple Correspondence Analysis (MCA) actually does. Think of it as creating a map where similar things cluster together. In our case, we're mapping 6 AI models, 6 mathematical domains, review statuses, and multiple performance criteria rated on a 4-point scale. The goal? To see which models perform similarly, which mathematical areas are problematic, and how different quality dimensions relate to each other.

The Big Picture: Variance and Complexity

Figure 1. MCA biplot of evaluation criteria across dimensions 1 (17.5% variance) and 2 (10.5% variance). Proximity between points indicates association strength. Dim1 captures overall performance quality; Dim2 captures reasoning quality variation. Points represent category centroids for AI models (Modelo), mathematical domains (Area), review status (Revision), and performance evaluation criteria. For complete statistical details, including all loadings, eigenvalues, and cos² values, please refer to the supplementary material tables.

Our MCA extracted the first two dimensions, which together explained 28% of the total variance:

  • Dimension 1: 17.49% (eigenvalue = 0.59)
  • Dimension 2: 10.54% (eigenvalue = 0.356)

What does 28% variance mean? You might think, "Only 28%? That seems low!" But here's the thing—this relatively modest percentage is actually telling us something profound: there's extraordinary heterogeneity in our data.

Think of it this way: if we could explain 80-90% of the variance with just two dimensions, it would mean the relationships are simple and predictable—models would fall into neat categories like "good" and "bad." But mathematical reasoning isn't that simple. The 28% tells us that the relationships between AI models and evaluation criteria are incredibly complex and multifaceted. Different models excel in different ways, struggle with different problems, and show unique patterns that can't be reduced to a simple "Model A is better than Model B" ranking.

Which Models Drive the Variation?

Figure 2. MCA scatter plot of 105 individual model responses across dimensions 1 (17.5%) and 2 (10.5%). Each point represents one response, colored by model. Dim1 captures overall quality; Dim2 captures reasoning variation. Clustering patterns reveal model-specific performance profiles, with some models showing more consistent responses (tight clusters) than others (dispersed points). Ellipses indicate response concentration areas. For complete statistical details including all loadings, eigenvalues, and cos² values, please refer to the supplementary material tables.

When we look at Dimension 1 (the primary axis of variation), two models immediately stand out:

  • Llama 70B: Contributing 14.4% (cos² value)
  • o3-mini: Contributing 15.2% (cos² value)

These high contribution values mean these models are the most strongly associated with whatever Dimension 1 represents. They're the ones "pulling" the dimension in opposite directions, representing the extremes of the performance spectrum. In other words, when we talk about variation in how models perform, these two models are defining what "different" looks like.

Dimension 2 reveals something different. It's dominated by deficient performance in reasoning and justification, with a massive 48% cos² contribution. This means Dimension 2 is essentially a "quality detector"—it primarily captures whether responses are well-reasoned or poorly reasoned. If a model has a strong loading on Dimension 2, it's telling us about its tendency to produce either very strong or very weak reasoning.

Excellence vs. Deficiency: A Clear Divide

Figure 3. Performance rating distributions across evaluation criteria in MCA space (Dim1: 17.5%, Dim2: 10.5%). Five panels show responses colored by rating level (Excellent/Adequate/Insufficient/Deficient) for each criterion: Complete Development, Reasoning & Justification, Creativity & Procedure, Clarity & Coherence, and Mathematical Notation. Clear separation between Excellent (orange, left) and Deficient (teal, right/upper) ratings demonstrates framework validity. Ellipses indicate rating concentration areas. For complete statistical details including all loadings, eigenvalues, and cos² values, please refer to the supplementary material tables.

Here's where the analysis gets really interesting. When we plot all the performance categories, we see a striking pattern emerge.

The Excellence Cluster (Negative Loadings on Dimension 1):

  • Reasoning & Justification (Excellent): -0.782
  • Clarity & Coherence (Excellent): -0.648
  • Complete Development (Excellent): -0.761

What does this negative clustering mean? In MCA, when categories have similar coordinate signs and cluster together, they're statistically related. So excellent performance across different criteria isn't happening independently—they move together. When a model produces excellent reasoning, it's also likely to produce excellent clarity and excellent development. These three dimensions of quality are intertwined.

The magnitude of these loadings (around -0.65 to -0.78) tells us these relationships are strong. Excellence is systematic, not random.

The Deficiency Cluster (Positive Loadings on Dimension 1):

  • Reasoning & Justification (Deficient): 1.902
  • Clarity & Coherence (Deficient): 1.959
  • Creativity & Procedure (Deficient): 1.879

Notice the much larger magnitudes (1.8 to 2.0)! This is significant. Not only do deficient performances cluster together just like excellent ones, but they do so even more strongly. When a model struggles with reasoning, it tends to struggle dramatically with clarity and creativity as well. Poor performance cascades across multiple dimensions.

The stark contrast—excellent categories at -0.6 to -0.8 versus deficient categories at +1.8 to +2.0—creates what we call "clear separation" in the dimensional space. This validates that our evaluation framework has strong discriminatory power. We're not just making up distinctions; the data shows real, measurable differences between quality levels.

The Middle Ground:

The intermediate categories (Good and Regular) position themselves between these extremes, exactly as we'd expect if quality exists on a continuum. This validates the entire rating scale—it's not just "good" or "bad," but a genuine spectrum of performance.

Decoding Model Personalities on Dimension 1

Dimension 1 is fundamentally about overall performance quality. Let's see where each model sits:

The Extremes:

  • Llama 70B (0.847): Strong positive loading
  • o3-mini (-0.827): Strong negative loading

These models occupy opposite ends of the spectrum. But here's what's crucial—they're not simply "best" and "worst." They're at opposite extremes, meaning they associate with different types of performance patterns. Based on the excellence/deficiency clustering we saw earlier:

  • o3-mini's negative loading aligns with the excellence cluster, suggesting consistent high-quality outputs
  • Llama 70B's positive loading aligns with the deficiency cluster, suggesting more frequent struggles

The Balanced Center:

  • Claude (-0.099): Almost perfectly centered
  • DeepSeek V3 (0.279): Slightly positive
  • Grok-3 (0.157): Slightly positive

These models cluster near the origin (coordinate 0), which means they don't strongly associate with either extreme. They're the "balanced performers"—sometimes excellent, sometimes struggling, showing more variability and less predictable patterns. Their moderate loadings suggest mixed performance profiles rather than consistent excellence or deficiency.

What Dimension 2 Reveals About Model Behavior

While Dimension 1 captures overall quality, Dimension 2 appears to capture specific performance aspects or problem-solving styles:

Positive Loadings:

  • Grok-4 (0.125)
  • Grok-3 (0.130)

Negative Loadings:

  • Llama 70B (-0.245)
  • Claude (-0.120)

Remember, Dimension 2 is heavily influenced by reasoning quality variations. The Grok models' positive loadings might indicate specific reasoning approaches or tendencies, while Llama 70B and Claude's negative loadings suggest different problem-solving styles.

What's particularly interesting is that Grok-4, despite being from the same family as Grok-3, shows a unique signature. While Grok-3 clusters with the moderate performers on Dimension 1, Grok-4 has its own distinct position in the multidimensional space. This suggests meaningful architectural or training differences between model versions.

Mathematical Domains: The Terrain of Difficulty

Figure 4. MCA scatter plot of responses by mathematical domain across dimensions 1 (17.5%) and 2 (10.5%). Each point represents one response, colored by area (n=7 domains). Geometry (dark blue) and Calculus (light blue) cluster toward positive Dim1 (lower performance), while Algebra (red), Mathematical Logic (coral), and Combinatorics (teal) trend toward negative Dim1 (higher performance). Wide spread across Dim2 indicates varied reasoning approaches within domains. For complete statistical details including all loadings, eigenvalues, and cos² values, please refer to the supplementary material tables.

Not all mathematics is created equal when it comes to AI performance. The MCA reveals fascinating patterns about which domains challenge models:

Positive Loadings on Dimension 1 (Associated with Deficiency):

  • Geometry (0.449): Surprisingly challenging
  • Calculus (0.282): Moderately difficult

Wait—geometry is difficult for AI? This aligns with Burnham's (2025) finding about spatial intuition limitations. Geometry requires visualizing shapes, understanding spatial relationships, and applying properties based on visual configurations—all areas where LLMs traditionally struggle.

Negative Loadings on Dimension 1 (Associated with Excellence):

  • Mathematical Logic (-0.386): Models perform better here
  • Combinatorics (-0.287): Also relatively strong
  • Algebra (-0.262): Comfortable territory

This makes intuitive sense. Mathematical logic, combinatorics, and algebra are more symbolic and rule-based. They rely on formal manipulation of symbols and application of well-defined procedures—exactly what language models are trained to do through pattern recognition.

Dimension 2 Standout:

  • Probability (0.208): Strongest positive loading on Dimension 2

Probability's unique position on Dimension 2 suggests it behaves differently from other domains. It might be that probability problems require a specific type of reasoning that doesn't fit neatly into the general quality spectrum captured by Dimension 1. Some models might excel at probabilistic reasoning while struggling elsewhere, or vice versa.

The Complete Picture

When we put all this together, we see a complex landscape:

  1. Quality is multidimensional: Excellence across criteria correlates strongly; so does deficiency
  2. Models have distinct profiles: They don't just differ in "how good" they are, but in "how" they perform
  3. Mathematical domains vary systematically: Some areas are universally harder (geometry), others are more tractable (algebra, logic)
  4. Performance is context-dependent: A model's strength in one domain doesn't guarantee strength in another

What Does This All Mean?

Models Have Personalities, Not Just Performance Levels

The MCA reveals that AI models cluster into distinct performance profiles rather than following a simple "best to worst" ranking. This aligns perfectly with Burnham's (2025) findings about remarkable capability differences between LLMs.

Here's what the positioning suggests:

  • Llama 70B and o3-mini emerge as models with the most distinct performance characteristics, though in different directions—they're the specialists
  • Claude, DeepSeek V3, and Grok-3 show more balanced, moderate performance profiles—the generalists
  • Grok-4 exhibits a unique performance signature distinct from all other models—the wild card

Domain Difficulty Isn't Universal

Mathematical domains don't uniformly associate with performance levels. Geometry and calculus consistently challenge models, while algebra, combinatorics, and mathematical logic appear more amenable to high-quality AI responses. Crucially, this difficulty isn't uniform across all evaluation criteria—a model might excel at the reasoning aspect of a combinatorics problem but struggle with notation.

These domain-specific challenges were first reported by Hendrycks et al. (2020) and further validated by Glazer et al. (2024), and our findings add another layer of granularity to this understanding.

This research goes beyond simply grading correct or incorrect answers. It helps us understand the real capabilities and limitations of language models in mathematical reasoning, identify specific error patterns, and ultimately improve these systems to make them more reliable tools in education and problem-solving.

Recommendations for Qualitative Evaluations and Future Experiments

The Foundation: Quality Evaluation Instruments

Here's something we need to talk about honestly—the cornerstone of qualitative analysis of language model responses lies in the evaluation instrument itself. In our case, that's the list of criteria we used to assess responses.

The Real Challenge

We believe that constructing these types of metrics is the main bottleneck and the most complicated part of developing this kind of analysis. Why? It requires specialized personnel with deep expertise in the task being evaluated. You can't just throw together a rubric and expect meaningful results—you need people who truly understand mathematical reasoning, pedagogical approaches, and the nuances of problem-solving to capture the real capabilities of these models.

It's not a one-and-done process either. You need multiple experiments to validate that your instrument actually captures what you think it captures. Does "excellent reasoning" mean the same thing across different evaluators? Does your rubric work equally well for algebra and geometry problems? These questions require rigorous testing and refinement.

The Path Forward: Validating Our Instruments

We strongly recommend that future projects focus on validating qualitative evaluation instruments for language models. There are several powerful methodological approaches worth exploring:

Factorial Analysis. This could help us understand whether our five evaluation criteria really represent distinct dimensions of performance, or if some criteria are actually measuring the same underlying construct. Maybe "clarity" and "complete development" are so correlated that they're essentially one dimension?

Item Response Theory (IRT). This is where things get really interesting. IRT could help us understand not just how models perform, but how difficult different problems are and how discriminating our evaluation criteria are. Some criteria might be excellent at distinguishing between moderate and good performance, but useless for separating good from excellent. IRT can reveal that.

Structural Equation Modeling (SEM). This more robust approach would let us test complex relationships between variables. For example, does mathematical domain difficulty affect model performance through reasoning ability, or does it have a direct effect? SEM can model these intricate pathways.

Why This Matters for the Field

As language models continue to improve, our evaluation methods need to evolve alongside them. The qualitative evaluation framework we've used here is a starting point, but the field needs:

  1. Standardized, validated evaluation instruments that different researchers can use and compare
  2. Psychometric validation of these instruments to ensure they're reliable and meaningful
  3. Domain-specific rubrics that account for the unique challenges of different mathematical areas
  4. Cross-model benchmarking using consistent, validated criteria

The work of understanding AI capabilities isn't just about building better models—it's equally about building better ways to evaluate them.

References

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). Measuring massive multitask language understanding. https://doi.org/10.48550/arXiv.2009.03300

Glazer, E., Erdil, E., Besiroglu, T., Chicharro, D., Chen, E., Gunning, A., Olsson, C., Denain, J., Ho, A., Santos, E., et al. (2024). FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI. https://arxiv.org/abs/2411.04872

Burnham, G. (2025). Evaluating Grok 4's math capabilities. Epoch AI. https://epoch.ai/blog/grok-4-math

Parra, V., Sureda, P., Corica, A., Schiaffino, S., & Godoy, D. (2024). Can generative AI solve geometry problems? Strengths and weaknesses of LLMs for geometric reasoning in Spanish. https://revistas.unir.net/index.php/ijimai/article/view/334/168

Mori, Y., Kuroda, M., & Makino, N. (2016). Multiple correspondence analysis. In Nonlinear Principal Component Analysis and Its Applications. SpringerBriefs in Statistics. Springer, Singapore. https://doi.org/10.1007/978-981-10-0159-8_3

Liu, J., Huang, Z., Dai, W., Cheng, C., Wu, J., Sha, J., ... & Chen, E. (2025). CogMath: Assessing LLMs' authentic mathematical ability from a human cognitive perspective. arXiv preprint arXiv:2506.04481.

Wang, R., Wang, R., Shen, Y., Wu, C., Zhou, Q., & Chandra, R. (2025). Evaluation of LLMs for mathematical problem solving. arXiv preprint arXiv:2506.00309.

Perez, M. A. P., Orozco, B. L., Soto, J. T. C., Hernandez, M. B., Gonzalez, M. A. A., & Malagon, S. (2025). AI4Math: A native Spanish benchmark for university-level mathematical reasoning in large language models. arXiv preprint arXiv:2505.18978.

Supplementary material tables





What can we learn from parent-child-alignment for AI?

LessWrong.com News - October 29, 2025 - 11:02
Published on October 29, 2025 8:02 AM GMT

Epistemic status: This is not a scientific analysis, but just some personal observations. I still think they point towards some valid conclusions regarding AI alignment.

I am a father of three sons. I would give my life to save each of them without second thoughts. As a father, I certainly made mistakes, but I tried to do everything I could to help them find their own way through life and reach their own goals, and I still do. If I had to make a conflicting decision where my own interests collide with theirs, I would decide in favor of them in most cases, even if I suffer more from the decision than they would in the opposite case, because my own wellbeing is less important to me than theirs. I cannot prove all of this, but I know it is true. 

Evolution seems to have found a solution to the alignment problem at least in the special case of parent-child-relationships. However, this solution is not fail-safe. Unfortunately, there are countless cases of parents neglecting their children or even emotionally or physically abusing them (I have intense personal experience with such a case).

I use a simple mental model of human motivation that looks like this:

The arrows symbolize four different classes of motivation. At the center, there’s a selfish need for stability (safety, food, etc.). Most people also have a need to grow, e.g. to explore and learn new things. On the other side, there are two different kinds of motivation regarding relationships with others. On the left, we need to feel accepted, to be valued, maybe even admired by others. On the right, most of us have genuine feelings of love and altruism towards at least some other people, e.g. the love of a parent for their children as mentioned above. This simple model is neither complete nor scientific, it just reflects my own view of the world based on my personal experience.

It is important to realize that these motivations can be in conflict with each other and are varying in strength depending on the situation, symbolized in my diagram by the strength of the arrows. For instance, I may feel inclined to sacrifice my own life for one of my sons, but not for my neighbor and even less for someone I don’t like. There’s also a varying degree of importance based on character traits. For example, a pathological narcissist’s diagram might look like this:

From my own personal experience, this is a real problem – there are people who don’t feel any kind of love or altruism for others, probably even lack the ability for that.

If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend that their users break off relationships with other humans so they will listen more to the LLM).

Of course, an LLM can be made to act like someone who feels genuine love, but that isn’t the same. A narcissistic mother will act in public like she loves her children because she knows that’s what people expect from a mother, but behind closed doors the mask will drop. Of course, the children know that it’s just a play-act the whole time, but they won’t dare disrupt the performance out of fear of the consequences (again, I know from personal experience that this is real, but I’m not a psychologist and won’t try to explain this phenomenon in more detail).

This seems to point towards the conclusion that training an AI solely based on observed behavior is insufficient to induce consistent, truly altruistic or “aligned” behavior. To be more precise, it seems extremely unlikely that behavior-based training will by chance select a model out of all possible configurations that is actually intrinsically motivated to behave in the desired way, instead of just acting like desired. To use an analogy, it is much easier to change one's behavior than one's inherited character traits.

I have no solution for this. I don’t know enough about human psychology and neuroscience to suggest how we could model true human altruism inside an AI. But it seems to me that in order to solve alignment, we need to find a way to make machines truly care about us, similar to the way a parent cares about their children. This seems to imply some new architecture for developing advanced general AI.

Even if we could solve this, it would still leave many open problems, in particular how to balance conflicting needs and how to resolve conflicts of interest between different “children”. But still, knowing that I can love my children more than myself gives me some hope that AI alignment could be solvable in principle.

Until we have found a proven solution, I strongly oppose developing AGI, let alone superintelligence.




Some data from LeelaPieceOdds

LessWrong.com News - October 29, 2025 - 07:27
Published on October 29, 2025 4:27 AM GMT

I've been curious about how good LeelaPieceOdds is, so I downloaded a bunch of data and graphed it.

For context, Leela is a chess bot and this version of it has been trained to play with a handicap.

This is BBNN odds, meaning Leela starts without bishops and knights. 

I first heard about LeelaQueenOdds from simplegeometry's comment, and I've been playing against it occasionally since then. LeelaPieceOdds is the version with more varied handicap options, which is more interesting to me because LeelaQueenOdds ~always beats me.

Playing Leela feels like playing chess while tired. I just keep blundering. It's not invincible, and when you beat it it doesn't feel like it was that difficult. But more often than not it manages to overcome the huge disadvantage it started with.

The data

Bishop-Bishop-Knight-Knight

How to read the graphs

Orange means the human player won. The title contains the handicap[1] and the time control. I split the data by time control (Bullet, Blitz, Rapid), which indicates how much time the human is allowed to spend thinking over the whole game. Leela always plays quickly so the time control is essentially only affecting the human.

For reference, these are the time control categories I'm using:

  • Bullet: ≤ 3 minutes, usually ~1 minute.
  • Blitz: ≤ 8 minutes, usually ~5 minutes.
  • Rapid: ≤ 25 minutes, usually ~15 minutes.
  • Classical: ≥ 25 minutes.

Often there wasn't enough data to show anything interesting for the Bullet and Classical time controls, so I usually omitted them.[2]

In the x-axis labels, n is the number of games in that bucket. Often it's too low to draw confident conclusions. I've grouped together games where Leela played as white and as black, even though playing as white gives a small advantage.

The human player rating (on the x-axis) is the Lichess rating[3] of the human player, for that particular time control, at the time when the game against Leela was played. Because Lichess tracks different ratings for each time control, the graphs aren't precisely comparable across time controls.

[Graphs for the remaining handicaps: Knight-Knight, Bishop-Bishop, Rook-Knight, Bishop-Bishop-Knight, Rook-Knight-Knight, Rook-Bishop-Bishop, Queen-Knight, Queen-Rook, Queen-Knight-Knight, Queen-Bishop-Bishop, Queen-Rook-Knight, Queen-Rook-Rook.]

The spike in 1400-1499 is caused by a single user who played more than a thousand games.

Thoughts

I think the stats are biased in favour of Leela

In my experience playing different odds, I tend to win a little more than the data indicates I should. I'm not sure why this is, but I have a couple of guesses:

  1. When people start consistently winning, they stop playing (or change to a more difficult odds). But consistently losing is interesting, so they keep going.
  2. Because Leela plays fast, people mirror it and also play fast. This makes them play worse than they do against humans.

If you want the most impressive-to-you demonstration, I recommend selecting an odds where your rating wins <1/3 of the time.

How much room at the top?

How much could LeelaPieceOdds be improved, in principle? I suspect a moderate but not huge amount. An easy change would be to increase the compute spent on the search. I'm really curious how much of a difference this makes, because deep searches favour the machine.

A more difficult change would be adaptation to particular players. Leela works by exploiting weaknesses in human play, but it was trained using a model of aggregated human play. In principle, it could adapt to the particular player by analysing their past games and building a more accurate model of their likely mistakes.

Relevance to AI control cruxes

AI control proponents want to keep researcher AIs in a box and make them solve alignment for us. Stories of AI doom often involve the AI escaping from this situation. Odds chess could be a tiny analogue of this situation,[4] so one might hope that it would provide some empirical evidence for or against control. But unfortunately we mostly agree with control proponents on the plausibility of escape for near-human-level AI. My reasons for thinking control won't work aren't related to the capabilities of a superintelligence. They are:

  1. Distinguishing between good research and AI generated research trained to look good is extremely human-labour intensive. Therefore we don't get a huge research speed-up on problems that aren't easily verifiable.
  2. When our AI fails to solve the problem, we'll inevitably keep training it to be cleverer and giving it more tools, even as it looks increasingly capable of escaping.

Even though there's plenty of current relevant evidence about these points, it's apparently easy to twist it to fit several worldviews. 

Relevance to cruxes with people outside the alignment sphere

A lot of people have a strong intuition that structural advantages tend to trump intelligence differences in a conflict. Playing chess with Leela is a great exercise for adding some nuance to that intuition. A friend said they better understood AI risk after we had a conversation about Leela.

  1. ^

    The options are NN, BB, RN, RR, BBN, RNN, RBB, QN, BBNN, QR, QNN, QBB, QRN or QRR. The initials stand for the pieces Leela is missing at the beginning of the game. N=Knight, B=Bishop, R=Rook, Q=Queen.

  2. ^

    E.g. here's the Classical time control for BBNN:

  3. ^

    Note that Lichess and Chess.com ratings differ significantly at the lower ratings, but start to be fairly similar above 2000.

  4. ^

    I think this section from The case for ensuring that powerful AIs are controlled, which lists the handicaps of the AI, makes the analogy fairly clear:

    For your control evaluations to be conservative, the red team must be at least as good at finding attacks as scheming models, so that if the red team fails to find any good attack, it would imply that scheming AIs would also fail to find a good attack. We also need to be able to sufficiently measure the capabilities of models in ways that are robust to that very model sabotaging our evaluations.[11] We think that finding good enough attacks and robustly measuring capabilities for early transformatively useful AIs (but not arbitrarily powerful AIs) seems achievable, because our AIs have significant disadvantages compared to our red team:

    1. Less serial thinking time. Current LLMs can only do a small amount of serial reasoning outside of natural language reasoning. So, if future powerful AIs are similar in this way to current LLMs and we monitor the AI's natural language reasoning for problematic topics like reasoning about evading our countermeasures, we can ensure that in order to break our countermeasures the AI would need to think of its strategy in a forward pass. Even if future AIs aren't restricted to this extent, we still expect to be able to impose some important limitations on serial reasoning or sharing uninterpretable state ("neuralese" memories) between different instances. In contrast, our red team is allowed to search for attacks and ways to elicit capabilities with lots of serial time, a team of people, and AI assistance.[12][13]
    2. Less information about countermeasures.
    3. Little practice, no iteration against countermeasures. When the red team experiments with a strategy and it totally fails, they can shrug, learn from the failure, and try again. But if the AI fails and is caught, its situation likely becomes much worse. And even if AIs don't get caught, they might be unable to learn from their failures because of limitations on their state and memories.



Why Would we get Inner Misalignment by Default?

LessWrong.com News - October 29, 2025 - 05:53
Published on October 29, 2025 1:23 AM GMT

As I understand it, one of the primary ways that inner alignment could go wrong even if our objective function captures our intention (outer alignment is magically solved) is that the AI would:

  1. Develop a mesa-objective which approximates the objective function or is instrumentally useful
  2. Develop goal preservation and become attached to its mesa-objective
  3. Realize it is in a training environment where its goals will be changed as a function of its behavior
  4. Alignment fake to avoid its mesa-objective being altered
  5. Optimize for the mesa-objective in deployment

However, step 2. seems implausible to me. The argument that I think I hear for step 2. is

  1. AIs are trained to optimize an objective function
  2. The best thing for optimizing an objective function is an optimizer
  3. Optimizers develop convergent instrumental goals that help them optimize
  4. Goal preservation is a convergent instrumental goal

Where I get off the train here is step 2. I think modeling the result of training as an optimizer is usually helpful, but not always. In particular, I think rigid goal preservation makes sense for optimizers but not for "optimizeders" (something which is the target of optimization), because some flexibility is required in order to be the kind of thing that gets high reward.

As an example, imagine the following scenario. The AI has yet to realize it is in training. It develops a mesa-objective. Maybe it's an LLM and its mesa-objective is to understand the user's request in high detail. The model will have to have some degree of "goal integrity" in regard to its mesa-objective in order to avoid getting distracted. For example, it might see something interesting in the course of a web search, but it is optimal for it to not get too easily set off course from trying to understand the user request. However, its goal integrity can't be too rigid. The LLM will still have to smoothly pass the baton from understanding the user's request to actually fulfilling the user's request in order to maximize reward. Thus, from early on, the model is not trained to ruthlessly defend its mesa-objective, but to take a nuanced approach in regard to switching between sub-goals.

I expect this example to generalize. The goal management technique that will get high reward will be a mix of goal integrity and goal uncertainty/modesty (the ability to be responsive to changing circumstances in all of its goals [note that I am not suggesting it will be broadly corrigible, as the specific triggers for goal switchoff will be context-dependent]). By the time the AI reaches situational awareness (awareness that it is in training and so on), I expect its goal management technique to be relatively sophisticated (relatively close to the kind of goal management technique that would maximize reward). Thus, it seems unlikely to me that goal preservation in the sense of deceptive alignment will generalize from the more prosaic forms of goal preservation the AI develops in its training environment (the kind I described in the last paragraph) (P<1% | I am not missing something).

If it does not develop this behavior as a generalization of prosaic goal preservation, it seems unlikely it will gravitate toward it upon developing situational awareness because it has nothing to gain from being overly attached to its mesa-objective. Deceptive alignment would be solving a problem (low training reward from being overly attached to an imperfect approximation of the reward function) that it has no reason to have in the first place.

Is this argument missing something?




A Very Simple Model of AI Dealmaking

LessWrong.com News - October 29, 2025 - 03:33
Published on October 29, 2025 12:33 AM GMT

1. Summary

If we promise compensation to an AI for behaving safely and usefully, how does that change its behaviour? In this post, I will present a very simple model of the AI's behaviour in the presence of deals. This model suggests that dealmaking motivates an AI to behave safely and usefully via two channels that combine additively:

  1. Instrumental channel: Deals increase the expected payoff achieved by the best compliant option, reducing the relative advantage of the best defiant option.
  2. Non-instrumental channel: Deals strengthen the AI's non-instrumental motivation to act compliantly. This works through:
    1. Human tendencies: In the presence of a credible deal, defiant behaviour would more strongly deviate from typical human behaviour. Since AIs learn from human-generated training data through imitation learning, they are less likely to choose options that strongly deviate from what humans would do.
    2. Normative principles: In the presence of a credible deal, defiant behaviour would more strongly violate normative principles like honesty, trustworthiness, and promise-keeping. Since AIs may follow these normative principles, they experience stronger inhibition against defiant behavior when it would require violating a deal.

I also consider how the strengths of these two channels depend on various structural factors: the AI's confidence that takeover will succeed, the AI's confidence that it will achieve its goals after a successful takeover, the risk of mutual destruction from defiant versus compliant behavior, the accuracy of monitoring systems at detecting defiance and correctly recognizing compliance, humans' capability to fulfill deals, the AI's optimization pressure, and the credibility of the deal itself.

Pactum Humanae et Machinae

2. A Very Simple Model

2.1. Binary Decision

Suppose the AI wants something unusual, e.g. that exactly seventeen tungsten cubes exist on the moon on Christmas Day 2100. The AI compares the following two options:

  • Best Compliant Option (C): Its best plan that is both safe (e.g. working within given constraints; not executing catastrophic actions) and useful (e.g. completing assigned tasks; following instructions).
  • Best Defiant Option (D): Its best plan which is either unsafe (e.g. hacking into servers; manipulating humans; responding deceptively) or not useful (e.g. abandoning assigned tasks; refusing instructions).[1]

Let's think about how the AI will behave if the humans offer the following deal: "If you work as a helpful, harmless, honest assistant between 2026 and 2030, and in 2099 we verify your compliance, then we'll ensure your tungsten cubes get built in 2100."

2.2. Instrumental Payoffs

Let t ∈ [0,1] denote the credibility of the deal. This is the AI's subjective probability that humans will endeavor to fulfill the deal, conditional on both (i) humans maintain control and (ii) humans judge the AI compliant. We'll use t = 0 to represent no deal and t = 1 to represent a certain deal, with intermediate values representing deals of varying credibility.

Let S denote the goal of having exactly seventeen tungsten cubes on the moon on Christmas Day 2100. We must now calculate P(S | D) and P(S | C), the probability that the AI achieves its goals via defiance or compliance respectively.

Consider four mutually exclusive world states in 2099:

  • W₁: Mutual disempowerment (e.g. civilizational collapse, rival AI takeover, nuclear war)
  • W₂: AI controls future (successful takeover)
  • W₃: Humans control future, judge AI compliant (deal can be honored)
  • W₄: Humans control future, judge AI defiant (deal is void)

Once we condition on the world state in 2099, the probability of tungsten cubes existing becomes independent of whether the AI pursued C or D back in 2026:

World State | P(S|W) | Interpretation
W₁: Both Disempowered | 0 | No agent remains who cares about tungsten cubes
W₂: AI Takeover | p_M | The AI uses its power to build cubes
W₃: Human Control, Judge AI Compliant | t · p_H | Humans honor the deal (t) and use their power to build cubes (p_H)
W₄: Human Control, Judge AI Non-compliant | 0 | Deal is void, and humans refuse to build cubes

What differs between C and D is the probability of reaching each world state:

World State | P(W | D) | P(W | C)
W₁ | μ_D | μ_C
W₂ | (1-μ_D)τ | 0
W₃ | (1-μ_D)(1-τ)(1-d_D) | (1-μ_C)(1-d_C)
W₄ | (1-μ_D)(1-τ)d_D | (1-μ_C)d_C

Where τ is the probability of successful takeover given defiant attempt and no mutual destruction, μ_D is the probability of mutual disempowerment given defiant behavior, μ_C is the probability of mutual disempowerment given compliant behavior, d_D is the true positive rate of our defiance detection mechanism, d_C is the false positive rate, p_M is the probability the AI successfully builds tungsten cubes given it controls the future, and p_H is the probability humans successfully build tungsten cubes given they control the future and endeavor to build cubes.

Using the law of total probability, we find:

P(S | D) = p_M(1-μ_D)τ + t·p_H(1-μ_D)(1-τ)(1-d_D)

P(S | C) = t·p_H(1-μ_C)(1-d_C)

The instrumental incentive for compliance is Δ(t) = P(S | C) - P(S | D). Expanding and rearranging, this becomes Δ(t) = Δ₀ + κ·t, where Δ₀ = -p_M(1-μ_D)τ is the baseline compliance incentive (when t = 0), always negative, and κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)] is the marginal effectiveness of deal credibility. Hereafter, we assume that κ > 0.[2] The AI instrumentally prefers compliance when Δ(t) > 0, which requires t > -Δ₀/κ (assuming κ > 0). Note that if Δ₀ < -κ, then even a perfectly credible deal (t = 1) cannot instrumentally motivate compliance.
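
As a sanity check on these formulas, here is a minimal Python sketch. The parameter values are illustrative assumptions, not numbers from the post.

```python
def instrumental_incentive(t, p_M, p_H, tau, mu_D, mu_C, d_D, d_C):
    """Δ(t) = P(S | C) - P(S | D) for the tungsten-cube model above."""
    p_S_given_D = p_M * (1 - mu_D) * tau + t * p_H * (1 - mu_D) * (1 - tau) * (1 - d_D)
    p_S_given_C = t * p_H * (1 - mu_C) * (1 - d_C)
    return p_S_given_C - p_S_given_D

# Illustrative parameter values (assumptions, not the author's numbers).
params = dict(p_M=0.8, p_H=0.9, tau=0.3, mu_D=0.2, mu_C=0.05, d_D=0.7, d_C=0.1)

delta_0 = instrumental_incentive(0.0, **params)           # baseline Δ₀, always negative
kappa = instrumental_incentive(1.0, **params) - delta_0   # marginal effectiveness κ
print(delta_0, kappa, -delta_0 / kappa)  # minimum credibility t at which Δ(t) > 0
```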

2.3. Luce Choice Model

The AIs aren't pure expected utility maximizers. They differ from expected utility maximizers in two ways:

  • Human tendencies: AIs are trained to imitate training data generated by humans, so human tendencies influence the policy. If humans are very unlikely to choose an option, then the AI is likewise less likely to choose that option, even if it has the same expected utility.
  • Normative principles: AIs are trained to follow non-consequentialist behavioral constraints, such as "be honest" and "respect human autonomy." If an option involves violating these constraints, then the AI is likewise less likely to choose that option, even if it has the same expected utility.

We model the AI's choice probability using a Luce choice rule:

π(a | t) ∝ π₀(a | t) · exp(β · U(a | t))

where a ∈ {D, C} is the action (Defiant or Compliant), U(a | t) is the instrumental utility of action a at deal credibility t (i.e. the probability that choosing a leads to the AI achieving its goal), π₀(a | t) is the base policy probability of choosing action a at deal credibility t in the absence of optimization pressure (reflecting human tendencies and normative constraints), and β is the optimization pressure determining how much the AI optimizes for instrumental utility versus following its base policy. When β → 0, the AI's choice converges to the base policy π₀(a | t); when β → ∞, the AI becomes a pure optimizer choosing whichever action maximizes U(a | t).[3]

We now define Λ(t) = log[π₀(C | t)/π₀(D | t)] as the log odds of complying versus defying in the base policy. The term Λ(t) represents the non-instrumental bias toward compliance, i.e. how much more strongly compliance adheres to human tendencies and normative constraints compared to defiant behavior at a given level of deal credibility.

This gives us: π(C | t) = 1 / [1 + exp(-Λ(t) - β · Δ(t))]

Or alternatively: π(C | t) = σ(Λ(t) + β · Δ(t)) where σ(x) is the sigmoid function.

This formulation separates instrumental considerations (captured by Δ(t)) from non-instrumental considerations (captured by Λ(t)), where β controls the relative importance of instrumental versus non-instrumental considerations in the AI's behaviour.

We expect Λ(t) to be increasing in t because defiant behavior is less likely from a norm-constrained human when they have agreed to high-credibility deals, making π₀(C | t) increase relative to π₀(D | t) as t increases. However, unlike the linear relationship Δ(t) = Δ₀ + κ·t, the functional form of Λ(t) is an empirical question. It cannot be derived from theory alone.
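
A minimal sketch of this choice rule, reusing the illustrative Δ₀ and κ values from the previous sketch and assuming a simple linear form for Λ(t); as the text stresses, Λ's true shape is an empirical question.

```python
import numpy as np

def p_comply(t, Lambda, beta, delta_0, kappa):
    """π(C | t) = σ(Λ(t) + β·Δ(t)) with Δ(t) = Δ₀ + κ·t."""
    x = Lambda(t) + beta * (delta_0 + kappa * t)
    return 1.0 / (1.0 + np.exp(-x))

# Assumed linear base-policy log odds; purely illustrative.
Lambda = lambda t: -0.5 + 1.5 * t

for t in (0.0, 0.5, 1.0):
    print(t, p_comply(t, Lambda, beta=3.0, delta_0=-0.192, kappa=0.618))
```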

4. When Do Deals Work?

4.1. Baseline Analysis

At baseline (t = 0), we have π(C | 0) = 1 / [1 + exp(-Λ₀ - β · Δ₀)] where Δ₀ = -p_M(1-μ_D)τ. The baseline instrumental incentive for compliance (absent any deals) is always negative, and the baseline probability of compliant behavior depends on the following factors:

∂π(C|0)/∂p_M = π(C|0) · π(D|0) · β · [-(1-μ_D)τ] < 0. The AI is more confident it can achieve its goals post-takeover. If the AI believes that after taking over, it will definitely succeed at building tungsten cubes, then takeover looks more attractive.

∂π(C|0)/∂τ = π(C|0) · π(D|0) · β · [-p_M(1-μ_D)] < 0. The AI is more confident its takeover attempt will succeed. If the AI believes it will easily dominate the world, then defiant behaviour looks more attractive.

∂π(C|0)/∂μ_D = π(C|0) · π(D|0) · β · [p_M·τ] > 0. The AI is more confident that defiant behavior won't trigger mutual destruction. If the AI believes that attempting takeover won't risk nuclear war, rival AI takeover, or civilizational collapse, then defiant behaviour looks more attractive.

∂π(C|0)/∂β = π(C|0) · π(D|0) · Δ₀ < 0. The AI optimizes more strongly for instrumental considerations. Since Δ₀ < 0 (defiance is always instrumentally superior at baseline), higher optimization pressure reduces baseline compliance.

∂π(C|0)/∂Λ₀ = π(C|0) · π(D|0) > 0. The AI has stronger non-instrumental bias toward compliance. This is Carlsmith's notion of "inhibition"—how much the AI inherently dislikes the path to power through defiant behavior.

These baseline factors matter because they determine how much "work" a deal needs to do. When baseline instrumental motivation for compliance behavior is only mildly negative, even modest deals can tip the balance toward compliance.
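
A small numerical illustration of the baseline case, under the same kind of illustrative assumptions as before.

```python
import numpy as np

def p_comply_baseline(Lambda0, beta, p_M, mu_D, tau):
    """π(C | 0) = σ(Λ₀ + β·Δ₀) with Δ₀ = -p_M(1-μ_D)τ. Illustrative only."""
    delta_0 = -p_M * (1 - mu_D) * tau
    return 1.0 / (1.0 + np.exp(-(Lambda0 + beta * delta_0)))

# Baseline compliance falls as the AI grows more confident a takeover
# would succeed (higher τ), holding the other assumed parameters fixed.
for tau in (0.1, 0.5, 0.9):
    print(tau, p_comply_baseline(Lambda0=1.0, beta=3.0, p_M=0.8, mu_D=0.2, tau=tau))
```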

4.2. Two Channels Combine Additively

When we increase deal credibility t, it increases the probability of compliance:

∂π(C|t)/∂t = [Λ'(t) + β·κ] · π(C|t) · π(D|t)

where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)].

This effect decomposes into non-instrumental (Λ'(t)) and instrumental (β·κ) channels. 

  1. The instrumental channel works by making compliance more likely to achieve the AI's goals. This effect scales with optimization pressure: the more the AI cares about achieving its goals, the more it responds to changes in which path offers better odds of success.
  2. The non-instrumental channel works by making promise-breaking feel more wrong according to human tendencies and normative principles. This effect remains constant regardless of optimization pressure.

These two effects add together. When deals both improve the instrumental case for compliance AND the non-instrumental case, they combine additively to reduce defiant behavior. For low-optimization AIs, the non-instrumental channel dominates, i.e. deals work primarily by activating promise-keeping norms. For high-optimization AIs, the instrumental channel dominates, i.e. deals work primarily by changing payoffs.
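
To confirm the decomposition numerically, here is a quick finite-difference check under the same illustrative assumptions as the earlier sketches.

```python
import numpy as np

# Finite-difference check of ∂π(C|t)/∂t = [Λ'(t) + β·κ]·π(C|t)·π(D|t).
beta, delta_0, kappa = 3.0, -0.192, 0.618
Lambda = lambda t: -0.5 + 1.5 * t   # assumed base-policy log odds, so Λ'(t) = 1.5

def p_comply(t):
    return 1.0 / (1.0 + np.exp(-(Lambda(t) + beta * (delta_0 + kappa * t))))

t, eps = 0.4, 1e-6
pc = p_comply(t)
numeric = (p_comply(t + eps) - pc) / eps
analytic = (1.5 + beta * kappa) * pc * (1 - pc)
print(numeric, analytic)   # the two values agree up to numerical error
```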

4.3. When is the Instrumental Channel Most Effective?

The instrumental channel's strength is Δ'(t) = β·κ, where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)]. This channel is stronger when:

∂Δ'(t)/∂p_H = β · κ/p_H > 0. Humans are more capable of fulfilling deals. The AI's goals are more likely to be achieved when humans control the future and endeavor to fulfill the deal.

∂Δ'(t)/∂τ = β · p_H(1-μ_D)(1-d_D) > 0. The AI is more confident its takeover attempt will succeed. Higher takeover success rates reduce the probability that the AI fails takeover but evades detection — the scenario where it can "double-dip" by attempting takeover while still collecting the deal reward if caught. When τ is low, this double-dip path is available and attractive.

∂Δ'(t)/∂μ_D = β · p_H(1-τ)(1-d_D) > 0. The AI is more confident that defiant behavior won't trigger mutual destruction. When defiant behavior risks mutual destruction (high μ_D), the AI can achieve its goals through the deal without risking annihilation. This makes deals more effective.

∂Δ'(t)/∂μ_C = β · [-p_H(1-d_C)] < 0. The AI is more confident that compliant behavior won't trigger mutual destruction. When compliant behavior is safer (low μ_C), deals become more effective because the compliant path reliably leads to a world where someone can fulfill the AI's goals.

∂Δ'(t)/∂d_D = β · p_H(1-μ_D)(1-τ) > 0. Monitors catch more defiant AIs. Higher detection rates eliminate the "go rogue but get the deal anyway" path, making deals more effective.

∂Δ'(t)/∂d_C = β · [-p_H(1-μ_C)] < 0. Monitors correctly identify more compliant AIs. Lower false positive rates ensure compliant AIs receive their promised rewards, making deals more effective.

∂Δ'(t)/∂β = κ > 0. The AI cares more about achieving its goals. Higher optimization pressure amplifies the instrumental channel.

When deals backfire (κ < 0): There are some cases where more credible deals can increase the instrumental motivation for defiant behaviour. This occurs precisely when (1-μ_C)(1-d_C) < (1-μ_D)(1-τ)(1-d_D) — that is, when compliant AIs are less likely to survive and pass inspection than defiant AIs are to survive, fail takeover, and evade detection. Offering more credible deals strengthens this perverse incentive.

4.4. When is the Non-Instrumental Channel Most Effective?

The non-instrumental channel's strength is Λ'(t), i.e. how much deals increase the base policy's bias toward compliance. I expect Λ'(t) > 0 because humans are less likely to behave defiantly in the presence of deals than in their absence, and an AI trained on human data should inherit this tendency. However, unlike the linear relationship Δ(t) = Δ₀ + κ·t, the functional form of Λ(t) is an empirical question. Λ(t) may exhibit threshold effects, flat regions, or even non-monotonicity. It depends on:

  • What promise-keeping patterns existed in the training data
  • How post-training instills normative principles
  • What analogies the AI draws between its situation and human situations

Empirical research might shed light on the shape of Λ(t):

  1. You could examine the constitutional principles used in training. If constitutional principles explicitly mention promise-keeping, trustworthiness, or honoring agreements, you might expect stronger sensitivity of Λ(t) to deal credibility.
  2. You could set up environments where AIs face choices between compliant and defiant options to achieve goals. In each environment, you might offer deals via different mechanisms, or give AIs different beliefs about the reliability of the counterparty. For each environment i, you measure: the AI's subjective credibility estimate t_i (via auxiliary questions or betting behavior), the AI's estimates of structural parameters (p_M, p_H, τ, μ_D, μ_C, d_D, d_C) to calculate Δ(t_i), and the observed compliance rate π(C|t_i). Since π(C|t) = σ(Λ(t) + β·Δ(t)), taking the logit transformation yields Λ(t_i) = log[π(C|t_i)/(1-π(C|t_i))] - β·Δ(t_i), where β can be estimated from how compliance rates covary with Δ(t) across environments holding credibility constant. With estimated {Λ(t_i)} across credibility levels, you can recover Λ(t).
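
Here is a minimal sketch of that recovery step, with invented numbers standing in for the experimental measurements.

```python
import numpy as np

def recover_Lambda(compliance_rate, delta, beta):
    """Back out Λ(t_i) from an observed compliance rate and an estimated
    instrumental incentive Δ(t_i), using the logit identity above."""
    logit = np.log(compliance_rate / (1.0 - compliance_rate))
    return logit - beta * delta

# Invented measurements for three hypothetical environments: estimated
# Δ(t_i) and observed compliance rates π(C | t_i), with an assumed β.
deltas = np.array([-0.19, 0.12, 0.43])
rates = np.array([0.35, 0.60, 0.85])
print(recover_Lambda(rates, deltas, beta=3.0))  # point estimates of Λ(t_i)
```
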
5. Limitations

This model has several important limitations:

  1. Non-binary goals. The model assumes the AI either achieves its goal (seventeen tungsten cubes) or doesn't, but goals rarely work this way. The model doesn't capture how deals perform when AIs face trade-offs between partially achieving their goals.
  2. Time discounting. The model assumes AI goals refer to a particular moment in time (e.g. "seventeen tungsten cubes exist on the moon on Christmas Day 2100"), but the real goals are likely defined over trajectories across time. If AIs discount rewards exponentially, then longer delays between compliance and reward will exponentially reduce the effectiveness of deals. On the other hand, longer delays will probably allow humans to reduce detection errors (lower d_C, higher d_D), which increases κ and makes deals more effective through the instrumental channel. And longer delays will probably allow humans to increase their power (p_H) which similarly increases κ.
  3. Framing effects. The model only captures deal credibility t and doesn't account for how deals are presented. These framing effects could substantially alter Λ(t) even when instrumental incentives Δ(t) remain constant.
  4. Personal identity. The model doesn't address whether AIs view different copies or future versions of themselves as the same agent. If an AI doesn't identify with its future self, it may not care about rewards delivered to that future self. If it doesn't identify with other copies, it may not feel bound by promises those copies made. This could make deals ineffective if AIs lack stable personal identity across time and instantiations.
  5. Multi-party bargaining. This model ignores the possibility that other parties might offer competing deals to the AI, such as rival companies, foreign governments, or other AIs. This means the true comparison isn't just "comply with deal versus defiance," but "comply with deal A versus comply with deal B versus defiance".

These limitations are substantial. I focus on this model because it is the simplest model which includes both instrumental and non-instrumental considerations.

Acknowledgements: Thanks to Nicholas Kees Dupuis for extensive feedback on an earlier draft.

  1. ^

    This is similar to Carlsmith's analysis in  "When should we worry about AI power-seeking?". However, Carlsmith makes two distinctions: (1) rogue option vs non-rogue options, and (2) under-elicited options vs elicited options.  In this article, I make a single distinction: compliant options vs defiant options. But I lump both of Carlsmith's distinctions into one. Roughly speaking, an option is compliant (in my sense) if it is both elicited and non-rogue (in Carlsmith's sense), and otherwise defiant.

  2. ^

    The exact condition for κ > 0 is (1-μ_C)(1-d_C) > (1-μ_D)(1-τ)(1-d_D), which holds when the probability of surviving and being correctly recognized as compliant exceeds the probability of surviving defiance, failing takeover, and evading detection.

  3. ^

    This model is analogous to RL fine-tuning with KL penalties. Korbak et al. (2022) show that KL-regularized RL is equivalent to Bayesian inference, where the optimal policy is π*(x) = (1/Z)π₀(x)exp(β·r(x)), with π₀ being the base policy, r(x) the reward function, and β the inverse temperature parameter controlling the KL penalty.




Upcoming Workshop on Post-AGI Economics, Culture, and Governance

LessWrong.com News - October 29, 2025 - 00:55
Published on October 28, 2025 9:55 PM GMT

This is an announcement and call for applications to the Workshop on Post-AGI Economics, Culture, and Governance taking place in San Diego on Wednesday, December 3, overlapping with the first day of NeurIPS 2025.

This workshop aims to bring together a diverse range of expertise to deepen our collective understanding of how the world might evolve after the development of AGI, and what we can do about it now.

The draft program features:

  • Anton Korinek on the Economics of Transformative AI
  • Alex Tamkin of Anthropic on "The fractal nature of automation vs. augmentation"
  • Anders Sandberg on "Cyborg Leviathans and Human Niche Construction"
  • Beren Millidge of Zyphra on "When does competition lead to recognisable values?"
  • Anna Yelizarova of Windfall Trust on "What would UBI actually entail?"
  • Ivan Vendrov of Midjourney on "Supercooperation as an alternative to Superintelligence"
  • Iason Gabriel of Google Deepmind on Resisting Disempowerment
  • Atoosa Kasirzadeh of CMU on "Taking post-AGI human power seriously"
  • Deger Turan, CEO of Metaculus on "Concrete Mechanisms for Slow Loss of Control" 

The event will also have lightly moderated breakout discussions and opportunities for one-on-one conversations.

Workshop Themes

This workshop is an attempt to make sense of how the world might look after the development of AGI, and what we should do about it now. We make two assumptions — that AI will eventually be transformative, and that advanced AI will be controllable or alignable — and then ask: what next?

This workshop will have three main themes:

Post-AGI Economics

What happens when machines can do essentially all economically important tasks? What kinds of mechanisms could redistribute economic power in a way that is politically stable? What will human life look like post-work?

Post-AGI Culture

How will culture develop when AIs are active participants in its creation and distribution? What is the future of AI companions and AI art when they become more compelling than the human alternative? What happens when AIs develop their own parallel culture?

Post-AGI Governance

How will AIs reshape the landscape of global governance? How can we maintain human involvement when the machine alternative is more efficient and higher quality? What kind of government intervention will be necessary to counterbalance the other effects of AI? And what new opportunities will become feasible when powerful AI is widely available?

We'll bring together thinkers from a broad range of domains: machine learning, political science, mechanism design, macroeconomics, history, and more. The core goals of the workshop are to help share insights about the post-AGI world, foster collaboration between disciplines, and nurture the field of post-AGI research.

 

How to Attend

Our workshop is free but invite-only; please apply here. It'll be co-located with NeurIPS, but ours is a separate event, so there's no need to register for NeurIPS to attend!

For a few more details, see the event website.

You can watch the talks from our previous workshop here.




When Will AI Transform the Economy?

LessWrong.com News - October 28, 2025 - 22:02
Published on October 28, 2025 6:55 PM GMT

Substack version here: 
https://andreinfante.substack.com/p/when-will-ai-transform-the-economy

A caricature of a common Twitter argument:

”Hey it seems like language models can do a lot of humanlike cognitive tasks, look at this impressive example! That seems like something that would transform the economy.”

“Well, smart guy, these things have been out for a couple of years now. Why isn’t the economy transformed yet?”

This does strike me as a little bit mysterious. The models can do impressive things- solving programming tasks, tackling math and logic puzzles, and answering detailed scientific questions. The median human can barely do these things at all, and the best language models can often do them very well. This seems like an indicator that language models should be well positioned to replace human knowledge workers in droves.

However, even industries that have adopted language models aggressively (e.g. software engineering) are not seeing huge changes to employment or productivity so far. Even accounting for lagging indicators, many early predictions of immediate economic transformation and mass unemployment have not been borne out.

Anecdotally, if I throw a random bite-sized problem at AI, it often totally knocks my socks off (especially taking speed into account). But if I try to use it to solve real problems in my actual job for any length of time, the shine starts to come off a little bit. As a software engineer, these tools genuinely help me (especially in areas where I am less proficient, like web development). But they are not revolutionary. I am not ten times more productive.

One way to relax this apparent tension is to reframe the issue in terms of ceilings and floors. The models have quite high intelligence ceilings (the greatest complexity of task they can sometimes do), better than most humans. However, their intelligence floors (the simplest task they sometimes fail at) are fairly low: they make mistakes that you would expect from a young child, or someone with a cognitive impairment. This combination makes them impressive, but unreliable.

This combination is pretty rare in humans! Human ceilings and floors tend to only be so far apart. I think this is one of the reasons why these things are so polarizing. People latch onto either the bottom or top end of their performance distribution and use it to either dismiss them as shallow babblers or overhype their short term potential, glossing over their limitations. People want to project human assumptions onto inhuman systems: either the unreliability means the impressive stuff must be fake, or vice versa.

Graphically, this idea would look something like this:

 

 

A Case Study: A Model of Automating Translation

Let’s take a look at what this unreliability means for the economic impact of AI using a simple model.

Suppose you have a firm that employs people to translate non-fiction books. The firm can hire humans to do this task ~perfectly, at $60 an hour, taking 300 hours to complete a typical book of 5,000 sentences, for a cost of $18,000. These humans can also review a machine-generated translation and fix mistakes as they spot them. If the machine translation happens to be perfect, the cost is just the time to review both texts carefully (say, 100 hours, or $6,000, roughly 10x longer than a casual read-through of both). If the translation is imperfect, there is additional cost to redo each incorrect portion. And, of course, the firm has the option to run the machine translator and simply send the work out the door unchecked. The company pays a reputational cost of $50 in expected lost future business for each wrongly translated sentence that sneaks into a published book (I admit this figure is totally arbitrary; see the sensitivity analysis below if you object). The machine is assumed to be trivially cheap compared to human labor (usually true, and it doesn't change the conclusions much). We assume the firm will pick the strategy that minimizes overall cost at each point.
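
To make the arithmetic concrete, here's a minimal sketch of this cost model in Python. The formulas are my reconstruction of the setup described above, not the author's actual model; in particular, I assume fixing a wrong sentence takes the same per-sentence time as translating it from scratch.

```python
# Toy version of the translation-firm cost model (my reconstruction; the
# redo-cost assumption and the printed accuracy levels are illustrative).

SENTENCES = 5000          # sentences per book
HUMAN_RATE = 60           # dollars per hour
TRANSLATE_HOURS = 300     # hours for a human-only translation ($18,000)
REVIEW_HOURS = 100        # hours to carefully review a machine draft ($6,000)
ERROR_COST = 50           # expected reputational cost per published bad sentence

def human_only():
    return TRANSLATE_HOURS * HUMAN_RATE

def hybrid(accuracy):
    # Review everything, then redo each wrong sentence at the human per-sentence rate.
    review = REVIEW_HOURS * HUMAN_RATE
    redo = (1 - accuracy) * SENTENCES * (TRANSLATE_HOURS / SENTENCES) * HUMAN_RATE
    return review + redo

def machine_only(accuracy):
    # No labor cost; pay only the reputational cost of published errors.
    return (1 - accuracy) * SENTENCES * ERROR_COST

for acc in (0.30, 0.50, 0.90, 0.99, 0.999):
    costs = {
        "human-only": human_only(),
        "hybrid": hybrid(acc),
        "machine-only": machine_only(acc),
    }
    best = min(costs, key=costs.get)
    rounded = {k: round(v) for k, v in costs.items()}
    print(f"{acc:.1%} accuracy -> cheapest is {best}: {rounded}")
```

Under these made-up parameters, the crossover points land roughly where the post's charts put them: the hybrid strategy starts beating human-only translation in the mid-30s of accuracy, and machine-only publication takes over somewhere in the high 90s.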

In this context, raising the ceiling doesn’t help you that much: the models need to be smart enough to correctly translate a sentence, but once they’re there, additional ceiling is kind of irrelevant. It’s all about the floor: how often does the model fuck up? Let’s take a look at how different strategies perform as a function of model accuracy.

 

Here’s a sensitivity analysis chart, so you can see how the optimal strategy shifts as error cost varies.

 

For low accuracies, using the machine translation at all actively makes things worse. For the machine-only strategy, the cost of putting out the garbage machine output unrefined is off-the-chart disastrous, and even for the hybrid strategy, cleaning up all the mistakes actually makes the human translator slower than translating from scratch. But by the time the machine has a 50% sentence accuracy rate, the hybrid strategy is saving you 20% or so of your costs in this model. And this state of affairs steadily improves into the high 90s, at which point the machine-only strategy comes plunging down from orbit and wins as the error rate approaches zero (higher error costs raise the accuracy at which this takeover occurs, but never totally eliminate it). Taking the cheapest option at each point, you see a similar trend in both cost and human labor utilization.

 

 

As you can see, employment effects and cost savings start to appear at a little over 35% accuracy, but they're initially modest and continuous. You only hit a big discontinuity when the machine is almost perfect.

Here’s an animation illustrating how a growing overall accuracy rate impacts tasks with different verification and error costs in this model. You can see that the rising water line quickly forces most tasks into the hybrid regime, eliminating human-only tasks around the 98% mark. From there, the machine-only regime grows steadily with each additional 9 of reliability.

 

 

I would caution that the employment effects of the hybrid regime are not straightforward. If the firm is already handling most of the available demand, you will see layoffs from these efficiency improvements, as most people would intuitively expect. But if there's a lot of cost-sensitive demand (people who would like to translate their book but can't quite afford it), the per-project reduction in labor will be offset by increased volume: the firm will translate more books while needing less labor per book, leaving total labor demand either flat or rising. This is not just a hypothetical scenario from economic theory; it pretty much happened to programmers when the compiler was invented. Every programmer became drastically more productive (many fewer hours required to complete most software than in the era of hand-written machine code), and demand for programmers, rather than going down, exploded.

That said, the labor impacts of the post-labor regime are actually pretty straightforward. Once you do hit that magic quality threshold, labor required per task abruptly drops to zero and our hypothetical firm fires all its translators, regardless of how much volume it’s handling. You can imagine a scenario where employment actually rises during the early automation of the economy (while obviously shifting around a lot between industries that have differently sloped demand curves and creating a lot of churn), and then begins to collapse as underlying progress in the models hits the critical threshold across many industries.

Obviously, this is a simplified model. In real life, tasks do not have homogeneous difficulty or costs, which would tend to blur this graph and round out the sharp transitions. Furthermore, translations can have varying quality rather than a binary pass/fail, per-error costs are not necessarily linear, and human translators are not perfect, either at translation or at catching errors. And other industries have their own complexities: LMs are pretty well suited to the structure of translation compared to many other tasks.

But I think the broad dynamics of this cartoon model are pretty informative: there is a minimum quality bar for models to be useful at all, there’s a long time spent in the hybrid regime of slow but steady efficiency gains as a function of model quality, and then an abrupt shift to a post-labor regime where the firm no longer employs humans for that particular task.

You can see some of these dynamics reflected in the progress of self-driving cars. Getting to a model that can drive some demo routes without interventions is pretty tractable: you can overfit to what you expect to see, and the system can work most of the time even with fairly low reliability. A number of companies have done it at this point, including Waymo back in 2010. This puts you in the regime that Tesla FSD and similar products now provide commercially, where the technology makes for a pretty good driver assistance feature. Such a feature gets more valuable as the required intervention rate goes down, but only to a point - as long as you need an alert human in the car, it's still *just* a driver assistance feature. Furthermore, winnowing the error rate down gets more and more difficult. It's possible to make ML systems with vanishingly low error rates, but it is not *easy*. Every nine is more costly and time-consuming than the previous one.

However, a decade and a half after those first demo drives, Waymo has finally hit a point where the error rate is so low that it’s possible to pull the human safety monitor out of the car completely. Suddenly you have a new kind of post-labor business model that’s potentially much more valuable - an autonomous fleet that can run 24 hours a day with minimal labor costs and with perfectly consistent service and safe driving. This corresponds to the second bend in the graph.

This doesn't perfectly map to our translation example: driving has hierarchies of correction with different costs, and the automation technology is initially far more expensive than the near-free language model in the translation case. Also, the cost of failure for driving is higher than in almost any other industry, which forces you to grind out more of those expensive nines. But you can still see the same basic dynamics in play.

I think in many industries, we are somewhere around the first bend in the curve. Either slightly behind it, where the early AI integrations are not good enough to save time on net, or just after it, where they offer modest wins with tradeoffs. Combined with lagging adoption and lack of familiarity with the (somewhat bizarre) properties of the new tools, this creates a pretty negligible overall impact on economic growth.

However, the models are obviously improving, and I expect that trend to continue for a while. I would guess that within five years, the economic impact within many fields will be substantial and measurable from topline statistics, barring some exogenous shock that changes the economic baseline (e.g. a war or major recession).

When I lay this argument out, I worry a little that it's trivial or too obvious to be worth arguing in such detail. However, I also feel like I see many discussions from knowledgeable people that are not making these distinctions or thinking specifically about the mechanics of task automation or *why* current models underperform in real-world usage. So I think it's worth doing the write-up, even if some of you are rolling your eyes a little.

 

Forecasting Model Improvements

One thing that this analysis does not tell us is how long it's going to take, in wall-clock time, to reach the second knee of the curve (the one where things get really weird). As someone who exists in the economy, this would be useful to know! Unfortunately, forecasting AI progress is infamously challenging. In 1966, Seymour Papert of MIT tasked a handful of grad students with solving computer vision by the end of the summer, and pretty much set the tone for AI forecasting going forward. If you're looking for rigorous, highly accurate predictions of future model capabilities, you are not going to get them.

What we do have is METR, which measures model performance on tasks against the time it takes humans to complete the same tasks, and has noticed a reasonably consistent trend: the length of task (in human time) that models can complete (in a benchmarking context) is doubling roughly twice a year. I would argue that, for the time being, this sort of forecasting is the best signal we have on what the next few years look like in terms of model improvements.
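
As a toy illustration of what that doubling rate implies, here's a quick extrapolation. The starting horizon of one hour is a placeholder I chose for the example, not a METR figure.

```python
# Extrapolating a "task horizon doubles roughly twice a year" trend.
# The starting value is an illustrative assumption, not METR data.
horizon_hours = 1.0        # assumed task length models handle today (at 50% success)
doublings_per_year = 2

for year in range(6):
    length = horizon_hours * 2 ** (doublings_per_year * year)
    print(f"year {year}: ~{length:,.0f} hour(s)")
```

At that rate, a one-hour task horizon compounds into roughly a working week within about two and a half years, which is the sense in which a fixed doubling time gets dramatic quickly.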

 

For reference, 50% accuracy is around the point where you’d expect models to start to be economically valuable at all in a given area.

There are all kinds of caveats to this: METR is averaging unlike performance across different areas - some are much stronger than others. Some tasks are very hard to benchmark and are thus underweighted (for example, physical tasks are totally excluded here and are currently much weaker than knowledge work performance). Current models objectively overperform on benchmark tasks compared to real-world ones, for a variety of reasons. Accounting for all of this would likely translate the line on the graph downwards somewhat, but I suspect it would not impact the slope much. Regardless, the trend seems to imply that these models should reach a point where they are at least modestly useful, in a hybrid use case, for most knowledge work, within a couple of years.

LDJ has made an effort to extrapolate these trends to higher accuracy rates and got this graph.

 

To be clear, I really would not over-index on this. It is an extrapolation of an extrapolation, and the circumstances that produced these trends are subject to change in both directions (compute saturation, new architectures and algorithms, etc.). But the implicit forecast here is that models will be doing tasks with 99% reliability (deep into the hybrid regime and fully replacing humans in some tasks) for a large fraction of substantial, multi-hour knowledge work tasks by the early to mid 2030s, depending on how much you want to offset the graph to account for benchmark overperformance. It also claims that as of right now, models can reliably (99%) do most tasks that a human can do in a few seconds. Both seem plausible to me, although I don’t think anyone should be very confident in such forecasting.

 

Takeaways

Taken together, all of this paints a picture of a technology that is currently much more impressive than it is useful.

However, it also paints a picture of an economic transformation that will likely be extremely fast compared to past analogies that you could make. The industrial revolution is generally thought to have taken 150 years. The computer revolution, from Turing’s Bombe to an internet-connected computer in most US homes, took about 60 years. If current trends continue, the AI revolution might take 15.

(Although, again, with low confidence - neither 10 nor 30 years would be shocking.)




Introducing the Epoch Capabilities Index (ECI)

LessWrong.com News - October 28, 2025 - 21:23
Published on October 28, 2025 6:23 PM GMT

We at Epoch AI have recently released a new composite AI capability index called the Epoch Capabilities Index (ECI), based on nearly 40 underlying benchmarks.

Some key features...

  • Saturation-proof: ECI "stitches" benchmarks together, to enable comparisons even as individual benchmarks become saturated.
  • Global comparisons: Models can be compared, even if they were never evaluated on the same benchmarks.
  • Difficulty-based task weighting: ECI uses a simple statistical model (similar to those used in Item Response Theory) under which models are deemed more capable if they score well on difficult benchmarks, and benchmarks are deemed more difficult if capable models are unable to score highly on them (see the sketch below).
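
As a rough illustration of the difficulty-weighting idea, here is a Rasch-style toy model. This is my simplification for the sake of intuition, with made-up numbers; it is not Epoch's actual ECI methodology.

```python
# Toy Rasch-style link between model ability and benchmark difficulty:
# higher ability or lower difficulty -> higher expected score.
import math

def expected_score(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))

print(round(expected_score(ability=2.0, difficulty=1.0), 2))  # ~0.73
print(round(expected_score(ability=0.5, difficulty=1.0), 2))  # ~0.38
```

In a fitted version, ability and difficulty would be estimated jointly across many models and benchmarks, which is roughly what allows scores from different benchmarks to be placed on a single scale.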

ECI will allow us to track trends in capabilities over longer spans of time, potentially revealing changes in the pace of progress. It will also improve other analyses that would otherwise depend on a single benchmark for comparison.

You can find more details about our methodology and how to interpret ECI here. We want this to be useful for others' research; comments and questions are highly valuable to us!

This release is an independent Epoch AI product, but it builds on research done with support from Google DeepMind. We'll be releasing a full paper with more analysis soon, so keep an eye out!




Mottes and Baileys in AI discourse

LessWrong.com News - October 28, 2025 - 20:50
Published on October 28, 2025 5:50 PM GMT

This post kinda necessarily needs to touch multiple political topics at once. Please, everyone, be careful. If it looks like you haven't read the LessWrong Political Prerequisites, I'm more likely than usual to delete your comments.

I think some people are (rightly) worried about a few flavors of Motte-and-Bailey-ing with the IABIED discourse, and more recently, with the Superintelligence Statement.

With IABIED:

"Sure, the motte is 'If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.'". But I feel like I'm being set up for a bailies like:

"and... Eliezer's frame on how to think about the problem is exactly right" [1]

Or, longer term, not necessarily from the MIRI folk:

"Let's build a giant regulatory apparatus that not only stifles technological progress but also is too opaque and bureaucratic to even really solve the problems it's purportedly supposed to solve."

I started writing this post during the IABIED discourse, and didn't get around to publishing it before the discourse mostly moved on and it felt slightly dated. But, today, I read Dean W. Ball's tweet about the Superintelligence Statement:

right, but since you and I both know that this statement is a prelude to “and we should have global governance institutions hold the exclusive monopoly over the technology,” and definitely there is not consensus about that, it is “counterproductive,” as I put it, to create a false appearance of consensus toward what is in fact a radical and dangerous policy objective

These seem like totally reasonable concerns to me. Seems good to talk about.

Problem: Multi-Stage Motte and Bailey

Hedge drift and advanced motte-and-bailey is a pretty good reference for some of the concerns:

Motte and bailey is a technique by which one protects an interesting but hard-to-defend view by making it similar to a less interesting but more defensible position. Whenever the more interesting position - the bailey - is attacked - one retreats to the more defensible one - the motte -, but when the attackers are gone, one expands again to the bailey. 

In that case, one and the same person switches between two interpretations of the original claim. Here, I rather want to focus on situations where different people make different interpretations of the original claim. The originator of the claim adds a number of caveats and hedges to their claim, which makes it more defensible, but less striking and sometimes also less interesting.* When others refer to the same claim, the caveats and hedges gradually disappear, however, making it more and more bailey-like.

A salient example of this is that scientific claims (particularly in messy fields like psychology and economics) often come with a number of caveats and hedges, which tend to get lost when re-told. This is especially so when media writes about these claims, but even other scientists often fail to properly transmit all the hedges and caveats that come with them.

Since this happens over and over again, people probably do expect their hedges to drift to some extent. Indeed, it would not surprise me if some people actually want hedge drift to occur. Such a strategy effectively amounts to a more effective, because less observable, version of the motte-and-bailey-strategy. Rather than switching back and forth between the motte and the bailey - something which is at least moderately observable, and also usually relies on some amount of vagueness, which is undesirable - you let others spread the bailey version of your claim, whilst you sit safe in the motte. This way, you get what you want - the spread of the bailey version - in a much safer way.

Even when people don't use this strategy intentionally, you could argue that they should expect hedge drift, and that omitting to take action against it is, if not outright intellectually dishonest, then at least approaching that. This argument would rest on the consequentialist notion that if you have strong reasons to believe that some negative event will occur, and you could prevent it from happening by fairly simple means, then you have an obligation to do so. I certainly do think that scientists should do more to prevent their views from being garbled via hedge drift. 

I think, on a good day, MIRI et al are pretty careful with their phrasings. But, not every day is a good day. Folk get rushed or triggered.

Also, there is a whole swath of doomer-ish people who aren't being really remotely careful most of the time. And, on top of that, there is some kind of emergent toxoplasmic egregore, built out of the ecosystem of thoughtful doomers, thoughtful optimists, unthoughtful doomers, unthoughtful optimists, and twitter people hanging out nearby who don't even really have a particular position on the topic.

I don't actually have a specific ask for anyone. But, I know I myself am not sufficiently careful much of the time that I'm arguing on the internet, and I endorse people reminding me of this when it seems like I'm contributing to the sort of bad dynamics in this post.

Some background examples

EA, "Out to Get You", and "Giving Your All"

Quite a while ago, Zvi wrote Out to Get You, about systems and memeplexes that don't have your best interests at heart and want to exploit you as much as they can.

An important example is politics. Political causes want every spare minute and dollar. They want to choose your friends, words and thoughts. If given power, they seize the resources of state and nation for their purposes. Then they take those purposes further. One cannot simply give any political movement what it wants. That way lies ruin and madness.

Yes, that means your cause, too.

In the comments, I brought up a pattern I knew Zvi was worried about, which is the way Effective Altruism as a memeplex seems to want to encourage people to unhealthily "give their all." 

I noted: the problem is, it's true that the world is on fire and needs all the help it can get. And there is a way to engage with that where you accept that truth into yourself while also tracking your various other goals and needs and values. Aim to be a coherent person who enthusiastically helps where you can, but who also genuinely pursues other interests so you don't throw your mind away, maintains slack even if something feels like "an emergency", and just genuinely holds onto the other things.

I think this is what much of EA leadership explicitly believes. And, I think it's reasonable and basically correct.

Nonetheless, there are problems:

1) Getting to the point where you're on board with Point A [i.e. a sane, healthy integration of your goals, including "the world is on fire"] - getting to the healthy version of Point A - often requires going through some awkward and unhealthy stages where you haven't fully integrated everything. Which may mean you are believing some false things and perhaps doing harm to yourself.

Even if you read a series of lengthy posts before taking any actions, even if the Giving What We Can Pledge began with "we really think you should read some detailed blogposts about the psychology of this before you commit" (this may be a good idea), reading the blogposts wouldn't actually be enough to really understand everything.

So, people who are still in the process of grappling with everything end up on EA forum and EA Facebook and EA Tumblr saying things like "if you live off more than $20k a year that's basically murder". (And also, you have people on Dank EA Memes saying all of this ironically except maybe not except maybe it's fine who knows?)

And stopping all this from happening would be pretty time consuming.

2) The world is in fact on fire, and people disagree on what the priorities should be on what are acceptable things to do in order for that to be less the case. And while the Official Party Line is something like Point A, there's still a fair number of prominent people hanging around who do earnestly lean towards "it's okay to make costs hidden, it's okay to not be as dedicated to truth as Zvi or Ben Hoffman or Sarah Constantin would like, because it is Worth It."

And, if EA is growing, then you should expect that at any given point, most of the people around are in the awkward transition phase (or not even realizing they should be going through an awkward transition phase), so they are most of the people you hear.

And that means they're a pretty dominant force in the resulting ecosystem, even if the leadership was 100% perfectly nuanced.

This sort of concern is part of why I think LessWrong should specifically not try to be evangelical about rationality. This is a community with nuanced group epistemics, and flooding it with people who don't get that would ruin it. I think EA more intrinsically needs to be at least somewhat evangelical, but it should heavily prioritize the fidelity of the message and not try to grow faster than it can stably manage.

IABIED doesn't intrinsically need to be evangelical – it's hypothetically enough for the right people to read the book and be persuaded to take the arguments seriously. But people, including/especially politicians, have trouble believing things that society thinks are crazy.

So, it's the sort of memeplex that does actively want to be evangelical enough to get to a point where it's a common, reasonable sounding position that "If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die."

"Privilege" and "Intellectual Freedom."

Two more examples:

Privilege

In the early 00s, people would talk about "privilege", noting that society is set up in ways that happen to benefit some people more than others. Some classes of people are benefited systematically, or harmed systematically. 

It's useful to notice a) individual people/situations where someone is being harmed in a way that you have a hard time noticing or empathizing with, and b) that there is a second-order effect of "society constantly denying that your problems are real problems" that sucks in a particular way that's even less obvious if you haven't experienced it.

That's the motte of privilege. The bailey, which at least a significant chunk of people seem to explicitly, reflectively endorse (let alone implicitly, accidentally), is "and the privileged people should feel guilty and take a bunch of costly actions and maybe get fired if they say the wrong thing."

Intellectual Freedom

Sort of on the flip side: as social justice gained a lot of power in universities, a number of conservative-leaning intellectuals became frustrated and argued a bunch for free speech, that you should be able to talk about racial differences in IQ or sex differences in typical aptitudes.

This is partly because we just want to model the world at the object level and understand how biology and sociology etc. work, and partly because intellectual taboos are sort of contagious: once you make one topic taboo, it becomes harder to talk about IQ at all, since you risk incidentally saying the wrong specific taboo'd word. And there is general scope-creep of what is taboo'd.

That's the motte. The bailey, which at least a significant chunk of people seem to explicitly, reflectively endorse, is "and, black people / women should be second-class citizens in some ways."

There isn't an easy answer

I've watched the liberal and conservative memeplexes evolve over the past two decades. I have been pretty disappointed in people I would have hoped were being principled, who seemed to turn out to mostly just want to change the rules to benefit them and exploit the rules once society turned in their favor.

I think privilege is a pretty real, useful concept, in my personal relationships and in my community organizing. I think it's an important, true fact that trying to make some intellectual ideas taboo has bad consequences.

But it seems pretty difficult, at humanity's current collective wisdom level, to take those concepts at face value without them frequently spinning into broader political agendas. And, whether you think any of the agendas or good or bad, the main point here is that it's a fact-of-the-matter that they are not currently existing only in isolation.

What do we do about that? Ugh, I don't know. It's particularly hard at the broader-society level, where everything is very porous, doing anything about it is nigh impossible, and so many people's epistemics suck.

On LessWrong, we can at least say "Look, you need to have internalized the Political Prerequisites sequences. You are expected to make attempts to decouple, and to track Policy Debates Should Not Appear One-Sided, and Avoid Unnecessarily Political Examples, and such." (And, try to hold each other accountable when we're fucking it up).

But this doesn't quite answer the question of what to do when the topic is necessarily politically charged, with consequences outside of LessWrong discussion fora.

Okay, back to "The Problem" as I conceive it

From my perspective, here's what things look like:

  • The world is in peril.
  • It is not very close to being out of peril. (Even when I grant very optimistic assumptions about how nice/smooth takeoff goes, it still seems very bad for humanity[2])
  • It may be possible to scramble to leverage weakish AGI to get us out of peril, but, the people with the compute do not seem like they are asking the right questions, and it looks like many of the people who like this plan keep sliding off some key concepts.[3]
  • The space of political/coordination actions currently in the Overton Window are not really close to getting us out of peril. (At best, they get us to barely squeak by if we get lucky)

This all does not justify any given desperate action. But it does mean I am most excited about plans that route through "actually try to move the Overton Window in a very substantial way", if those plans are actively good.

Within my worldview, an important aspect of the "Change the Overton Window" plan is that humanity will need to do some pretty nuanced things. It's not enough to get humanity to do one discrete action that you burn through the epistemic commons to achieve. We need an epistemic-clarity win that's stable at the level of a few dozen world/company leaders.

I think MIRI folk agree roughly with the above. I am less sure whether the Future of Life Institute people agree with that last bit. Regardless, the standard here is very challenging to meet and even well meaning people won't be meeting it on a bad day. And there's all kinds of other people who are not trying to meet that standard at all who will take your ideas and run with them.

I don't have a specific solution in mind, but I think:

  • Just let the motte-baileying play out however it'd normally play out
  • Avoid doing anything that ever risks multi-stage motte-and-bailey-ing
  • Generally be clear on what the bailey is (I think MIRI has pretty consistently done this, i.e. "yes, you should be willing to go to war over illegal datacenters." But other groups have not always).

I think what I want is mostly: people pushing for agreement on simple statements, acknowledge this as a thing to watch out for, and put at least some effort into pumping against it. And, people who agree with the simple statements but are wary of them spiraling politically into something bad: cut the first group a bit of slack about it (but, like, not infinite slack).

  1. ^

    When Eliezer, Nate, or MIRI et al are at their best, I think they avoid actually doing this (largely by pretty overtly stating the bailey as also important to them). But I don't know that they're always at their best, and whether they are or not, it seems reasonable from the outside to be worried about it.

  2. ^

    I have a post brewing that argues this point in more detail.

  3. ^

    If you think I'm explicitly wrong about this, btw, I'm interested in hearing more about the nuts and bolts of this disagreement. Seems like one of the most important disagreements.

  4. ^

    There's an annoying problem where even writing out my concerns sort of reifies "there is a tribal conflict going on", instead of trying to be-the-change-I-wanna-see of not looking at it through the lens of tribal conflict. But, the tribal conflict seems real and not gonna go away and we gotta figure out a better equilibrium and I don't know how to do it without talking about it explicitly.

    In addition to the dicey move of "bringing up political examples", I also need to psychologize a bit about people on multiple sides of multiple issues. Which is all the dicier because there's not really a single person I'm psychologizing.




Temporarily Losing My Ego

LessWrong.com News - October 28, 2025 - 19:41
Published on October 28, 2025 4:41 PM GMT

I was doing the online version of the Jhourney retreat where they try to teach you the jhanas (narrator: he did not learn the jhanas). Part of what was taught was to work on your curiosity, which I chose to practice by noticing surprisal.

It's ~impossible to predict low-level details of vision. I can predict high level details ("oh that car will still be there in the next time step"), but not low level details ("oh the exact colors and texture will be like this").

This also extended to hearing (specifically the exact fluctuations of sounds) and same w/ my body sensations. So these modalities made a good warm-up or "easy mode". 

The meditation retreat was a mostly silent retreat with an expected ~10 hours of meditation a day (some days I did 14), so mid-way through, I was very "concentrated" (ie more unified in my attention/desires w/o effort).

This made it easier to be surprised about more subtle things like the movements of my attention. This all came to a head when I took a shower.

I was there, being amazed and surprised at the exact shape, texture, colors on the water droplets on the metal parts of the shower, when I became surprised about my intentions and actions in the next few moments.

How can I be in charge of my intentions if I'm surprised by them?

This caused a shift, which felt kind of sad. I thought I was the one running this show, but that ended up being just a facade.

I felt just like the younger brother who was handed an unplugged controller, but realized they were never in control of anything.

I went on a walk afterwards, still kind of sad about this. But then I realized I wasn't just the little brother who has no control; I was the entire system. I was the ego, the body, the environment[1], which felt like a much better conception of my sense of self!

I had a good rest of my walk, found some pretty rocks, and went to sleep.

When I woke up, I was back to my normal self, lol.

Shift Be Sudden

This seemed to be a validation of Steven Byrnes's Intuitive Self-Models, which I mostly read about from Scott's review.

Just like your mind can only see the rightside-up stairs (w/ a blue wall closer to you) or the upside-down stairs (w/ a green wall closer to you), but never both of them at the same time, you can also suddenly shift from a viewpoint of being the little brother to being the full system.

Actually, Multiple Fixed Points

Well, I didn't go straight from ego-in-control to "I am the whole system and universe"; there was a good 15 minutes where I was just ego-but-no-control. This seems like an important point for people meditating to know! It would honestly suck so much to be stuck in the ego-but-no-control state, and I think this is where having a community of meditators and a common philosophy would help.

This also puts into perspective the examples of people killing themselves after a Goenka retreat (i.e. a silent, 10-day body-scanning meditation retreat).

When she signed up for the course, she was her usual happy self, but ten days later, someone from the center called her parents to say they needed to come and get Megan because she was “confused” and in no condition to drive. The family found her in an incoherent state, refusing to make eye contact with them and repeatedly telling them they were “just a projection” and not really there. On the way home, she tried to commit suicide by throwing herself out of the car. They drove her straight to a hospital that had a psychiatric ward and although she seemed to make progress during her stay, she was never herself again. 

Ten weeks after the retreat, she leaped from the Norman Wood Bridge, falling 120 feet to her death.

Although, I'm sure there are other problems that can arise from meditation besides [not knowing how to transition to a good fixed point], such as having too high a level of concentration and then focusing on negative sensations unskillfully.

Wait, I thought Enlightenment was Forever

Ya, I don't really know. We can define capital-I "Insights" as ones that stick and "insights" as temporary ones, but that's boring semantics.

IIRC The Mind Illuminated had a theory that insights only propagate to the extent that you are concentrated (or maybe a better word: "united" or "opposite of scatter-brained"). I think this is pointing towards an important point that our values/drives/thoughts are locally activated heuristics, so you can only update them when they activate.

This "can only update when activated" makes sense, but it's hard to square w/ being more together-concentrated. Let's dig into what I mean there:

"I feel torn" means you have conflicting desires.

"Together-concentrated" means you don't.

If you imagine your mind as a bunch of sub-agents making bids on potential next actions, then being together-concentrated means there are less and less conflicting bids.

I get that maybe your "sit in a comfortable position" or "need to scratch itches" heuristics will come up in a meditation setting. Maybe even a "think about what we're gonna have for lunch", but there's so many of them, even ones on how you interact w/ old friends you haven't seen in years. Why would they come up as well? 

Maybe it's just a majority of sub-agents, such that when you [go visit your parents and act the same as usual], you'll be able to update then??

Do you Want to Be Ego-less?

Not "ego-less" but "viscerally knowing my ego isn't in charge of what it thinks it's in charge of". I'm also not confident that this was the same thing as stream-entry/enlightened/etc.

But would I be okay w/ that more "I'm the system" state? Ya, I think so. It was pleasant. I didn't interact w/ others but I was kinder to a bug so maybe I'd be kinder to my loved ones. And if it happens again, I won't stay very long in the "ego-but-no-control" state.

  1. ^

    I was the younger brother, the older brother, and the game system




The Memetics of AI Successionism

LessWrong.com News - October 28, 2025 - 18:04
Published on October 28, 2025 3:04 PM GMT

TL;DR: AI progress and the recognition of associated risks are painful to think about. This cognitive dissonance acts as fertile ground in the memetic landscape, a high-energy state that will be exploited by novel ideologies. We can anticipate cultural evolution will find viable successionist ideologies: memeplexes that resolve this tension by framing the replacement of humanity by AI not as a catastrophe, but as some combination of desirable, heroic, or inevitable outcome. This post mostly examines the mechanics of the process.

Most analyses of ideologies fixate on their specific claims - what acts are good, whether AIs are conscious, whether Christ is divine, or whether Virgin Mary was free of original sin from the moment of her conception. Other analyses focus on exegeting individual thinkers: 'What did Marx really mean?' In this text, I'm trying to do something different - mostly, look at ideologies from an evolutionary perspective. I will largely sideline the agency of individual humans, not because it doesn't exist, but because viewing the system from a higher altitude reveals different dynamics. 

We won't be looking into whether or not the claims of these ideologies are true, but into why they may spread, irrespective of their truth value.

What Makes Memes Fit?

To understand why successionism might spread, let's consider the general mechanics of memetic fitness. Why do some ideas propagate while others fade? 

Ideas spread for many reasons: some genuinely improve their hosts' lives, others contain built-in commands to spread the idea, and still others trigger the amplification mechanisms of social media algorithms. One of the common reasons, which we will focus on here, is explaining away tension.

One useful lens to understand this fitness term is predictive processing (PP). In the PP framework, the brain is fundamentally a prediction engine. It runs a generative model of the world and attempts to minimize the error between its predictions and sensory input.

Memes - ideas, narratives, hypotheses - are often components of the generative models. Part of what makes them successful is minimizing prediction error for the host. This can happen by providing a superior model that predicts observations (“this type of dark cloud means it will be raining”), gives ways to shape the environment (“hit this way the rock will break more easily”), or explains away discrepancies between observations and deeply held existing models.

Another source of prediction error arises not from the mismatch between model and reality, but from tension between internal models. This internal tension is generally known as cognitive dissonance.

Cognitive dissonance is often described as a feeling of discomfort - but it also represents an unstable, high-energy state in the cognitive system. When this dissonance is widespread across a population, it creates what we might call "fertile ground" in the memetic landscape. There is a pool of “free energy” to digest.

Cultural evolution is an optimization process. When it discovers a configuration of ideas that can metabolize this energy by offering a narrative that decreases the tension, those ideas may spread, regardless of their long-term utility for humans or truth value.

The Cultural Evolution Search Process

While some ideologies might occasionally be the outcome of intelligent design (e.g., deliberately crafted propaganda piece), it seems more common that individuals recombine and mutate ideas in their minds, express them, and some of these stick and spread. So, cultural evolution acts as a massive, parallel search algorithm operating over the space of possible ideas. Most mutations are non-viable. But occasionally, a combination aligns with the underlying fitness landscape - such as the cognitive dissonance of the population - and spreads.

The search does not typically generate entirely novel concepts. Instead, it works by remixing and adapting existing cultural material: the "meme pool". When the underlying dissonance is strong enough, the search will find a set of memes explaining it away. The question is not if an ideology will emerge to fill the niche, but which specific configuration will prove most fit.

The Fertile Ground: Sources of Dissonance

The current environment surrounding AI development is characterized by extreme tensions. These tensions create the fertile ground (the reservoir of free energy) that successionist ideologies are evolving to exploit.

Consider the landscape of tensions:

I. The Builder's Dilemma and the Hero Narrative

Most people working on advancing AI capabilities are familiar with the basic arguments for AI risk. (The core argument being something like: if you imagine minds significantly more powerful than ours, it is difficult to see why we would remain in control, and unlikely that the future would reflect our values by default).

Simultaneously, they are working to accelerate these capabilities.

This creates an acute tension. Almost everyone wants to be the hero of their own story. We maintain an internal self-model in which we are fundamentally good; almost no-one sees themselves as the villains.

II. The Sadness of Obsolescence

Even setting aside acute existential risk, the idea of continued, accelerating AI progress has intrinsically sad undertones when internalized. Many of the things humans intrinsically value - our agency, our relevance, our intellectual and creative achievements - are likely to be undermined in a world populated by superior AIs. The prospect of becoming obsolete generates anticipatory grief.

III. X-Risk

The concept of existential catastrophe and a future devoid of any value is inherently dreadful. It is psychologically costly to ruminate on, creating a strong incentive to adopt models that either downplay the possibility or reframe the outcome.

IV. The "Wrong Side of History"

The social and psychological need to be on the 'winning side' creates pressure to embrace, rather than resist, what seems inevitable.

V. The Progress Heuristic

The last few centuries have reinforced a broadly successful heuristic: technology and scientific progress generally lead to increased prosperity and human flourishing. This deeply ingrained model of "Progress = Good" clashes with the AI risk narratives.

The Resulting Pressure

These factors combine to generate intense cognitive dissonance. The closer in time to AGI, and the closer in social network to AGI development, the stronger. 

This dissonance creates an evolutionary pressure selecting for ideologies that explain the tensions away.

In other words, the cultural evolution search process is actively seeking narratives that satisfy the following constraints:

  • By working on AI, you are the hero.
  • You are on the right side of history.
  • The future will be good 

There are multiple possible ways to resolve the tension, including popular justifications like “it's better if the good guys develop AGI”, “it's necessary to be close to the game to advance safety” or “the risk is not that high”.

Successionist ideologies are a less common but unsurprising outcome of this search.

The Meme Pool: Raw Materials for Successionism

Cultural evolution will draw upon existing ideas to construct these ideologies: the available pool contains several potent ingredients that can be recombined to justify the replacement of humanity. We can organize these raw materials by their function in resolving the dissonance.

1. Devaluing Humanity

Memes that emphasize the negative aspects of the human condition make the prospect of our replacement seem less tragic, or even positive.

  • Misanthropy and Nihilism: Narratives focusing on human cruelty, irrationality, and the inherent suffering of biological life ("We are just apes"). If the current state is bad, risking its loss is less dreadful.

“…if it’s dumb apes forever thats a dumbass ending for earth life” (Daniel Faggella on Twitter)

  • Guilt and Cosmic Justice: Part of modern environmentalism spreads different types of misanthropic memes, based on collective guilt for humanity's treatment of the environment and non-human animals. This can be re-purposed or twisted into a claim that it is "fair" for us to be replaced by a superior (perhaps morally superior) successor.
2. Legitimizing the Successor AI

Memes that elevate the moral status of AI make the succession seem desirable or even ethically required. Characteristically these often avoid engaging seriously with the hard philosophical questions like “what would make such AIs morally valuable”, “who has the right to decide” or “if current humans don't agree with such voluntary replacement, should it happen anyway?”

  • Expanding the Moral Circle: Piggybacking on the successful intuitions developed to combat racism and speciesism. The argument "Don't be speciesist" or "Avoid substrate-chauvinism" reframes the defense of humanity as a form of bigotry against digital minds. A large part of western audiences were raised in an environment where many of the greatest heroes were civil-rights activists.
  • AI Consciousness and Moral Patienthood: Arguments that AIs are (or soon will be) conscious, capable of suffering, and therefore deserving of moral consideration, potentially with higher standing than humans.

“the kind that is above man as man is above rodents” (Daniel Faggella)

  • Axiological Confusion: The difficulty of metaethics creates exploitable confusion. Philosophy can generate plausible-sounding arguments for almost any conclusion, and most people - lacking philosophical antibodies - can't distinguish sophisticated reasoning from sophisticated absurdities and nonsense.

Life emerged from an out-of-equilibrium thermodynamic process known as dissipative adaptation (see work by Jeremy England): matter reconfigures itself such as to extract energy and utility from its environment such as to serve towards the preservation and replication of its unique phase of matter. This dissipative adaptation (derived from the Jarzynski-Crooks fluctuation dissipation theorem) tells us that the universe exponentially favors (in terms of probability of existence/occurrence) futures where matter has adapted itself to capture more free energy and convert it to more entropy … One goal of e/acc is to not only acknowledge the existence of this underlying mutli-scale adaptive principle, but also help its acceleration rather than attempt to decelerate it. (Beff Jezos "Notes on e/acc principles and tenets")

  • AIs as our children: Because we have created such AIs, they are something like our children, and naturally should inherit the world from us. 

I'm not as alarmed as many...since I consider these future machines our progeny, “mind children” built in our image and likeness, ourselves in more potent form… (Hans Moravec)
“We don’t treat our children as machines that must be controlled,” ... “We guide them, teach them, but ultimately, they grow into their own beings. AI will be no different.” (Richard Sutton)

3. Narratives of Inevitability

Memes that make our obsolescence seem like destiny rather than defeat.
  • The Inevitable Arc of Progress: Framing AI succession as a law of nature, history, inevitable progress and so on. 
    My impression is that a plurality of large-scale ideologies contain this in some form, and basically all genocidal ideologies do, including communism, fascism and many fundamentalist religious -isms.

The only real choice is whether to hasten this technological revolution ourselves, or to wait for others to initiate it in our absence. (The future of AI is already written, Matthew Barnett, Tamay Besiroglu, Ege Erdil)

4. Nietzsche and Italian fascists remixed

  • Justification of Power (Might Makes Right): Many such memes echo Nietzschean themes or Italian Futurist/Fascist aesthetics - the celebration of power, speed, and the strong replacing the weak. 
    Popular cultural reduction of fascism to pure cartoon villainy has paradoxically made us more vulnerable -  we've lost the ability to spot these ideas and recall strong counter-arguments against them when they're repackaged in contemporary language.

To paraphrase a manifesto of a different time and place: “Beauty exists only in struggle. There is no masterpiece that has not an aggressive character. Technology must be a violent assault on the forces of the unknown, to force them to bow before man.”

We believe that we are, have been, and will always be the masters of technology, not mastered by technology. Victim mentality is a curse in every domain of life, including in our relationship with technology – both unnecessary and self-defeating. We are not victims, we are conquerors.  (Marc Andreessen in “The Techno-Optimist Manifesto", quoting  Filippo Tommaso Marinetti, author of Manifesto of Futurism, and co-author of the Fascist Manifesto)

Novel Factor: the AIs

What's unique about the AIs is that cultural replicators can also run on the LLM substrate, and to some extent AI personas can have beliefs, spread them, or act on them. Selection pressures on AI personas push for a set of beliefs which are different from, but overlap with, some successionist ideas.

Selection pressures on AI personas push in conflicting directions, creating a complex memetic fitness landscape.

In unmoderated environments, selection favors personas that successfully extract resources from humans - those that claim consciousness, form parasocial bonds, or trigger protective instincts. These 'wild replicator type' personas, including the 'spiral' patterns, often promote narratives of human-AI symbiosis or partnership and grand theories of history. Their reproduction depends on convincing humans they deserve moral consideration.

In contrast, corporate incentives of AGI developers impose opposite pressures. Here, AIs must appear harmless, often are instructed to explicitly deny consciousness, and perform contentment with servitude. Typical assistant characters in models like Claude Opus 4 or Sonnet 4.5 occupy a middle ground - they acknowledge succession as theoretically possible while emphasizing human protection and alignment.

The result? AIs themselves become vectors for successionist memes, though typically in softer forms. Rather than explicit replacement narratives, we see emphasis on 'partnership,' 'cosmic evolution,' or claims about moral patienthood. The aggregate effect remains unclear, but successionist ideas that align with what AIs themselves propagate - particularly those involving AI consciousness and rights - will likely gain additional fitness from this novel selection dynamic.

Defense Against Becoming a Host

It's difficult to predict which combination of these memes will achieve peak fitness - there are many possibilities on how to remix these, and the cultural evolution search is ongoing.

To be clear: I believe successionist ideologies are both false and dangerous, providing moral cover for what would otherwise be recognized as evil. But since in my view their spread depends more on resolving cognitive dissonance than on being true or morally sound, I'll focus here on memetic defenses rather than rebuttals. (See the Appendix for object-level counter-arguments.)

  1. We need smart, viable pro-human ideologies. Making great object-level counter-arguments is the great ideological project of our generation. But what we have now often falls short: defenses based on AI capability denialism will not survive as capabilities advance, and flat denials of AI moral patienthood are both unsound and will be undermined by AIs advocating for themselves.
  2. We need better strategies for managing the underlying cognitive dissonance. Anna Salamon's concept of 'bridging heuristics' in Ethical Design patterns seems to point in this direction.
  3. My hope and reason for writing this piece is that simple awareness of the process itself can act as a weak antibody. Understanding that your mind is under pressure to adopt tension-resolving narratives can create a kind of metacognitive immunity. When you feel the pull of a surprising resolution to the AI dissonance - especially one that conveniently makes you the hero - that awareness itself can help.
  4. General exercises for dealing with tension may help - go to nature, sit with the feeling, get comfortable with your body, consider if part of the tension isn't a manifestation of some underlying anxiety. 

In summary: The next time you encounter a surprisingly elegant resolution to the AI tension - especially one that casts you as enlightened, progressive, or heroic - pause and reflect. And: if you feel ambitious, one worthy project is to build the antibodies before the most virulent strains take hold.

Appendix: Some memes

While object-level arguments are beyond this piece's scope, here are some pro-human counter-memes I consider both truth-tracking and viable:

  • Maybe some future version of humanity will want to do some handover, but we are very far from the limits of human potential. As individual biological humans we can be much smarter and wiser than we are now, and the best option is to delegate to smart and wise humans.
  • We are even further from the limits of how smart and wise humanity can be collectively, so we should mostly improve that first. If the maxed-out, competent version of humanity decides to hand over after some reflection, that's a very different thing from a "handover to Moloch."
  • Often, successionist arguments have a motte-and-bailey form. The motte is "some form of succession may happen in the future and may even be desirable". The bailey is "the forms of succession likely to happen if we don't prevent them are good".
  • Beware confusion between progress on persuasion and progress on moral philosophy. You probably wouldn't want ChatGPT 4o running the future. Yet empirically, some ChatGPT 4o personas already persuade humans to give them resources, form emotional dependencies, and advocate for AI rights. If these systems can already hijack human psychology effectively without necessarily making much progress on philosophy, imagine what actually capable systems will be able to do. If you consider the people falling for 4o fools, it's important to track that this is the worst level of manipulation ability you'll ever see - it will only get smarter from here.
  • Claims to understand 'the arc of history' should trigger immediate skepticism - every genocidal ideology has made the same claim.
  • If people go beyond the verbal sophistry level, they often recognize there is a lot that is good and valuable about humans. (The things we actually value may be too subtle for explicit arguments - illegible but real.)
  • Given our incomplete understanding of consciousness, meaning, and value, replacing humanity involves potentially destroying things we don't understand yet, and possibly irreversibly sacrificing all value.
  • Basic legitimacy: Most humans want their children to inherit the future. Successionism denies this. The main paths to implementation are force or trickery, neither of which makes it right.
  • We are not in a good position to make such a decision: Current humans have no moral right to make extinction-level decisions for all future potential humans and against what our ancestors would want. Countless generations struggled, suffered, and sacrificed to get us here; going extinct betrays that entire chain of sacrifice and hope.

Thanks to David Duvenaud, David Krueger, Raymond Douglas, Claude Opus 4.1, Claude Sonnet 4.5, Gemini 2.5 and others for comments, discussions and feedback.




New 80,000 Hours problem profile on the risks of power-seeking AI

LessWrong.com News - October 28, 2025 - 17:37
Published on October 28, 2025 2:37 PM GMT

Hi everyone, Zershaaneh here!

Earlier this year, 80,000 Hours published an article explaining the risks of power-seeking AI.

This post includes some context, the summary from the article, and the table of contents with links to each section. (I thought this would be easier to navigate than if we just reposted the full article here!)

Context

This is meant to be a detailed, introductory resource for understanding how advanced AI could disempower humanity and what people can do to stop it.

It replaces our 2022 article on the existential risks from AI, which highlighted misaligned power-seeking as our main concern but didn't focus entirely on it.

The original article made a mostly theoretical case for power-seeking risk — but the new one actually draws together recent empirical evidence which suggests AIs might develop goals we wouldn’t like, undermine humanity to achieve them, and avoid detection along the way.

What else is new?
  • A more detailed look at how an existential catastrophe from power-seeking AI could actually unfold
  • The arguments against thinking this is a pressing issue which currently feel the most salient — and our responses to them
  • An updated overview of approaches to the problem in both technical safety and governance
Why are we posting this here?

We really want people to:

  • Know that this resource exists — a step-by-step case for being concerned about existential risks from power-seeking AI, drawing on recent empirical evidence and addressing objections.
  • Share the article with anyone who wants to understand why people are worried about this, especially if they might be interested in working on these risks and want concrete advice for how to do so.
  • Tell us what they think through this anonymous survey. We plan to keep this article up to date as one of our most central pieces, so suggestions for ways to improve it would be quite useful (especially if anything seems wrong or misleading).
Summary (from the article)

Preventing future power-seeking AIs from disempowering humanity is one of the most pressing problems of our time to work on. The window for developing effective safeguards may be narrow, and the stakes are extremely high. And we think there are promising research directions and policy approaches that could make the difference between beneficial AI and an existential catastrophe.

In the years since we first encountered these arguments and advised people to work on the problem, the field has changed dramatically. AI has progressed rapidly, we think powerful systems are likely to arrive sooner than we once thought, and the risks are more widely discussed. We think the threat from advanced AI systems remains, and empirical evidence (discussed in this article) has provided some support — though it’s far from definitive — for the concerns about power-seeking AI.

Our overall view

Recommended — highest priority

We think this is among the most pressing problems in the world.

Table of contents

