How Should Political Situations Be Classified In Order To Pick The Locally Best Voting System For Each Situation?

Published on December 31, 2025 10:49 PM GMT

Epistemic Status: I'm confused! Let's go shopping! (...for new political systems <3)

I want to write an essay about the actually best voting system, but before I do that I want to get clear on what the desiderata should even naturally or properly or wisely be...

Participation?

Sometimes it is illegal to not vote. You could create a two-day holiday, and have 24-hour emergency workers do shifts but get some time off to go in and be fingerprinted and register their preferences and so on. There could be free money at the polling station for voting, and voting assistants hunting down the people who haven't voted yet.

If you have this system, then "refusing to vote" can never happen.

But also, certain voting systems fail the Participation criterion, such that some people might wish, in retrospect, to have turned in a ballot that says NULL (and makes it possible for the election to fail quorum?) rather than the sincere ballot they actually cast.

On the other hand, if a polity uses a system that FAILS the Participation criterion AND ALSO it forces everyone to vote, then maybe it would be unethical to have forced people through the puppet show of pretending to be able to express their civic preferences without them actually being able to express their civic preferences?

On the gripping hand, maybe if you're trying to boot up a new polity from scratch (as was attempted in Iraq, after George W Bush invaded that country in 2003) maybe you really really really want to incentivize people to vote for a bit just to "get the thing started"? Maybe Participation is super important for building and merging rather than shrinking and splitting? Maybe Metcalfe's Law is relevant to polities? Is bigger always better?

Forking?

Sometimes a country's citizenship is very valuable (the US has a citizenship like this, though it isn't the most valued-in-practice citizenship from the "cost to become a citizen" estimates I can find) and sometimes a country's citizenship is net negative, with people trying to escape. Sometimes a lot of people want to escape all at the same time. Also, maybe certain election results will cause some large faction of citizens to want to exert their right to revolution, and break away? (Or maybe there is no moral right to revolution? Or maybe whether there is a right to revolution is culture dependent?) And so maybe it is a positive feature of an election if "None Of The Above For A Single Polity / Break The Polity In Two With These TWO Leaders" is a possible outcome? Or not? How would we know?

According to The CAP Theorem, if you refuse to allow Forking then you MUST choose between Availability and Consistency in your system design... but when is Forking really bad and when is Forking actually maybe kinda good?

Something I notice: there is very very little attention paid to the "polity merge operation" where two polities might be separate, and both hold elections, and then end up merged at the end, and it somehow goes very smoothly and nicely, because they were, in some sense, already "running the same civic operating system" and that civic operating system is able to fork and merge by design. Maybe if all the US states were running civic operating systems that support this behavior somehow, then maybe the state boundaries wouldn't be fucked beyond belief and very very very far from the naturally good places for them to be?

Sauce.

Objective Evil?

Maybe there are systematic insanities latent in human nature, and the median leader preferred by almost everyone in head-to-head pairwise comparisons would turn out to be someone who is "objectively" very evil, and wants to do something like commit genocide on 15% of the population (or whatever... if you are personally in favor of genocide then imagine I said some other "clear moral evil" that you would see as a violation of Natural Law (or whatever standard you use for deciding if something is ethical or unethical based on a coherent conscience that is distinct from "whatever the fuck you merely feel like you want right now"), but which might also be predictably something that a majority of people in some country would simply want).

If humans just really love to do evil a lot in practice (or certain humans in certain situations?) then selecting their collectively most preferred option "in the middle", the one that "seems common-sensically preferable to most of them" under the Condorcet Criterion, might misfire and reliably generate one evil leader after another.

In practice, in the US, with our POTUS elections, it seems like we reliably get a POTUS that some large fraction of the country really really dislikes. But also, if you look at the polling data and the third party options: if POTUS elections reliably selected the Condorcet Winner from among the top 1000 people who got enough signatures to be in the election, then... NONE of the recent past Presidents would have won, most likely? It would have been a bunch of namby-pamby libertarian environmentalists who believe in civic virtue, self defense, small government, and prosperity, over and over and over.

Maybe "namby-pamby libertarian environmentalists who believe in civic virtue, self defense, small government, and prosperity" is an objectively evil platform for a leader to adopt, and something America should not want to reliably elect over and over? So maybe we shouldn't have POTUS elections that fulfill the Condorcet Criterion? Or maybe I'm wrong about what Condorcet Criterion satisfying leaders would look like here?

Also, maybe different cultures are more or less "objectively good or evil", and only the "evil cultures" should avoid the Condorcet Criterion, whereas the "good cultures" should adopt it? (This would assume some pragmatically relevant variant of moral realism is true, of course, and maybe no variant of moral realism at all, in any form, is true?)

Preference Strengths?

Right now the federal minimum wage in the United States is $7.25 per hour, and so working full time for two days (16 hours) would earn $116, which we can round to $100 for ease of mental math.

Hypothetically, people could go to polling stations and be given $100 to show up and vote "I'm a sheep and I don't even care but I know I like money and so I'm not voting but I'm just gonna take the money".

Then you'd have to refuse the $100 to actually vote at normal strength.

Then you could pay $100 to vote with 2X weight.

And then for $333 you could vote with 3X weight, and pay $1000 to vote with 4X weight, and pay $3333 for 5X and pay $10,000 for 6X, and so on all the way up to paying billions of dollars in optional taxes?

Quadratic Voting was the new hotness for a while in mechanism design but it fundamentally presumes an "allocation of goodies to whoever wants the goodies the most" mindset. Some people want high taxes and large handouts because they are poor, and other people want low taxes and few handouts because they are rich, for one example, and presumably these "selfish motivations" in BOTH directions are "not really about ethics and fairness"? It probably connects to deep questions like the moral issue of compulsory charitable giving.
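For comparison, the standard Quadratic Voting rule prices v votes at v² credits, so the marginal vote gets steadily pricier; that's a different curve from the roughly geometric $100 / $333 / $1000 / ... schedule imagined above. A tiny sketch (the $100 unit price is an arbitrary choice, not part of any real proposal):

```python
def quadratic_cost(votes, unit_price=100):
    """Quadratic Voting: buying `votes` votes on one issue costs votes**2 price units,
    so the marginal cost of the 1st, 2nd, 3rd... vote is 1, 3, 5, ... units."""
    return unit_price * votes ** 2

# The post's hypothetical schedule above grows roughly geometrically instead (~3.3x per step).
post_schedule = {2: 100, 3: 333, 4: 1_000, 5: 3_333, 6: 10_000}

for extra_weight in range(2, 7):
    print(extra_weight, quadratic_cost(extra_weight), post_schedule[extra_weight])
```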

One nice thing about soliciting preference strengths is that revolutions are very very costly. If you have 60% of the population who wants to murder and eat the other 40% just a little bit, as one of many possible things they could eat (and the 40% would instantly choose to revolt against the government if the government tried to implement this policy), then letting the 40% pay the 60% a little bit of money to control the government despite being in the minority, and to use that control to make the government NOT try to kill them, would be cheaper and better for everyone overall?

Truth Solicitation?

A different frame would be that everyone is assumed to be enlightened, and wanting to know the truth, and express the truth, but uncertain.

Maybe people lean towards truth on average, and then we can use the Condorcet Jury Theorem to aggregate uncertainty into higher quality beliefs about what the best way for the polity to proceed would be?
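As a toy illustration of what the Condorcet Jury Theorem buys you (the numbers are illustrative, and the independence assumption is doing a lot of work): if each voter is right with probability just over one half, a large majority becomes very reliable; if just under one half, it becomes reliably wrong.

```python
from math import comb

def majority_correct(n_voters, p_correct):
    """Probability that a simple majority of independent voters is right,
    when each voter is right with probability p_correct (n_voters assumed odd)."""
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(n_voters // 2 + 1, n_voters + 1)
    )

for n in (11, 101, 1001):
    print(n, round(majority_correct(n, 0.55), 3))  # creeps toward 1.0 as n grows
```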

Then again... if you seriously wanted to get the truth, then presumably there are better ways to do this than forcing everyone to vote (ooh! the Participation criterion showed up again!): instead you could hire experts, use Bayesian Truth Serum, and have betting markets for a lot of it.

Maybe it depends on the complexity of the questions being faced? Maybe if the issues are very simple then everyone already knows the right answers and truth solicitation is pointless to optimize for, but if the issues are very complex, and being wrong would hurt a lot, then maybe an electoral system being performant on this dimension could be The Ultimate Thing To Get Right?

Moral Maze Resistance?

Something that often happens in organizations that exist for more than about 8 years (which is roughly how long someone is a CEO in most for-profit companies, and also the term limit for President) and have more than about 150 people (such that anonymity can creep in above that number) is that it turns into a Moral Maze ruled de facto according to the Iron Law Of Oligarchy at the top, and patrimonial bureaucratic norms in the middle.

When this happens, it is very common for the humans at the top to be there because they want to abuse their power for personal gain, deriving joy and wealth and reproductive success from the unbalanced exercise of social power, rather than engaging in servant leadership.

When political scientists look at polities, they find that if there is a single-party unicameral parliament with no proportional representation (especially not the kind that is resistant to gerrymandering), then you almost certainly will end up with rampant corruption. Forcing there to be >1 parties somehow helps reduce corruption. Making two different Houses have to agree on legislation that is finalized before either of them votes helps reduce corruption. Proportional representation might be a sufficient solution all by itself? Except when I searched again for new papers on this topic it apparently matters A LOT whether the proportional representation is "open list" vs "closed list". The closed list option is the bad one.

If you look at Wikipedia's awesome and good "Comparison Of Electoral Systems" you will not find "resistant to Moral Mazes and conducive to Low Corruption Multiparty Outcomes" as one of the criteria, even though this might be literally the most important thing?

But also the need for this might be of very low importance for a city state full of philosophically wise saints?

But also, if you're trying to reduce Forking, and trying to get people to Participate, then maybe no one will want to participate if they can't have a little bit of corruption... as a treat?

Anyway, there's a huge literature on this stuff, figuring out empirically what systems have the most coups, and most corruption, and so on and so forth. I'm not an expert on this literature, and that's why I'm asking a question rather than writing an essay <3

I honestly don't know.

Other Factors?

Surely I'm missing a lot of factors.

This is, after all, a post that is marked as a question.

What are the important factors to look at in a polity to help that polity even decide what the right desiderata are for picking an electoral system?




AI Futures Timelines and Takeoff Model: Dec 2025 Update

Published on December 31, 2025 10:34 PM GMT

We’ve significantly upgraded our timelines and takeoff models! The new model predicts when AIs will reach key capability milestones: for example, Automated Coder / AC (full automation of coding) and superintelligence / ASI (much better than the best humans at virtually all cognitive tasks). This post will briefly explain how the model works, present our timelines and takeoff forecasts, and compare it to our previous (AI 2027) models (spoiler: the AI Futures Model predicts about 3 years longer timelines to full coding automation than our previous model, mostly due to being less bullish on pre-full-automation AI R&D speedups).

If you’re interested in playing with the model yourself, the best way to do so is via this interactive website: aifuturesmodel.com

If you’d like to skip the motivation for our model and go straight to an explanation of how it works, go here. The website has a more in-depth explanation of the model (starts here; use the diagram on the right as a table of contents), as well as our forecasts.

Why do timelines and takeoff modeling?

The future is very hard to predict. We don't think this model, or any other model, should be trusted completely. The model takes into account what we think are the most important dynamics and factors, but it doesn't take into account everything. Also, only some of the parameter values in the model are grounded in empirical data; the rest are intuitive guesses. If you disagree with our guesses, you can change them on the interactive website.

Nevertheless, we think that modeling work is important. Our overall view is the result of weighing many considerations, factors, arguments, etc.; a model is a way to do this transparently and explicitly, as opposed to implicitly and all in our head. By reading about our model, you can come to understand why we have the views we do, what arguments and trends seem most important to us, etc.

The future is uncertain, but we shouldn’t just wait for it to arrive. If we try to predict what will happen, if we pay attention to the trends and extrapolate them, if we build models of the underlying dynamics, then we'll have a better sense of what is likely, and we'll be less unprepared for what happens. We’ll also be able to better incorporate future empirical data into our forecasts.

In fact, the improvements we’ve made to this model as compared to our timelines model at the time we published AI 2027 (Apr 2025) have resulted in a roughly 2-4 year lengthening of our median for full coding automation. This has primarily come from improving our modeling of AI R&D automation. These modeling improvements have resulted in a larger change in our views than the new empirical evidence that we’ve observed. You can read more about the shift below.

Why our approach to modeling? Comparing to other approaches

AGI[1] timelines forecasting methods

Trust the experts

Unfortunately, there is nothing close to an expert consensus, and it doesn’t seem like most experts have thought much about AGI forecasting (e.g. a 2023 survey observed huge framing effects depending on whether they asked for probabilities of milestones being achieved by certain years, or instead asked for years that correspond to percentiles). That 2023 survey of AI academics got an AGI median of 2047 or 2116, depending on the definition.[2] There’s also this aggregation of Metaculus and Manifold markets which estimates 50% by 2030. As for the people building the technology, they tend to be more bullish; the most extreme among them (Anthropic and OpenAI) say things like 2027 and 2028. For a survey of older predictions and how they’ve fared, see this.

Given that experts disagree with each other and mostly seem to have not thought deeply about AGI forecasting, we think it’s important to work to form our own forecast.

Intuition informed by arguments

Can the current paradigm scale to AGI? Does it lack something important, like common sense, true original thinking, or online/continual learning (etc.)? Questions like these are very important and there are very many of them, far too many to canvass here. The way this method works is that everyone ingests the pile of arguments and considerations and makes up their own minds about which arguments are good and how they weigh against each other. This process inherently involves intuition/subjective-judgment, which is why we label it as “intuition.”

Which is not to denigrate it! We think that any AI forecaster worth their salt must engage in this kind of argumentation, and that generally speaking the more facts you know, the more arguments you’ve considered and evaluated, the more accurate your intuitions/vibes/judgments will become. Also, relatedly, your judgment about which models to use, and how much to trust them, will get better too. Our own all-things-considered views are only partially based on the modelling we’ve done; they are also informed by intuitions.

But we think that there are large benefits to incorporating quantitative models into our forecasts: it’s hard to aggregate so many considerations into an overall view without using a quantitative framework. We’ve also found that quantitative models help prioritize which arguments are most important to pay attention to. And our best guess is that overall, forecasts by quantitative trend extrapolation have a better historical track record than intuitions alone.

Revenue extrapolation

Simple idea: extrapolate AI revenue until it’s the majority of world GDP. Of course, there’s something silly about this; every previous fast-growing tech sector has eventually plateaued… That said, AI seems like it could be the exception, because in principle AI can do everything. Now that AI is a major industry, we think this method provides nonzero evidence. According to this Epoch dataset, frontier AI company revenue is something like $20B now and growing around 4.1x/yr. This simple extrapolation gets to $100T annualized revenue around the end of 2031.[3]
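A back-of-the-envelope check on that extrapolation, using only the figures quoted above (~$20B/yr growing ~4.1x/yr, targeting ~$100T/yr):

```python
from math import log

current_revenue = 20e9     # ~$20B/yr frontier AI company revenue (figure from the post)
growth_per_year = 4.1      # ~4.1x/yr growth (figure from the post)
target_revenue = 100e12    # ~$100T/yr, i.e. on the order of world GDP

years_needed = log(target_revenue / current_revenue) / log(growth_per_year)
print(round(years_needed, 1))  # ~6 years, i.e. roughly the end of 2031 if the trend holds
```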

We give weight to revenue extrapolation in our all-things-considered views, but on the other hand revenue trends change all the time and we’d like to predict the underlying drivers of how it might change. Also, it’s unclear what revenue threshold counts as AGI. Therefore, we want to specifically extrapolate AI capabilities.

Compute extrapolation anchored by the brain

The basic idea is to estimate how much compute it would take to get AGI, anchored by the human brain. Then predict that AGI will happen when we have that much compute. This approach has gone through a few iterations:

  1. Hans Moravec, Ray Kurzweil, and Shane Legg pioneered this method, predicting based on the number of operations per second the human brain performs. In 1988 Moravec predicted AGI in 2010, then in 1999 he revised that to 2040. Around 2000, Kurzweil and Legg each predicted AGI in the late 2020s.[4]
  2. Ajeya Cotra’s 2020 biological anchors report instead predicted AGI[5] based on how much compute it would take to train a model matching the human brain. Cotra also estimated how much algorithmic progress would be made, converting it into the equivalent of training compute increases to get “effective compute”. The report predicted a median of 2050.

Davidson’s Full Takeoff Model and Epoch’s GATE used the same method as bio anchors to determine the AGI training compute requirement, but they also modeled how AI R&D automation would shorten timelines. They modeled automation by splitting up AI software and hardware R&D into many tasks, then forecasting the effective compute gap between 20% task automation and 100% automation. The percentage of tasks automated, along with experiment compute and automation compute, determine the magnitude of inputs to AI R&D. These inputs are converted to progress in software efficiency using a semi-endogenous growth model. Software efficiency is then multiplied by training compute to get effective compute.
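For intuition, here is a toy discrete-time sketch of the kind of semi-endogenous law of motion these models use. This is not the FTM's or GATE's actual equations, and the constants are made up: software efficiency grows with research input but gets harder to improve as it rises, and effective compute is software efficiency times physical training compute.

```python
def step_software_efficiency(S, research_input, k=0.3, lam=0.5, beta=0.5, dt=1.0):
    """One step of a generic semi-endogenous growth law: growth in software efficiency S
    rises with research input but falls as S itself grows (ideas get harder to find).
    k, lam, beta are illustrative placeholders, not fitted values."""
    growth_rate = k * research_input**lam * S**(-beta)
    return S * (1 + growth_rate * dt)

S, training_compute = 1.0, 1.0
for year in range(1, 6):
    S = step_software_efficiency(S, research_input=1.0)
    training_compute *= 4.0                                   # illustrative compute scaling
    print(year, round(S, 2), round(S * training_compute, 2))  # software efficiency, effective compute
```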

At the time the FTM was created it predicted AGI in 2040, with the parameter settings chosen by Davidson. But both compute and algorithmic progress have been faster than they expected. When the FTM is updated to take into account this new data, it gives shorter medians in the late 2020s or early 2030s. Meanwhile, with GATE’s median parameters, it predicts AGI in 2034.

Overall, this forecasting method seems to us to have a surprisingly good track record: Moravec, Kurzweil, and Legg especially look to have made predictions a long time ago that seem to hold up well relative to what their contemporaries probably would have said. And our model follows these models by modeling training compute scaling, though in most of our simulations the majority of progress toward AGI comes from software.

Capability benchmark trend extrapolation

This is our approach! We feel that now, in 2025, we have better evidence regarding the AGI effective compute requirement than comparisons to the human brain: specifically, we can extrapolate AIs’ performance on benchmarks. This is how the timelines portion of our model works. We set the effective compute required for AGI by extrapolating METR’s coding time horizon suite, METR-HRS.


We think it’s pretty great. Benchmark trends sometimes break, and benchmarks are only a proxy for real-world abilities, but… METR-HRS is the best benchmark currently available for extrapolating to very capable AIs, in our opinion. We think it’s reasonable to extrapolate that straight line into the future for at least the next few years.[6]

METR itself did a simple version of this extrapolation which assumed exponential growth in time horizons in calendar time. But this doesn’t account for AI R&D automation, changes to human labor or compute growth, or the possibility of time horizon doublings getting easier or harder at higher horizons.[7]
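As a concrete (and deliberately oversimplified) version of that calendar-time extrapolation: assume a fixed doubling time and solve for when the horizon crosses some target. The starting horizon, doubling time, and target below are placeholders, not METR's measurements or our fitted values.

```python
from math import log2

current_horizon_hours = 2.0    # placeholder: today's 80% time horizon
doubling_time_months = 7.0     # placeholder: assumed constant doubling time
target_horizon_hours = 160.0   # placeholder: e.g. a working month of autonomous coding

doublings = log2(target_horizon_hours / current_horizon_hours)
print(round(doublings, 1), round(doublings * doubling_time_months, 1))  # ~6.3 doublings, ~44 months
```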

Our previous timelines model took all of these into account, though more crudely than our new AI Futures Model. Our previous model with median parameters predicted superhuman coder (SC) medians of 2027 to 2028, while our new model predicts 2031. The difference mostly comes from improvements to how we’re modeling AI R&D automation. See below for details.

Post-AGI takeoff forecasts

The literature on forecasting how capabilities progress after full automation of AI R&D is even more nascent than that which predicts AGI timelines. Past work has mostly fallen into one of two buckets:

  1. Qualitative arguments or oversimplified calculations sketching why takeoff might be fast or slow: for example, Intelligence Explosion Microeconomics by Eliezer Yudkowsky (arguing for fast takeoff) or Takeoff speeds by Paul Christiano (arguing for slow takeoff).[8]
  2. Models of the software intelligence explosion (SIE), i.e. AIs getting faster at improving their own capabilities without additional compute: in particular, How quick and big would a software intelligence explosion be? by Davidson and Houlden.[9]

As in timelines forecasting, we think that qualitative arguments are valuable but we think that modeling is a useful complement to qualitative arguments.

Davidson and Houlden focus primarily on trends in how much more efficiently AIs have been able to achieve the same performance when determining whether there will be an SIE.[10] Meanwhile, we focus on estimates of the quality of AIs’ research taste, i.e. how good the AI is at choosing research directions, selecting and interpreting experiments, etc. We think that focusing on research taste quality is a more useful lens from which to view a potential SIE. If there’s an SIE we expect that it will primarily be driven by improvements in research taste.

Furthermore, because our takeoff model is integrated into a more expansive quantitative model, we have other advantages relative to Davidson and Houlden. For example, we can account for increases in the AGI project’s compute supply.[11]

How our model works

On the web app, there’s an interactive diagram explaining the parts of the model and how they relate to each other, with a corresponding full model explanation:

Here we’ll just give a brief overview.

Our model’s primary output is the trajectory of AIs’ abilities to automate and accelerate AI software R&D. We also include milestones tracking general capabilities, but these are calculated very roughly.

Our model can intuitively be divided into 3 stages. Although the same formulas are used in Stages 1, 2, and 3, new dynamics emerge at certain milestones (Automated Coder, Superhuman AI Researcher), and so these milestones delineate natural stages.

Stage 1: Automating coding

First we’ll discuss how our model predicts when coding will be fully automated. Stage 1 predicts when an Automated Coder (AC) arrives.

Automated Coder (AC). An AC can fully automate an AGI project's coding work, replacing the project's entire coding staff.[12]

Our starting point is to take the METR graph and extrapolate it exponentially, as they do, making a guess about what agentic coding time horizon would correspond to the AC milestone.

However, this simple extrapolation misses out on many important factors, such as:

  • The inputs to AI progress — most notably compute, but also labor, data, etc. — won’t keep growing at the same rates forever. There’s a significant chance that growth rates will slow in the near future e.g. as we run up against limits of chip production, investment, recruiting pipelines, energy, etc. This could cause the trend to bend downwards.
  • Automation of AI R&D. Already many AI researchers claim that AI is accelerating their work.[13] The extent to which it is actually accelerating their work is unfortunately unclear, but probably there is a nonzero effect already and probably this acceleration effect will increase as AIs become more capable. This could cause the trend to bend upwards.
  • Superexponential time horizon growth (independent from AI R&D automation). Eventually there will be AI systems which outperform humans at all horizon lengths; therefore, the trend should eventually shoot to infinity.[14] So we think we should use a superexponential trend rather than an exponential trend. (This is confusing and depends on how you interpret horizon lengths, see here for more discussion. If you disagree with this, our model allows you to use an exponential trend if you like, or even subexponential.)

Our model up through AC still centrally involves the METR trend,[15] but it attempts to incorporate the above factors and more. It also enables us to better represent/incorporate uncertainty, since we can do Monte Carlo simulations with different parameter settings.
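To make the exponential vs. superexponential choice concrete, here is a toy comparison in which each successive doubling of the horizon takes a fixed fraction of the previous doubling's duration; with that fraction below 1, unboundedly many doublings fit in a finite window, so the horizon shoots to infinity in finite time. All numbers are illustrative, not the model's fitted parameters.

```python
def months_until_horizon(current_hours, target_hours, first_doubling_months, shrink):
    """Months until the time horizon reaches target_hours, if each successive doubling
    takes `shrink` times as long as the previous one (shrink=1.0 is plain exponential,
    shrink<1.0 superexponential, shrink>1.0 subexponential)."""
    months, doubling, horizon = 0.0, first_doubling_months, current_hours
    while horizon < target_hours:
        months += doubling
        horizon *= 2
        doubling *= shrink
    return round(months, 1)

print(months_until_horizon(2.0, 160.0, 7.0, shrink=1.0))   # exponential baseline: 49.0 months
print(months_until_horizon(2.0, 160.0, 7.0, shrink=0.85))  # superexponential: ~31.7 months
# With shrink < 1 the total time for unboundedly many doublings is finite:
# first_doubling / (1 - shrink) = 7 / 0.15 ≈ 47 months in this toy example.
```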

Stage 2: Automating research taste

Besides coding, we track one other type of skill that is needed to automate AI software R&D: research taste. While automating coding makes an AI project faster at implementing experiments, automating research taste makes the project better at setting research directions, selecting experiments, and learning from experiments.

Stage 2 predicts how quickly we will go from an automated coder (AC) to a Superhuman AI researcher (SAR), an AI with research taste matching the top human researcher.

Superhuman AI Researcher (SAR): A SAR can fully automate AI R&D, making all human researchers obsolete.[16]

The main drivers of how quickly Stage 2 goes are:

  1. How much automating coding speeds up AI R&D. This depends on a few factors, for example how severely the project gets bottlenecked on experiment compute.
  2. How good AIs' research taste is at the time AC is created. If AIs are better at research taste relative to coding, Stage 2 goes more quickly.
  3. How quickly AIs get better at research taste. For a given amount of inputs to AI progress, how much more value does one get per experiment?

Stage 3: The intelligence explosion

Finally, we model how quickly AIs are able to self-improve once AI R&D is fully automated and humans are obsolete. Stage 3 ends with capabilities asymptoting at the limits of intelligence.

The primary milestones we track in Stage 3 are:

  1. Superintelligent AI Researcher (SIAR). The gap between a SIAR and the top AGI project human researcher is 2x greater than the gap between the top AGI project human researcher and the median researcher.[17]
  2. Top-human-Expert-Dominating AI (TED-AI). A TED-AI is at least as good as top human experts at virtually all cognitive tasks. (Note that the translation in our model from AI R&D capabilities to general capabilities is very rough.)[18]
  3. Artificial Superintelligence (ASI). The gap between an ASI and the best humans is 2x greater than the gap between the best humans and the median professional, at virtually all cognitive tasks.[19]

In our simulations, we see a wide variety of outcomes ranging from a months-long takeoff from SAR to ASI, to a fizzling out of the intelligence explosion requiring further increases in compute to get to ASI.

To achieve a fast takeoff, there usually needs to be a feedback loop such that each successive doubling of AI capabilities takes less time than the last. In the fastest takeoffs, this is usually possible via a taste-only singularity, i.e. the doublings would get faster solely from improvements in research taste (with no improvements in coding, or extra compute). Whether a taste-only singularity occurs depends on which of the following dominates:

  1. The rate at which (experiment) ideas become harder to find. Specifically, how much new “research effort” is needed to achieve a given increase in AI capabilities.
  2. How quickly AIs' research taste improves. For a given amount of inputs to AI progress, how much more value does one get per experiment?

Continued improvements in coding automation matter less and less, as the project gets bottlenecked by their limited supply of experiment compute.

Timelines and takeoff forecasts

The best place to view our results is at https://www.aifuturesmodel.com/forecast.

In this section we will discuss both our model’s outputs and our all-things-considered views. As previously mentioned, we are uncertain, and don’t blindly trust our models. Instead we look at the results of the model but then ultimately make adjustments based on intuition and other factors. Below we describe the adjustments that we make on top of this model, and the results.

Eli

Here is the model’s output with my parameters along with my all-things-considered views.


To adjust for factors outside of the model, I’ve lengthened timelines (median from late 2030 to mid 2032), driven primarily by unknown model limitations and mistakes and the potential for data bottlenecks that we aren’t modeling. In summary:

  1. Unknown model limitations and mistakes. With our previous (AI 2027) timelines model, my instinct was to push my overall forecasts longer due to unknown unknowns, and I’m glad I did. My median for SC was 2030 as opposed to the model’s output of Dec 2028, and I now think that the former looks more right. I again want to lengthen my overall forecasts for this reason, but less so because our new model is much more well-tested and well-considered than our previous one, and is thus less likely to have simple bugs or unknown simple conceptual issues.
  2. Data bottlenecks. Our model currently makes the implicit assumption that data progress is proportional to algorithmic progress. But in practice, data could turn out to be either more or less of a bottleneck than that. My guess is that modeling data would lengthen timelines a bit, at least in cases where synthetic data is tough to fully rely upon.

I will also increase the 90th percentile from the model's 2062. My all-things-considered distribution is: 10th percentile 2027.5, 50th percentile 2032.5, 90th percentile 2085. You can see all of the adjustments that I considered in this supplement.

Now I’ll move on to takeoff.

To get my all-things-considered views I: increase the chance of fast takeoff a little (I change AC to ASI in <1 year from 26% to 30%), and further increase the chance of <3 year takeoffs (I change the chance of AC to ASI in <3 years from 43% to 60%).

The biggest reasons I make my AI-R&D-specific takeoff a bit faster are:

  1. Automation of hardware R&D, hardware production, and general economic automation. We aren’t modeling these, and while they have longer lead times than software R&D, a year might be enough for them to make a substantial difference.
  2. Shifting to research directions which are less compute bottlenecked might speed up takeoff, and isn’t modeled. Once AI projects have vast amounts of labor, they can focus on research which loads more heavily on labor relative to experiment compute than current research.

(1) leads me to make a sizable adjustment to the tail of my distribution. I think modeling hardware and economic automation would make it more likely that if there isn’t a taste-only singularity, we still get to ASI within 3 years.

I think that, as with timelines, for takeoff unknown limitations and mistakes in expectation point towards things going slower. But unlike with timelines, there are counter-considerations that I think are stronger. You can see all of the adjustments that I considered in this supplement.

Daniel

First, let me say a quick prayer to the spirit of rationality, who infrequently visits us all:

On the subject of timelines, I don’t immediately know whether my all-things-considered view should be more or less bullish than the model. Here are a few considerations that seem worth mentioning to me:

  • First of all, this model is in-the-weeds / gearsy. (Some people might call it “inside-viewy” but I dislike that term.) I think it’s only appropriate to use models like this if you’ve already thought through more straightforward/simple considerations like “Is the phenomenon in question [AGI] even possible at all? Do serious experts take it seriously? Are there any obvious & solid arguments for why this is a nothingburger?” I have thought through those kinds of things, and concluded that yes, AGI arriving in the next decade seems a very serious possibility indeed, worthy of more gearsy investigation. If you disagree or are curious what sorts of considerations I’m talking about, a partial list can be found in this supplement.
  • I think this model is the best model of AI R&D automation / intelligence explosion that currently exists, but this is a very poorly understood phenomenon and there’s been very little attention given to it, so I trust this model less when it comes to takeoff speeds than I do when it comes to timelines. (And I don’t trust it that much when it comes to timelines either! It’s just that there isn’t any single other method I trust more…)
  • I notice a clash between what the model says and my more intuitive sense of where things are headed. I think probably it is my intuitions that are wrong though, which is why I’ve updated towards longer timelines; I’m mostly just going with what the model says rather than my intuitions. However, I still put some weight on my intuitive sense that, gosh darn it, we just aren’t more than 5 years away from the AC milestone – think about how much progress has happened over the last 5 years! Think about how much progress in agentic coding specifically has happened over the last year!
  • More detail on vibes/intuitions/arguments:
    • I’ve been very unimpressed by the discourse around limitations of the current paradigm. The last ten years have basically been one vaunted limitation after another being overcome; Deep Learning has hit a wall only in the sense that Godzilla has hit (and smashed through) many walls.
    • However, two limitations do seem especially plausible to me: Online/continual learning and data efficiency. I think there has been some progress in both directions over the past years, but I’m unclear on how much, and I wouldn’t be that surprised if it’s only a small fraction of the distance to human level.
    • That said, I also think it’s plausible that human level online/continual learning is only a few years away, and likewise for data-efficiency. I just don’t know. (One data point: claim from Anthropic researcher)
    • Meanwhile, I’m not sure either of those things are necessary for AI R&D to accelerate dramatically due to automation. People at Anthropic and OpenAI already report that things are starting to speed up due to AI labor, and I think it’s quite plausible that massively scaled-up versions of current AI systems (trained on OOMs more diverse RL environments, including many with OOMs longer horizon lengths) could automate all or almost all of the AI R&D process. The ability to learn from the whole fleet of deployed agents might compensate for the data-inefficiency, and the ability to manage huge context window file systems, update model weights regularly, and quickly build and train on new RL environments might compensate for lack of continual learning.
    • And once AI accelerates dramatically due to automation, paradigm shifts of the sort mentioned above will start to happen soon after.
    • Summing up: Qualitatively, my intuitive sense of what’s going to happen in the next few years is, well, basically the same sequence of events described in AI 2027, just maybe taking a year or two longer to play out, and with various other minor differences (e.g. I don’t expect any one company to have as much of a lead as OpenBrain does in the scenario).
  • I’m also quite nervous about relying so much on the METR horizon trend. I think it’s the best single source of evidence we have, but unfortunately it’s still pretty limited as a source of evidence.
    • It is uncertain how it’ll extrapolate into the future (exponential or superexponential? If superexponential, how superexponential? Or should we model new paradigms as a % chance per year of changing the slope? What even is the slope right now, it seems to maybe be accelerating recently?)
    • …and also uncertain how to interpret the results (is a 1 month 80% horizon enough? Or do we need 100 years?).
    • There are also some imperfections in the methodology which complicate things. E.g. if I understand correctly the human baseliners for the various tasks were not of the same average skill level, but instead the longer-horizon tasks tended to have higher-skill human baseliners. Also, the sigmoid fit process is awkwardly non-monotonic, meaning there are some cases in which a model getting strictly better (/worse) at some bucket of tasks can decrease (/increase) its METR-reported horizon length! My guess is that these issues don’t make a huge difference in practice, but still. I hope that a year from now, it becomes standard practice for many benchmark providers to provide information about how long it took human baseliners to complete the tasks, and the ‘skill level’ of the baseliners. Then we’d have a lot more data to work with.
    • Also, unfortunately, METR won’t be able to keep measuring their trend forever. It gets exponentially more expensive for them to build tasks and collect human baselines as the tasks get exponentially longer. I’m worried that by 2027, METR will have basically given up on measuring horizon lengths, which is scary because then we might not be able to tell whether horizon lengths are shooting up towards infinity or continuing to grow at a steady exponential pace.
    • I think a much better trend to extrapolate, if only we had the data, would be coding uplift. If we had e.g. every 6 months for the past few years a high-quality coding uplift study, we could then extrapolate that trend into the future to predict when e.g. every engineer would be a 10x engineer due to AI assistance. (Then we’d still need to predict when research taste would start to be noticeably uplifted by AI / when AIs would surpass humans in research taste; however, I think it’s a reasonable guess right now that when coding is being sped up 10x, 100x, etc. due to highly autonomous AI coding agents, research taste should be starting to improve significantly as well.[20] At least I feel somewhat better about this guess than I do about picking any particular threshold of METR horizon length and guessing that it corresponds to a particular level of experiment selection skill, which is what we currently do.)
  • Relatedly, I’m also interested in the simple method of extrapolating AI revenue growth trends until AI revenue is most of the world economy. That seems like a decent proxy for when AGI will be achieved. I trust this method less than our model for obvious reasons, but I still put some weight on it. What does it say? Well, it says “Early 2030s.” OK.
  • I’m also interested in what our model says with a pure exponential trend extrapolation for METR instead of the superexponential (I prefer the superexponential on theoretical grounds, though note also that there seems to be a recent speeding up of the METR trend and a corresponding speedup in the trend on other benchmarks). Pure exponential trend, keeping my other parameters fixed, gets to AC 5 years later, in 2034. That said, if we use the more recent ~4 month doubling time that seems to characterize the RL era, even an exponential trend gets to AC in 2030, keeping other parameters fixed. I’m not sure I should keep my other parameters fixed though, particularly the AC coding time horizon requirement seems kinda up in the air since the change to exponential slope corresponds to a change in how I interpret horizon lengths in general.[21]
    • One factor weighing on my mind is the apparent recent speedup in AI capabilities progress–e.g. the slope of the METR trend seems notably higher since 2024 than it was before. This could be taken as evidence in favor of a (more) superexponential trend overall…
    • However, I’m currently leaning against that interpretation, for two reasons. First, the speedup in the trend isn’t just for the METR trend, it’s also for other benchmarks, which are not supposed to be superexponential. Secondly, there’s another very plausible explanation for what’s going on, which is that starting in 2024 the companies started scaling up RL a lot. But they won’t be able to keep scaling it at the same pace, because they’ll run into headwinds as RL becomes the majority of training compute. So on this view we should expect the rate of growth to revert towards the long-run average starting about now (or however long it takes for RL compute to become the majority of total training compute).
    • That said, I still think it’s plausible (though not likely) that actually what we are seeing is the ominous uptick in the rate of horizon length growth that is predicted by theory to happen a year or two before horizon lengths shoot to infinity.
  • Also, like Eli said above, I feel that I should err on the side of caution and that for me that means pushing towards somewhat longer timelines.
  • Finally, I have some private info which pushes me towards somewhat shorter timelines in expectation. My plan is to circle back in a month or three when more info is available and update my views then, and I currently expect this update to be towards somewhat shorter timelines though it’s unclear how much.

Weighing all these considerations, I think that my all-things-considered view on timelines will be to (1) push everything back one year from what the model says. So, my median for the automated coder milestone is 2030 instead of 2029, and my median for the superhuman AI researcher milestone is 2031 instead of 2030.

In addition to that, I’ll (2) increase the uncertainty in both directions somewhat, so that there’s a somewhat greater chance of things going crazy in the next year (say, 9% by EOY 2026) and also a somewhat greater chance of things taking decades longer (say, still 6% that there’s no AGI even in 2050).

So, here’s my all-things-considered distribution as of today, Dec 30 2025:

On takeoff speeds:

I think my thoughts on this are pretty similar to Eli’s, modulo differences implied by our different parameter settings. Basically, take what the model (with my parameters) says, and then shift some probability mass away from the slower end and put it on the faster end of the range.

Also, whereas our model says that takeoff speeds are correlated with timelines such that shorter timelines also tends to mean faster takeoff, I’m not sure that’s correct and want to think about it more. There’s a part of me that thinks that on longer timelines, takeoff should be extremely fast due to the vast amounts of compute that will have piled up by then and due to the compute-inefficiency of whatever methods first cross the relevant thresholds by then.

So here’s a quick distribution I just eyeballed:

What info I’ll be looking for in the future & how I’ll probably update:

  • Obviously, if benchmark trends (especially horizon length) keep going at the current pace or accelerate, that’ll be an update towards shorter timelines. Right now I still think it’s more likely than not that there’ll be a slowdown in the next year or two.
  • I’m eager to get more information about coding uplift. When we have a reliable trend of coding uplift to extrapolate, I’ll at the very least want to redo my estimates of the model parameters to fit that coding uplift trend, and possibly I’d want to rethink the model more generally to center on coding uplift instead of on horizon length.
  • If AI revenue growth stays strong (e.g. 4xing or more in 2026) that’s evidence for shorter timelines vs. if it only grows 2x or less that’s evidence for longer timelines.
  • I’m eager to get more information about the ‘slope’ of the performance-as-a-function-of-time graph for various AI models, to see if it’s been improving over time and how far away it is from human performance. (See this discussion) This could potentially be a big update for me in either direction.
  • As for takeoff speeds, I’m mostly interested in thinking more carefully about that part of our model and seeing what improvements can be made.[22] I don’t think there’ll be much empirical evidence one way or another in the next year. Or rather, I think that disputes about the proper way to model takeoff matter more than evidence about the value of various parameters, at this stage. That said, I’ll be keen to get better estimates of some of the key parameters too.
  • Of course I’m also interested to hear the feedback/criticism/etc. from others about the model and the parameters and the overall all things considered view. I wouldn’t be surprised if I end up changing my mind significantly on the basis of arguments I haven’t thought of yet.
  • …this list is nowhere near exhaustive but that’s enough for now I guess.

Comparison to our previous (AI 2027) timelines and takeoff models

These sections focus specifically on the model results with Eli’s parameter estimates (for both the AI Futures Model and the AI 2027 model).

Timelines to Superhuman Coder (SC)

This section focuses on timelines to superhuman coder (SC), which was our headline milestone in our AI 2027 timelines model: an SC is an AI that can autonomously be as productive as an AGI project modified so that all of its coders are as competent as their best coder, each sped up by 30x, with 30 copies of each of them.[23]

We’ll discuss only the AI 2027 time horizon extension model in this section, due to it being simpler than the benchmarks and gaps version.[24] Below we compare the forecasted distribution of the AI 2027 model against that of the AI Futures Model.

We see that the AI Futures Model median is 4 years later than the AI 2027 model, and that it assigns an 11% chance that SC happens before the time horizon extension’s median. From now onward, we will focus on the trajectory with median parameters rather than distributions of SC dates, for ease of reasoning.

The AI 2027 time horizon extension model, with parameters set to their median values, predicts SC in Jan 2027 given superexponential-in-effective-compute time horizon growth, and SC in Sep 2028 given exponential time horizon growth. Meanwhile, the new model with median parameters predicts SC in Feb 2032. This is a 3.5-5 year difference! From now on we’ll focus on the 5 year difference, i.e. consider superexponential growth in the time horizon extension model. This is a closer comparison because in our new model, our median parameter estimate predicts superexponential-in-effective-compute time horizon growth.

The biggest reason for this difference is that we model pre-SC AI R&D automation differently, which results in such automation having a much smaller effect in our new model than in the AI 2027 one. The 5 year increase in median comes from:

  1. Various parameter estimate updates: ~1 year slower. These are mostly changes to our estimates of parameters governing the time horizon progression. Note that 0.6 years of this is from the 80% time horizon progression being slower than our previous median parameters predicted, but since we are only looking at 80% time horizons we aren’t taking into account the evidence that Opus 4.5 did well on the 50% time horizon.
  2. Less effect from AI R&D automation pre-SC: ~2 years slower. This is due to:
    1. Taking into account diminishing returns: The AI 2027 timelines model wasn’t appropriately taking into account diminishing returns to software research. It implicitly assumes that exponential growth in software efficiency is not getting “harder” to achieve, such that if AIs gave a software R&D uplift of 2x in perpetuity, the software efficiency growth rate would speed up by 2x in perpetuity. We hadn’t realized we were making this implicit assumption, and have now fixed it.
    2. Less AI software R&D uplift from pre-SC AIs: The interpolation method used to get AI software R&D uplift values in the AI 2027 model in between present day and SC gave much higher intermediate values than the uplift we end up with in our new model. We previously modeled 50% of the way to SC in effective compute OOMs as resulting in 50% of the way to SC in terms of log(uplift), but our new model is more pessimistic. Partially, this is because the AI 2027 model had a bug in how AI software R&D was interpolated between present AIs and SC. But that only accounts for half of the difference; the other half comes from us choosing an interpolation method that was more optimistic about pre-SC speedups than the AI Futures Model.
  3. Compute and labor input time series adjustments: ~1 year slower. That is, we now project slower growth in the leading AI project’s compute amounts and in their human labor force. Read about the AI Futures Model’s input time series here.
  4. Modeling experiment compute: ~1 year slower. Previously we were only modeling labor as an input to software progress, not experiment compute.

You can read more about these changes and their effects in our supplementary materials.

Takeoff from Superhuman Coder onward

The AI Futures Model predicts a slower median takeoff than our AI 2027 takeoff model. Below we graph each of their forecasted distributions for how long it will take to go from SC to ASI.

We see that while the AI Futures Model’s median is longer than the AI 2027 one, it still puts 45% probability on takeoff being as fast as AI 2027’s median. On the other hand, the AI Futures Model predicts a higher chance of takeoff within 10 years, 20 years, etc. Our new model is less “binary” in the sense that it gives lower probability to very fast or very slow takeoffs. This is because the AI Futures Model models compute increases.[25]

The reason the AI Futures Model gives a lower chance of fast takeoffs is primarily that we rely on a new framework for estimating whether there’s an SIE and how aggressive it is.

Our AI 2027 takeoff model predicted the progression of capabilities post-SC. Its methodology was also fairly simple. First, we enumerated a progression of AI capability milestones, with a focus on AI R&D capabilities, though we think general capabilities will also be improving. Then, for each gap between milestones A and B, we:

  1. Human-only time: Estimated the time required to go from milestone A to B if only the current human labor pool were doing software research.
  2. AI R&D progress multiplier (what we now call AI software R&D uplift, or just AI R&D uplift): Forecasted how much AI R&D automation due to each of milestones A and B will speed up progress, then ran a simulation in which the uplift is interpolated between these two values over time to get a forecasted distribution for the calendar time between A and B (see the sketch below).
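Here is a toy version of that two-step recipe. The log-linear interpolation and all of the numbers are illustrative stand-ins rather than the AI 2027 model's actual choices: take the human-only time for the A-to-B gap, assume the uplift rises smoothly from A's value to B's value as the gap is traversed, and integrate to get calendar time.

```python
import numpy as np

def calendar_time(human_only_years, uplift_at_A, uplift_at_B, steps=10_000):
    """Calendar time to cross a milestone gap, if the AI R&D uplift interpolates
    log-linearly from uplift_at_A to uplift_at_B as the human-only work is completed."""
    progress = np.linspace(0.0, 1.0, steps)
    uplift = np.exp(np.log(uplift_at_A) + progress * (np.log(uplift_at_B) - np.log(uplift_at_A)))
    # each slice of human-only work gets done `uplift` times faster than humans alone would do it
    return float(np.sum((human_only_years / steps) / uplift))

# e.g. a gap that would take humans alone 10 years, with uplift rising from 5x to 25x
print(round(calendar_time(10.0, uplift_at_A=5.0, uplift_at_B=25.0), 2))  # ~1 year
```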


In order to estimate some of the human-only time parameters, the AI 2027 takeoff forecast relied on a parameter it called r, which controlled the diminishing returns to AI R&D. It was crudely estimated by backing out the implied r from the first human-only time requirement, which was to get from SC to SAR.

The AI 2027 model assumed that there were no compute increases; under this assumption, if r>1 then successive doublings of AI R&D uplift (what we previously called the progress multiplier) get faster over time after full AI R&D automation. Others have referred to this possibility as a software intelligence explosion (SIE). In the model, each doubling took about 0.7x as long as the previous: we’ll call the ratio of successive uplift doubling times b from here onward, i.e. b<1 means successive doublings are faster and we get an SIE.[26]

In the AI Futures Model, the condition for an SIE is more complicated because we model multiple types of AI R&D; we also include compute increases, departing significantly from the behavior of an SIE. That said, there is a similar understandable concept in our model: a taste-only singularity (TOS). This is the situation in which, after full AI R&D automation and with only research taste improvements (no extra coding or compute), successive doublings of AI R&D uplift get faster over time. To make the analysis much simpler, we also ignore the limits of intelligence; these usually don’t greatly affect the takeoff to ASI, but they do slow progress down somewhat.

Under these assumptions, we can define a similar b to that analyzed in an SIE.

We estimate b by combining the following parameters:[27]
(a) the ratio of top to median researchers' value per selected experiment
(b) how quickly AIs improve at research taste as effective compute increases
(c) the rate at which software R&D translates into improved software efficiency (intuitively, the rate at which ideas are getting harder to find).

When using this framework, we get a less aggressive result (with our median parameters). Given that (a) was explicitly estimated in the AI 2027 model, and that we have a fairly aggressive estimate of (c) in the new model, implicitly most of the difference in results is coming from (b), how quickly AIs improve at research taste. We estimated this in our new model by looking at historical data on how quickly AIs have moved through the human range for a variety of metrics (more on that here).

With the AI 2027 model’s median parameters, each successive doubling of uplift took roughly 66% of the length of the previous (i.e. b=0.7).[28] The AI Futures Model’s distribution of b is below.

In the AI Futures Model, in the median case, there isn’t a TOS: each doubling would take 20% longer than the previous if taste were the only factor.[29] But we have high uncertainty: 38% of our simulations say that successive doublings get faster, and 17% are at least as aggressive as the AI 2027 model (i.e. b<0.7).[30]
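To see what these b values mean in practice, here is a tiny calculation (the one-year first doubling time is an arbitrary placeholder): with b < 1 the doubling times form a convergent geometric series, so unboundedly many uplift doublings fit into a finite window, whereas with b > 1 each doubling stretches out.

```python
def time_for_doublings(first_doubling_years, b, n_doublings):
    """Total time for n successive uplift doublings when each doubling
    takes b times as long as the previous one."""
    return first_doubling_years * sum(b**k for k in range(n_doublings))

for b in (0.7, 1.2):  # ~ the AI 2027 median vs. the AI Futures Model's median taste-only case
    totals = [round(time_for_doublings(1.0, b, n), 1) for n in (5, 10, 20)]
    ceiling = round(1.0 / (1 - b), 1) if b < 1 else None  # finite limit only when b < 1
    print(b, totals, "limit:", ceiling if ceiling else "unbounded")
```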

Remember that unlike the AI 2027 model, the AI Futures Model models compute increases; also in practice coding automation contributes some to takeoffs.[31] Therefore, at similar levels of the separate bs we’ve defined here, takeoff in the AI Futures Model is faster.

Faster takeoffs are also correlated in our model with shorter timelines: when we filter for simulations that achieve SC in 2027, 35% of them have a b lower than the AI 2027 model’s median parameters. This is because some parameters lead to larger effects from automation both before and after SC, and furthermore we specified that there be correlations between parameters that govern how quickly coding abilities improve, and how quickly research taste abilities improve.

For further analysis of the differences between our AI 2027 and new takeoff models, see our supplementary materials.

  1. AGI stands for Artificial General Intelligence, which roughly speaking means AI that can do almost everything. Different people give different definitions for it; in our work we basically abandon the term and define more precise concepts instead, such as AC, SIAR, TED-AI, etc. However, we still use the term AGI when we want to vaguely gesture at this whole bundle of concepts rather than pick out one in particular. For example, we’ve titled this section “AGI timelines…” and the next section “Post-AGI takeoff…” because this section is about estimating how many years there’ll be until the bundle of milestones starts to be reached, and the next section is about estimating what happens after some of them have already been reached. ↩︎

  2. 2047 for “unaided machines outperforming humans in every possible task”, and 2116 for “all human ↩︎

  3. Some have also done extrapolations of Gross World Product, such as David Roodman’s Modeling the Human Trajectory. ↩︎

  4. More details: ↩︎

  5. Technically, the report predicted the arrival of Transformative AI, or TAI, which was defined as having at least as big of an impact as the Industrial Revolution. ↩︎

  6. Rule of thumb inspired by Lindy’s Law: It’s reasonable to guess that a trend will continue for about as long as it’s been going so far. We wouldn’t dream of confidently extrapolating this trend for thirty years, for example. (We do in fact run the model into the 2050s and onward in our Monte Carlos, but we acknowledge that the probability of reality diverging dramatically from the model increases with the duration of the extrapolation.) ↩︎

  7. Peter Wildeford has a model which has the possibility of doublings getting easier or harder, but does not model AI R&D automation or changes to labor or compute growth. ↩︎

  8. See also: Most AI value will come from broad automation, not from R&D | Epoch AI ↩︎

  9. GATE and the Full Takeoff Model also model the progression after full AI R&D automation, but neither of their authors claim that their model is intended to do it well. ↩︎

  10. These estimates are then shaded up to account for capability improvements at the same compute level in addition to efficiency improvements at the same performance level. This adjustment brings the methodology closer to ours, but still we think it’s helpful to focus specifically on research taste skills. And finally, in Davidson and Houlden, everything is converted to the units of gains in the number of parallel workers, which we view as a much less natural unit than research taste quality. ↩︎

  11. Among other advantages of having an integrated model: our model itself already bakes in most of the various adjustments that Davidson and Houlden did ad-hoc to their estimate of r, and we can generally ensure reasonable starting conditions (as opposed to Davidson and Houlden’s gradual boost). ↩︎

  12. Our model operationalizes AC as follows: An AC, if dropped into present day, would be as productive on their own as only human coders with no AIs. That is, you could remove all human coders from the AGI project and it would go as fast as if there were only human coders. The project can use 5% of their compute supply to run ACs. ↩︎

  13. See especially this Anthropic survey of researchers claiming >100% productivity improvements, but also this METR uplift study which found that people systematically overestimate the amount of uplift they were getting from AI assistance. ↩︎

  14. That is, if we think that eventually there will be an AI system which outperforms humans at all horizon lengths, then that means the trend must shoot to infinity in finite time. ↩︎

  15. That is, the part of our model that deals with AI timelines, i.e. the length of the period leading up to the “automated coder” milestone, centrally involves the METR trend. After that milestone is reached, horizon length continues to increase but isn’t directly relevant to the results. The results are instead driven by increases in automated research taste and coding automation efficiency. ↩︎

  16. Our model operationalizes SAR as follows: if dropped into an AGI project in present day, a SAR would be as good at research taste as if there were only human researchers, who were each made as skilled as the top researcher. ↩︎

  17. What do we mean when we say that the gap between a top human researcher and SIAR is 2x greater than that between the median and top human researcher? We mean the following. First, let’s define a transformation between AIs’ capability level b and a number of SDs relative to the median as: ↩︎

  18. Our model operationalizes TED-AI as follows: A TED-AI is an AI system that could, if dropped into the present day & given the resources of a large tech company & three months to prep, fully automate 95% of remote work jobs in the US. It need not be able to do all 95% at the same time (perhaps there isn't enough compute to run enough copies of the TED-AI for that), but it needs to be able to do any 10% of them using only 50% of the US's AI-relevant compute. ↩︎

  19. Our model operationalizes ASI as follows: An ASI would, if dropped into present day & given the resources of a large tech company & three months to prep, be able to fully automate 95% of remote work jobs in the US to the level where it is qualitatively 2x as much above the best human as the best human is above the median professional. Also, here we define “the median professional” not as the actual median professional but rather as what the median professional would be, if everyone who took the SATs was professionally trained to do the task. (We standardize the population that is trained to do the task because otherwise the ASI requirement might be quite different depending on the population size and competence levels of the profession. See above regarding how we define the 2x gap.) ↩︎

  20. Spot-checking in our model: Serial coding labor multiplier is basically the square root of parallel coding labor multiplier, and so when I look at my default parameter settings at the point where serial coding labor multiplier is ~10x (May 2030) the AIs have research taste equivalent to the median AI company researcher. Sounds about right to me. ↩︎

  21. I’ve talked about this elsewhere but I generally think that if you don’t like using a superexponential and insist on an exponential, you need to come up with a different interpretation of what it means for a model to have horizon length X, other than the natural one (“A model has horizon length X iff you are better off hiring a human for coding tasks that take humans much longer than X, but better off using the model for coding tasks that take humans much less than X.”) Because on that interpretation, an exponential trend would never get to a model which outperforms humans at coding tasks of any length. But we do think that eventually there will be a model which outperforms humans at tasks of any length. In other words, on the natural interpretation the trend seems likely to go to infinity in finite time eventually. You can try to model that either as a smooth superexponential, or as a discontinuous phase shift… even in the latter case though, you probably should have uncertainty over when the discontinuity happens, such that the probability of it happening by time t increases fairly smoothly with t. ↩︎

  22. For example, I want to think more about serial speed bottlenecks. The model currently assumes experiment compute will be the bottleneck. I also want to think more about the software-only-singularity conditions and whether we are missing something there, and square this with soft upper bounds such as “just do human uploads.” ↩︎

  23. Note that with the new model, we’ve moved toward using Automated Coder (AC) as the headline coding automation milestone, which has a weaker efficiency requirement. ↩︎

  24. That said, we note that the benchmarks and gaps version had longer median SC timelines (Dec 2028). And Eli’s all-things-considered SC median was further still in 2030, though Daniel’s was 2028. ↩︎

  25. That said, we still think that the AI Futures Model gives too low a probability of <10 year takeoffs, because we are not modeling growth in compute due to hardware R&D automation, hardware production automation, or broad economic automation. ↩︎

  26. As discussed here, the AI 2027 model set r=2.77 and 1.56 at different points. b=2^(1/r-1), so b=0.64 to 0.78. ↩︎

  27. See here for a more thorough explanation of how b is calculated from our new model’s parameters. ↩︎

  28. 2^((1/2)-1) gives roughly 0.7. See how we got these numbers here. ↩︎

  29. 2^((0.315/0.248)-1). See the justification for this formula on our website. ↩︎

  30. Note that the minimum b in our model is 0.5. This is a limitation, but in practice, we can still get very fast takeoffs. For example, if b were 0.5 and didn’t change over time, this would lead to a finite-time singularity after twice the initial uplift doubling time. ↩︎

  31. This could also be influenced by the uplifts being different for different milestones, or other factors. Unfortunately we haven’t had a chance to do a deep investigation, but a shallow investigation pointed toward compute increases being the primary factor. ↩︎



Discuss

Lumenator 2.0

December 31, 2025 - 23:48
Published on December 31, 2025 8:48 PM GMT

Late in 2019, I, like many of my rationalist friends, purchased the parts for and assembled a genuine, bona fide LUMENATOR™️ - a device for greatly increasing the brightness of your home - according to the original specification. To me, lumenators are the quintessential application of the More Dakka mindset: when you face a problem that responds positively to a little bit of X and responds even more positively to larger amounts of X, you don't just stop applying X once you feel you've done a reasonable amount of it, you add more and more and more until your problem goes away or X stops working. I built a lumenator not for seasonal affective disorder, but because it helps me wake up feeling refreshed in the morning and I feel happy when my house is very bright inside. In 2019 I lived in a small group house in Waterloo, ON, and we'd often give people directions to our house like "turn the corner and then look for the one with ridiculously bright light streaming from the windows". They'd show up without trouble and remark: "Wow, I didn't actually expect I'd be able to find your place based on those directions".

I've brought my lumenator with me through 5 changes of address and kept using it until a few months ago. More recently I've felt that despite trying really hard, as a community we didn't More Dakka hard enough. When you really push the envelope on luminance there are a few limiting factors you run into: cost, power, and heat dissipation. Luckily for us, there's an industry that has massively ballooned in the days since Eliezer's original post and has created an industrial-scale demand signal for light sources that are super bright, about as compact as possible without being a fire hazard or requiring active cooling, and emit light that is spectrally similar to sunlight. Want to take a guess?

marijuana

The idea: mount one of these lights directly above my bed, put something in between to diffuse the light coming from the many tiny LEDs, and put it on a timer so it gradually brightens around the time I want to wake up. Here's my build:

  • $210: passively cooled 200 Watt lamp: SPIDER FARMER SF2000Pro, Full Spectrum Plant Grow Light, Dimmable
    • I was really tempted to go for the larger 450W or 650W options but saner minds prevailed (for now).
    • Note: I probably don't recommend buying this product for this purpose due to an insufficiently low minimum brightness, read on for details.
  • $70: Spider Farmer GGS Controller Kits (for timer based schedule)
    • I'm sure you could DIY a replacement for this, but I don't have time for that :)
  • $13: Photography Diffuser Fabric 78.7 x 59 Inches
    • Empirically, the fabric is almost imperceptibly warmed when mounted ~1.5 ft. from the light for several hours of continuous use, so I think there's minimal risk of starting a fire.
    • I also tried Diffusion Gels Filter Sheet Kit 15.7x19.6inches but these were too small. I found gel filter sheets to be significantly better at diffusing without attenuating though, so I'd shop for a larger version of this next time around.
  • ~$35: ceiling hooks to mount the light, and to mount the diffusion fabric, a grommeting kit, some fishing line, and a few command hooks.
    • I'd recommend you anchor your hooks in something stronger than drywall so you don't need to find out what it's like to be woken up by an 8 pound piece of metal falling on your face (I too am blissfully unaware of this).

Total: ~$330

At 200W the lamp offers a PPF of 540 µmol/s, but we're not plants and our eyes perceive some wavelengths as more or less bright. Accounting for luminous efficiency and the lamp's spectrum the manufacturer estimates we get about 53 lumens per µmol/s, or a total luminous power of about 30,000 lumens. With similar calculations Claude estimates the illuminance as about 4,000 lux @ 6 ft. or 33,000 lux @ 2 ft. Not bad at all!
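
As a sanity check, here's that arithmetic in a few lines of Python, using only the manufacturer's figures quoted above (540 µmol/s at 200W, and roughly 53 lumens per µmol/s for this lamp's spectrum); the lux numbers depend on geometry and the panel's beam pattern, so I've left those to Claude's estimate.

```python
# Back-of-the-envelope check of the brightness numbers quoted above.
ppf_umol_per_s = 540    # photosynthetic photon flux at 200 W (manufacturer spec)
lumens_per_umol = 53    # approximate lm per umol/s for this lamp's spectrum

total_lumens = ppf_umol_per_s * lumens_per_umol
print(f"~{total_lumens:,} lm")  # ~28,620 lm, i.e. roughly 30,000 lumens
```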

Here's what it looks like without the diffusion filter:

And with:

Anecdotally it feels really bright, the pictures don't do it justice. I've configured it to turn on in the morning at minimum brightness and then increase to maximum over ten minutes. At maximum it doesn't feel quite like sunlight but doesn't feel like normal indoor lighting either; it feels more like basking in indirect sunlight in a cozy glade on a crisp summer day. My bedroom has pot lights installed that guests regularly complain about for being too bright, and if the lumenator is on you can barely tell the difference when I turn them on.

There's only one problem: the device can be set to brightness levels between 11% and 100% but not below 11%, and it turns out that 11% is still really bright! Bright enough to wake me up instantly when it clicks on. I'll be looking around for a similar light with more dynamic range at the low end.

Overall, it's been a very fun experiment and I'll likely continue using it despite the 11% problem because it feels really nice. If you're interested in trying it out for yourself I'd be happy to post more detailed instructions. Let me know.



Discuss

The Plan - 2025 Update

December 31, 2025 - 23:10
Published on December 31, 2025 8:10 PM GMT

What’s “The Plan”?

For several years now, around the end of the year, I (John) write a post on our plan for AI alignment. That plan hasn’t changed too much over the past few years, so both this year’s post and last year’s are written as updates to The Plan - 2023 Version.

I’ll give a very quick outline here of what’s in the 2023 Plan post. If you have questions or want to argue about points, you should probably go to that post to get the full version.

So, how’s progress? What are you up to?

2023 and 2024 were mostly focused on Natural Latents - we’ll talk more shortly about that work and how it fits into the bigger picture. In 2025, we did continue to put out some work on natural latents, but our main focus has shifted.

Natural latents are a major foothold on understanding natural abstraction. One could reasonably argue that they’re the only rigorous foothold on the core problem to date, the first core mathematical piece of the future theory. We’ve used that foothold to pull ourselves up a bit, and can probably pull ourselves up a little further on it, but there’s more still to climb after that.

We need to figure out the next foothold.

That’s our main focus at this point. It’s wide open, very exploratory. We don’t know yet what that next foothold will look like. But we do have some sense of what problems remain, and what bottlenecks the next footholds need to address. That will be the focus of the rest of this post.

What are the next bottlenecks to understanding natural abstraction?

We see two main “prongs” to understanding natural abstraction: the territory-first prong, and the mind-first prong. These two have different bottlenecks, and would likely involve different next footholds. That said, progress on either prong makes the other much easier.

What’s the “territory-first prong”?

One canonical example of natural abstraction comes from the ideal gas (and gasses pretty generally, but ideal gas is the simplest).

We have a bunch of little molecules bouncing around in a box. The motion is chaotic: every time two molecules collide, any uncertainty in their velocity is amplified multiplicatively. So if an observer has any uncertainty in the initial conditions (which even a superintelligence would, for a real physical system), that uncertainty will grow exponentially over time, until all information is wiped out… except for conserved quantities, like e.g. the total energy of the molecules, the number of molecules, or the size of the box. So, after a short time, the best predictions our observer will be able to make about the gas will just be equivalent to using a Maxwell-Boltzmann distribution, conditioning on only the total energy (or equivalently temperature), number of particles, and volume. It doesn’t matter if the observer is a human or a superintelligence or an alien, it doesn’t matter if they have a radically different internal mind-architecture than we do; it is a property of the physical gas that those handful of parameters (energy, particle count, volume) summarize all the information which can actually be used to predict anything at all about the gas’ motion after a relatively-short time passes.
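
(For concreteness, the limiting prediction here is the standard Maxwell-Boltzmann form: conditional on the conserved quantities, each molecule's velocity is distributed as

$$ p(\vec v) \;\propto\; \exp\!\left(-\frac{m \lVert \vec v \rVert^2}{2 k_B T}\right), $$

where the temperature T is pinned down by the total energy and particle count; nothing else about the initial conditions enters.)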

The key point about the gas example is that it doesn’t talk much about any particular mind. It’s a story about how a particular abstraction is natural (e.g. the energy of a gas), and that story mostly talks about properties of the physical system (e.g. chaotic dynamics wiping out all signal except the energy), and mostly does not talk about properties of a particular mind. Thus, “territory-first”.

More generally: the territory-first prong is about looking for properties of (broad classes of) physical systems, which make particular abstractions uniquely well-suited to those systems. Just like (energy, particle count, volume) is an abstraction well-suited to an ideal gas because all other info is quickly wiped out by chaos.

What’s the “mind-first prong”?

Here’s an entirely different way one might try to learn about natural abstraction.

Take a neural net, and go train it on some data from real-world physical systems (e.g. images or video, ideally). Then, do some interpretability to figure out how the net is representing those physical systems internally, what information is being passed around in what format, etc. Repeat for a few different net architectures and datasets, and look for convergence in what stuff the net represents and how.

(Is this just interpretability? Sort of. Interp is a broad label; most things called “interpretability” are not particularly relevant to the mind-first prong of natural abstraction, but progress on the mind-first prong would probably be considered interp research.)

In particular, what we’d really like here is to figure out something about how patterns in the data end up represented inside the net, and then go look in the net to learn about natural abstractions out in the territory. Ideally, we could somehow nail down the “how the natural abstractions get represented in the net” part without knowing everything about what natural abstractions even are (i.e. what even is the thing being represented in the net), so that we could learn about their type signature by looking at nets.

More generally: the mind-first prong is about looking for convergent laws governing how patterns get “burned in” to trained/evolved systems like neural nets, and then using those laws to look inside nets trained on the real world, in order to back out facts about natural abstractions in the real world.

Note that anything one can figure out about real-world natural abstractions via looking inside nets (i.e. the mind-first prong) probably tells us a lot about the abstraction-relevant physical properties of physical systems (i.e. the territory-first prong), and vice versa.

So what has and hasn’t been figured out on the territory prong?

The territory prong has been our main focus for the past few years, and it was the main motivator for natural latents. Some key pieces which have already been nailed down to varying extents:

  • The Telephone Theorem: information which propagates over a nontrivial time/distance (like e.g. energy in our ideal gas example) must be approximately conserved (a rough gloss of why is sketched just after this list).
  • Natural Latents: in the language of natural latents, information which propagates over a nontrivial time/distance (like e.g. energy in our ideal gas example) must be redundantly represented in many times/places - e.g. we can back out the same energy by looking at many different time-slices, or roughly the same energy by looking at many different little chunks of the gas. If, in addition to that redundancy, that information also mediates between time/space chunks, then we get some ontological guarantees: we’ve found all the information which propagates.
  • Some tricks which build on natural latents:
    • To some extent, natural latent conditions can nail down particular factorizations of high level summaries, like e.g. representing a physical electronic circuit as a few separate wires, transistors, etc. We do this by looking for components of a high-level summary latent which are natural over different physical chunks of the system.
    • We can also use natural latent conditions to nail down particular clusterings, like in A Solomonoff Inductor Walks Into A Bar.
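
As a rough gloss of the Telephone Theorem (our paraphrase, not the formal statement): if we model successive time/space slices of the system as a Markov chain $X_1 \to X_2 \to \cdots$, then the data processing inequality gives

$$ I(X_1; X_{n+1}) \;\le\; I(X_1; X_n), $$

so the information about the far end which survives to long range is whatever eventually stops being lost, i.e. whatever is (approximately) conserved from slice to slice.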

… but that doesn’t, by itself, give us everything we want to know from the territory prong.

Here are some likely next bottlenecks:

  • String diagrams. Pretty much every technical diagram you’ve ever seen, from electronic circuits to dependency graphs to ???, is a string diagram. Why is this such a common format for high-level descriptions? If it’s fully general for high-level natural abstraction, why, and can we prove it? If not, what is?
  • The natural latents machinery says a lot about what information needs to be passed around, but says a lot less about how to represent it. What representations are natural?
  • High level dynamics or laws, like e.g. circuit laws or gas laws. The natural latents machinery might tell us e.g. which variables should appear in high level laws/dynamics, but it doesn’t say much about the relationships between those variables, i.e. the laws/dynamics themselves. What general rules exist for those laws/dynamics? How can they be efficiently figured out from the low level? How can they be efficiently represented in full generality?
  • How can we efficiently sample the low-level given the high-level? Sure, natural latents summarize all the information relevant at long distances. But even with long-range signals controlled-for, we still don’t know how to sample a small low-level neighborhood. We would need to first sample a boundary which needs to be in-distribution, and getting an in-distribution boundary sample is itself not something we know how to do.

And what has and hasn’t been figured out on the mind prong?

The mind prong is much more wide open at this point; we understand it less than the territory prong.

What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.

We have some foundational pieces:

  • Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)
  • Natural latents and the Telephone Theorem might also be relevant insofar as we view the net itself as a low-level system which embeds some high-level logic. But that also doesn’t get at the core question about how environment structure gets represented in the net.
  • There’s a fair bit to be said about commutative diagrams. They, again, don’t directly address the core representation question. But they’re one of the most obvious foundational tools to try, and when applied to neural nets, they have some surprising approximate solutions - like e.g. sparse activations.

… but none of that directly hits the core of the problem.

If you want to get a rough sense of what a foothold on the core mind prong problem might look like, try Toward Statistical Mechanics of Interfaces Under Selection Pressure. That piece is not a solid, well-developed result; probably it’s not the right way to come at this. But it does touch on most of the relevant pieces; it gives a rough sense of the type of thing which we’re looking for.

Mostly, this is a wide open area which we’re working on pretty actively.



Discuss

Safety Net When AIs Take Our Jobs

December 31, 2025 - 23:05
Published on December 31, 2025 8:05 PM GMT

I'm analyzing what happens to the US economy in the short-term aftermath of the typical job being replaced by AIs and robots. Will there be a financial crisis? Short answer: yes.

This is partly inspired by my dissatisfaction with Tomas Pueyo's analysis in If I Were King, How Would I Prepare for AI?.

Let's say 50% of workers lose their jobs at the same time (around 2030), and they're expected to be permanently unemployed. (I know this isn't fully realistic. I'm starting with simple models and will add more realism later.)

I'll assume that AI starts making the world more productive around the same time that this job loss occurs, but that big innovations such as cheap cancer cures or the ability to conquer the world are still far enough in the future that financial markets aren't ready to price them in.

These assumptions are designed to help me analyze the effects of job loss with minimal complications from other effects of AI. I'm focused here on the short-term financial and political consequences of job losses. There will be some radically different longer-term consequences, but I'm only analyzing those here to the extent that I expect markets to reflect them at the time of the job losses.

This post is merely an outline of what a rigorous analysis would look like. It's good enough for informing my investment strategies, but not for persuading politicians to adopt better policies.

Note that this will be one of my least readable blog posts. Most of you should start by reading the conclusion, and only reading the rest if you're tempted to argue with my conclusions.

If you still think my conclusions are wrong, you can find some more detailed explanations of my reasoning in this conversation with Gemini.

Note that I'm targeting this at readers with a significant background in finance. Please question the details of my analysis, and produce competing guesses based on answering similar questions.

Conclusions

I expect turmoil similar to that of the pandemic. My median guess is that it will be somewhat less sudden than the crash of March 2020, and that markets will mostly recover in one to two years (assuming we have years before something more dramatic happens).

The financial turmoil is likely to induce political instability. I find that hard to predict.

The US government will need to be more competent than it was during the pandemic in order to avoid hyperinflation or defaulting on its debt.

The magnitude of the turmoil will likely be heavily influenced by hard-to-predict expectations.

Maybe a bright spot is that a financial crash could slow capability advances at roughly a time of near-maximum risk related to AI alignment. But that might be offset by politicians being too distracted to do anything competent about alignment.

I'm surprised at how much my outlook fluctuated while writing this post, between optimism and despair, before settling on an intermediate mood.

The process of writing this post convinced me to (slowly) start selling my remaining (small) positions in bank stocks. I'll be less willing to sell my stocks in gold mining companies. I'll probably be more willing to sell some of my other stocks when I've guessed that they've reached bubble levels, rather than hoping to sell close to the peak.

See my blog for the full post.



Discuss

2025 Year in Review

December 31, 2025 - 22:50
Published on December 31, 2025 7:50 PM GMT

It’s that time. It’s been a hell of a year.

At the start we barely had reasoning models. Now we have Claude Code and Opus 4.5.

I don’t code. Yet now I cause code to exist whenever something about a website annoys me, or when I get that programmer’s realization that there’s something I am planning on doing at least three times. Because why not?

The progress has simultaneously been mind-bogglingly impressive and fast. But a lot of people don’t see it that way, because progress has been incremental, and because we were reasonably expecting to often get even more than this.

The public conversation and debate, even more than before, was full of false narratives and active attempts to make the situation worse. The same goes for attempts to shape Federal policy towards AI, and OpenAI’s conversion into a for-profit.

It’s been, as they say, one battle after another, with many wins, many setbacks and a lot of things in between.

This includes the key developments in AI, and also other blog posts from the year that I consider memorable looking back.

This is only our corner of the world’s Year in Review, not one in general, thus things like Liberation Day are happening in the background and go undiscussed.

January

The confusions started in January, as we prepared for Trump to take office.

OpenAI had just given us o1-preview, the first reasoning model.

At the tail end of 2024, DeepSeek released v3, or The Six Million Dollar Model. This was a big advancement in open source and Chinese model capabilities, and showed that they were not as far behind as we thought they were, and also that damn good models could be trained on the cheap. Not as cheap as the headline number, since the six million was only direct costs of the final run, but still pretty cheap.

Then a few weeks later, DeepSeek gave us r1, a reasoning model based on v3. They wrapped this up into a nice clean free app experience, which included the first time most people could see a reasoning model’s chain of thought – Gemini Flash Thinking offered this too but almost no one knew about that or cared. This showed that the ‘secret sauce’ of building a reasoning model was not so difficult to copy, and the marginal costs of doing so were low.

DeepSeek shot to the top of the App store, and the world completely lost its mind. The stock market mini-crashed. People talked about how China had ‘caught up’ to America, or this meant inference would be so cheap no one would need Nvidia chips (as consumers rushed out to buy Nvidia chips to run DeepSeek r1), or how it would destroy margins and drive American AI out of business. I had to warn people, many times, with the classic advice: Don’t Panic, and I went on Odd Lots to discuss it all.

Collectively this was called The DeepSeek Moment.

White House rhetoric talked about how this meant we were in a ‘race’ with China, so of course any other considerations than ‘winning’ must be thrown out the window.

With time, those paying attention realized all of that was overblown. DeepSeek was impressive as a lab, and v3 and r1 were excellent models, but still on the order of eight months behind OpenAI, Anthropic and Google. We had been comparing the relatively best features of r1 on their own, and then using that to project into the future, which flat out did not happen. This happened at a crucial inflection point, right when reasoning models had started, which was when a tiny amount of compute could go a maximally long way.

Later on, r1-0528 did not have a moment, nor did DeepSeek 3.1 or DeepSeek 3.2.

February

Google started out the month introducing us to Deep Research, a new product form that would be copied by OpenAI, allowing the AI to take time to prepare a report. At the time, this was super impressive. It definitely has its uses, even if the timing is awkward and you have to push past the tendency to pad reports with a lot of slop.

A new paper on The Risk of Gradual Disempowerment From AI improved the debate by highlighting a central way that humans end up not being in charge. There doesn’t need to be some ‘AI coup’ or battle, the AIs will by default end up with more and more resources and power unless something stops this from happening. One day we wake up and realize we are not in control. Another day after that we don’t wake up.

OpenAI declared that its primary alignment strategy would be Deliberative Alignment, so I analyzed that approach. I think it is helpful, but not a central solution.

The Administration made its AI feelings clear at The Paris AI Anti-Safety Summit. Previous summits had been efforts to lay foundation for international cooperation, with serious discussions of existential risks, in particular with The Bletchley Declaration. That was clearly over, replaced by disdain for the idea that sufficiently advanced AI could be existentially dangerous, and by Vance giving a speech demanding suicidal accelerationism and warning against attempts to not die.

The year would play out in similar fashion. We had some modest success in California and New York, but the White House would, under the influence of David Sacks, become an active force for interference with efforts to not die, and later even to beat China. They would do some pro-America things along the way, but also things that actively interfered with our competitiveness.

I introduced a key new concept handle which I call Levels of Friction. Different actions are variously harder or easier, from both practical and legal perspectives, to do. They range from Level 0 (defaults or requirements), to Level 1 (legal and ubiquitous and easy), Level 2 (safe but annoying), Level 3 (actively tricky or risky), Level 4 (actually seriously illegal) up to Level 5 (we really care about stopping you). Instead of thinking of a boolean of legal-illegal or possible-impossible, it is often more enlightening to consider moving between levels.

AI is going to move a lot of things to lower levels of friction. That is by default good, but frictions can be load bearing, such as with job applications or limiting antisocial behaviors. It protects the commons. We will have to adjust quite a lot of things once key frictions are removed from the system.

February was the peak of ‘could Grok be a thing?’ It turned out not to be a thing. In other model news we got Claude 3.7.

We also got our first introduction to Emergent Misalignment, the idea that training the AI to do bad things associated with evil could lead it to generalize into thinking of itself as trope-style evil and doing a wide range of trope-style evil things.

March

A non-AI highlight was my piece on elementary education, School Is Hell.

GPT-4.5 was OpenAI’s attempt to give us a large and slow model. It did some cool things, and there are people that really liked it, but mostly it wasn’t worthwhile.

A big part of AI coverage is getting confident in dismissing hype. A great example of this was my coverage of The Manus Marketing Madness. Now that they’ve unceremoniously sold out to Meta, it’s easy to forget that a lot of people were hyping Manus as The Next Big Thing, as well as the next reason we would Lose To China.

I warned against using The Most Forbidden Technique, which is where you use interpretability to train on intermediate outputs, to teach it to think the thoughts you want it to think, thus teaching the AI to, like humans before it, hide its thinking.

Image generation had its first big moment, when the 4o image generator came online and everyone went Studio Ghibli crazy, taking advantage of both the advancement in quality and willingness to mimic styles.

Gemini 2.5 Pro came out, which I called the new state of the art. I think this was correct at the time, but later versions of Gemini 2.5 Pro were actively worse, and soon OpenAI would be back out ahead.

April

AI 2027 provided an illustrative scenario that presented a best guess as to what was likely to happen, with an alternative scenario option where things turn out well because a bold decision is made to slow down at a key moment. Scott Alexander and Daniel Kokotajlo explained the details on the Dwarkesh podcast, and I covered various responses.

Llama 4 was released, and turned out to be a total dud. Meta has been silent since in terms of topline AI products, while spending hundreds of millions on individual pay packages to try and gather the talent to get back in the game. It is a good thing Meta is struggling, given its bizarrely dystopian AI vision it is willing to give in public.

o3 put OpenAI firmly back out in front in reasoning, with excellent tool use, but was rapidly exposed as a Lying Liar that lies a lot.

OpenAI had other problems with GPT-4o. It was always an absurd sycophant that could get some of its users into trouble, but updates made around this time made it even more of an absurd sycophant, forcing a reversion to a previous build. I would later offer a postmortem.

May

OpenAI claimed that their conversion to a for-profit, which as announced then would clearly have been one of the biggest thefts in human history, would leave the non-profit in control.

The White House had from the beginning made a huge deal out of how Just Awful the Biden diffusion rules were, just like it talks about everything Biden did, but it initially acted generally wisely on chip diffusion and export controls, including on the H20.

Alas, over time David Sacks got more control over their narrative and increasingly started spouting Obvious Nonsense About AI Diffusion, literally claiming that ‘beating China’ means maximizing Nvidia’s share of chip sales, and warning that China would step in with non-existent and otherwise greatly inferior AI chips to build its own ‘AI tech stack’ if we didn’t sell massive compute to partners with questionable loyalties. Initially this rhetoric and action was confined to sales to parties like UAE and KSA, where a case can be made if the deals and safeguards are good, and details matter. Later this would extend to trying to sell chips to China directly.

OpenAI released Codex to compete with Claude Code. Claude Code was such a stealth release, initially a side project of one employee, that it took a while to notice something was happening, and even longer for me to finally give it a try. Nowadays Claude Code might be most of my AI token usage.

Claude 4 put Anthropic back in the game.

I offered thoughts on those who use AI to cheat, especially in education.

Veo 3 gave Google the lead in video generation.

I wrote my first ‘Letting Kids Be Kids’; I would later write another in December.

June

Dating Roundup #6 proved popular, and #7 did solidly too. I just put out #8 and #9.

I did an analysis of New York’s proposed RAISE Act, by Alex Bores who is now running for Congress. I concluded it was an excellent bill. It would later pass, although in somewhat weakened form because of Governor Hochul’s changes.

OpenAI and in particular Sam Altman continued to try and sell us on the concept of a Gentle Singularity, that AIs would become superintelligent and your life wouldn’t much change. This is of course Obvious Nonsense. Your life might become great, or it might end, or it might get into High Weirdness, but it won’t stay the same.

o3 Pro came out, and was very strong and not the lying liar that normal o3 was.

I came out with my (hopefully annual from here on in) blog recommendations.

July

The first attempt to pass a federal moratorium on AI regulation, as in tell the states they aren’t allowed to regulate AI because that should be federal while also not regulating AI at the federal level, came dangerously close to passing as part of the BBB. It was ultimately stripped out 99-1 once the tide had turned.

Congress had one of its finer hearings, where they asked good questions about AI.

Grok ran into trouble. No, Grok, No. Do not call yourself MechaHitler. Or worse.

Kimi K2 was an unusually impressive new open Chinese model. We would later get Kimi K2 Thinking in November.

Google and OpenAI got IMO Gold.

AI companions were getting a lot of attention, which has since died down. This will be a big thing at some point, and for some it is a very real thing, but for now it isn’t good enough to hold most people’s interest. I followed up again in August.

August

The big hyped release of the year was of course GPT-5. This would be their big moment to unify all their crazy model variations and names, and create one model to rule them all, with a router to think longer if and only if that was worthwhile. There were approaching death stars and we saw a variety of assertive valueposting. It was the big version number jump, and people expected a lot.

GPT-5 was a good model, and I found it to be a clear upgrade, but it very much did not live up to the hype. Many even strongly wanted to keep GPT-4o for its far friendlier and more empathic attitude, or some would say its sycophancy – the very features that make GPT-4o not a great thing for many users are alas the reasons users often like it so much. I covered the basic facts and model card, then outside reactions and finally created a synthesis.

Unfortunately, the fact that the model OpenAI chose to call GPT-5 was a disappointing release gave so many people, up to and including David Sacks and Sriram Krishnan at the White House, the wrong idea. There is a constant demand for data points that say AI won’t advance much, that scaling is dead, that it will all be a normal technology and you don’t have to worry about AGI. Washington seems to have come away from the GPT-5 release with this message, and it plausibly did great harm in numerous ways, including to our export controls.

I tried to push directly back against this, pointing out that AI was continuing to make rapid progress, both around GPT-5 and various other misleading data points, especially the no-good, very-bad ‘MIT study.’ I followed up by pointing out that Yes, AI Continues To Make Rapid Progress, Including Towards AGI.

I noticed I was deeply confused about AI consciousness, along with everyone else. I still am, except now I’m more confused at a better, more advanced level. These questions are coming up more and more now, and I expect that to continue.

It’s so funny to have half of people debating AI consciousness, while the other half thinks AI is not making any progress.

I offered my advice around flying.

Are the AIs starting to take our jobs? Not in general, but for entry level jobs? Kinda.

September

I reviewed If Anyone Builds It, Everyone Dies. There were a few weeks where this inspired a lot of discussion, much of it remarkably good.

The month ended with Anthropic reclaiming its role as my daily driver thanks to Claude Sonnet 4.5.

There was more on AI craziness, then later in November we would see additional lawsuits against OpenAI related to suicides.

October

OpenAI meanwhile decided to release Sora and The Big Bright Screen Slop Machine, attempting to turn its good short video generator into a dystopian social network. I said the comparables were Google+ and Clubhouse. Call looks good.

I got to go to The Curve, which was an excellent conference.

One of the consequences of the GPT-5 release was more people talked about AI as potentially being in a bubble. I do not agree, other than in the nominal ‘number might go down’ sense. Number might go down; if not, number needs to go up.

OpenAI completed its trio of overhyped releases with the Atlas browser. This jaded people sufficiently that when GPT-5.1 and GPT-5.2 later came out, people gave them remarkably little focus.

Andrej Karpathy went on Dwarkesh Patel and cautioned us not to get overexcited.

The biggest advantage America has over China is its access to vastly more compute. This is thanks in large part to our export controls. Alas, David Sacks, the AI Czar, acts like a de facto Nvidia lobbyist, and is trying to make us give that edge away.

Emboldened by prior success in getting authorization for H20 sales, Nvidia and David Sacks made their move, and came (based on what I know) remarkably close to getting America to commit quite a lot of civilizational suicide and sell B30A chips to China, essentially giving them close to chip parity. This would have been a completely insane move, and we should be thankful a combination of key people stepped up and prevented this from happening.

Unfortunately, although far less unfortunately than if we’d sold B30As, they then regrouped and in December would successfully push, despite it being obviously unwise and unpopular, for us to sell H200s to China. The Chinese are making a show of not wanting them so much, but it’s a show, and our edge has been substantially eroded. The logic behind this seems to have been nominally based in part on a prediction that Huawei can scale chip production far faster than credible predictions say, as in being off by an order of magnitude or more.

OpenAI finished its conversion to a for-profit, completing what I believe is arguably the second largest theft in human history behind the Russian oligarchs of the 1990s. The final terms came as the result of negotiations with the Attorneys General of Delaware and California, and they did extract a lot of highly meaningful concessions, both in terms of compensation and also in helping retain meaningful control and oversight over OpenAI. This could have gone so much worse. But as I said, that’s like a mugger demanding your money, getting talked down to taking only half of it, and then claiming they ‘recapitalized you.’ You’re still out half of your money.

November

We got what may be the final key revelations of what I call OpenAI’s Battle of the Board, where the board attempted to fire Sam Altman, as we got Ilya Sutskever’s testimony about what happened. We now know that this was driven by Ilya Sutskever and Mira Murati, and was motivated by ordinary business concerns, centrally Sam Altman’s lying and mistreatment of employees.

I offered my 2025 edition of The Big Nonprofits Post, for those looking to donate, and would later share an update from my nonprofit, Balsa Research.

The year would finish with a flurry of new model releases.

OpenAI started us off with GPT-5.1, a modest upgrade that follows custom instructions well and often glazes the user, and then followed it up with GPT-5.1-Codex-Max, which was a substantial boost in coding power in particular.

Google gave us Gemini 3 Pro, a vast intelligence with no spine and also severe alignment issues and mental problems. It’s a great model, and was clearly now the best for a variety of uses, especially raw intelligence, or as a teacher, or when you had questions with known answers of the sort you would ask an autist.

Anthropic then gave us the big one, Claude Opus 4.5, which is for now the clear best model available, and remains my daily driver, both for chat and also in Claude Code.

Claude Opus 4.5 felt like a large practical leap, some like Dean Ball going so far as to call it AGI. I don’t agree but I understand where they are coming from.

December

I went to San Francisco for the Solstice, and wrote Little Echo.

I did the annual movie review.

We learned even more reasons to beware reward mismatches in RL.

OpenAI upgraded again to GPT-5.2, which I evaluated as Frontier Only For The Frontier. Its impressive benchmarks do not reflect its capabilities, and people reacted with fatigue after too many disappointing OpenAI model releases. It’s not an especially ‘fun’ model to interact with, nor is it especially fast, and it currently occupies a sweet spot only for tasks where you need a lot of raw thinking capability and are looking for ‘just the facts’ and cold analysis, and potentially for coding where everyone serious should try various models to see what works best for their tasks.

I offered a sequence of posts on why median wages are up, economists keep saying times are solid, yet young people keep saying things suck. Those complaining often say false things and use statistics wrong, but if so many people think things suck, then you know there’s a problem. I looked into cost changes over time, and when were various things the best. Finally, I presented my thesis, which was that this was due to the Revolution of Rising Expectations and the Revolution of Rising Requirements. Our expectations and comparison points are supremely high, as are the things we legally require of those looking to raise families.

Questions For Next Season

AI is going gangbusters. The news about it is accelerating, not slowing down. It’s going to increasingly impact our lives and be the topic of conversation. The model releases will come fast and furious. The agents will make big leaps in 2026, and not only for coding. It will likely be a major topic in the midterm elections. I don’t expect full High Weirdness in 2026, but you can’t fully rule it out.

Blog growth, in terms of views, stagnated this year. That’s disappointing, as previously I had experienced strong growth, and I likely need to explore additional ways to get the word out. But ‘number go up’ was never the ultimate goal and I am confident that I am directly reaching quite a lot of the people I care about reaching. I do intend to send out a user survey some time in the near future.

One big personal goal for 2026 is to do more coding and evergreen posting, going deeper into questions that matter or that I get curious about, and being better about organizing my thoughts, and to focus less on ephemeral items and news, and to finally get a handle on organizing what I do have to better create longer term resources. I am fully aware that almost all views happen within a few days of posting, but that doesn’t need to dictate anything, and there are some basic things where I could build permanent resources much better than I’ve been doing.

The other big goal is to focus on what matters, including the fights and debates that matter, making sure to do that in a way that adds to permanent resources and not let important things end up buried. I have to do better triage, especially in letting relatively unimportant matters drop. I intend to publish fewer words on the blog in 2026, and with that to become more willing to skip days. I know the amount of content can be overwhelming.

One thing that got lost in the shuffle this year, and illustrates the problem, was my planned review of Open Socrates. It’s a book warning you not to live your life 15 minutes at a time, and I didn’t finish my response because life kept throwing too much stuff at me. Well, that’s kind of the worst possible excuse not to finish that, isn’t it? Even if because of the delay I ultimately have to reread a lot of the book.

I also have a bunch of projects I’d love to try. We’ll see how that goes. But also movies to watch, and games to play, and people to see, and fun to be had. Life beckons.

And you know what? Life is pretty awesome. Other people sing Auld Lang Syne. I go to the Secular Solstice. My personal tradition, at year’s end, is something else entirely.

Happy New Year, everyone.



Discuss

Uncertain Updates: December 2025

December 31, 2025 - 19:20
Published on December 31, 2025 4:20 PM GMT

2025 was a rough year for me. My mom died. My cat died. I suffered a concussion, and I had to deal with a few other health issues.

But it was also a good year. I curated my mom’s art. I built an AI oracle. I wrote 37 blog posts, gained 500 MMR in DoTA2, lost 10 pounds, volunteered at 2 conferences, and revised 5 book chapters to make them much, much better. And none of that is to mention all the quality time I got to spend with friends and family and all the cool places I got to visit.

Year boundaries are a good time for setting goals. Here are mine for 2026:

  • finish revisions on Fundamental Uncertainty and get it into print

  • run a conference at Lighthaven (details still in the works, more to come)

  • continue to do whatever I usefully can to prevent existential catastrophes

  • live my life well and love all the people in it

Although I have plenty of reason to worry for the future, I’m generally hopeful, and I look forward to seeing how things unfold in the year to come!



Discuss

Grading my 2022 predictions for 2025

December 31, 2025 - 18:45
Published on December 31, 2025 3:45 PM GMT

Three years ago, back in 2022, I wrote "A Tentative Timeline of The Near Future (2022-2025) for Self-Accountability." Well, 2025 is almost over now, so let's see how well I did! I'll go over each individual prediction, and assign myself a subjective grade based on how close I got to the truth. 

Predictions for 2022
  • Post written by AI with minimal prompting reaches 30+ upvotes on LessWrong
    • Score: probably D. I didn't see any high-karma posts from 2022 which were obviously AI-generated, but frankly, I didn't look very hard. I remember reading a few experimental AI-generated posts, but they were all downvoted pretty badly at the time. There were a lot of posts which included smaller excerpts from AI text, but that's not really what I was aiming for, so I'll say I failed this prediction.
  • AI can regularly fool a randomly-selected (from American population), non-expert judge in a 10-minute Turing test.
    • Score: D-. What in the world was I thinking with this one?? I suspect I severely over-updated on stories like Blake Lemoine claiming Google's AI was sentient, not realizing that a chatbot seeming "intelligent" is very different from an AI seeming "human" to people. I think we've passed this point by now in 2025 (so I won't give myself an F), but I was a few years too early.
Predictions for 2023
  • AI reaches human expert level at MATH benchmark.
  • Famous, well-respected public intellectual announces that they believe AI has reached sentience, deserves rights.
    • Score: C-. By this point, a few famous (or newly famous) people (most notably Blake Lemoine in late 2022) were claiming AI sentience, but as far as I can tell, none of them were particularly "well-respected" or considered serious "public intellectuals" by normative standards. I'd say it's an edge-case if I passed this one or not.
  • AI can now write a book with a mostly consistent plot, given roughly a page of prompting or less.
    • Score: A+. I actually thought that I'd failed this one, but I looked it up, and surprisingly (to me), it seems AI was in fact capable of this by 2023! See, for instance, Death of an Author, a novella supposedly written 95%+ by ChatGPT, and described by New Scientist as "not awful." High praise indeed...
  • "Weak" AGI is announced that can play a randomly-selected game on Steam and get at least one achievement (in games which have Steam achievements enabled) most of the time. This assumes someone bothers to try this in particular, if not it should still be obvious it can be done.
    • Score: F. This still doesn't seem to be fully possible in 2025 (although we might be getting pretty close). It certainly wasn't happening (or obvious it could happen) by the end of 2023.
  • AI proves an "interesting" result in mathematics (as judged by professional mathematicians) with minimal prompting.
    • Score: D+. While I don't believe there were any particularly interesting and original AI proofs produced with minimal prompting in 2023, there were some fascinating results produced with the help of AI. An interesting example of this would be FunSearch. I'd say I didn't do too badly on this prediction, although I still technically failed.
  • Major lawsuit involving AI trained on "stolen artwork" gets in the news
  • It is unclear if artists are actually losing significant amounts of work to AI, but plenty of op-eds get written which assume that premise.
  • I move out of my parent's house, possibly to LA for networking/work reasons, possibly remaining in Virginia, for community-building/health reasons. In a possibly related move, I finally come out to my parents, which probably goes okay, albeit with a small chance of being disowned by my grandparents.
    • Score: C. It happened, but I came out to my parents in early 2024, not 2023. The first half of the prediction can't be scored, as I mentioned both possibilities.
  • S.B.F. somehow remains a free, not-in-jail citizen, and continues to post questionable statements on Twitter.
    • Score: F. S.B.F. was in jail by the end of 2023, and although he was under house arrest for the first seven months of the year, that hardly counts as being a "free" citizen, so I'm failing myself on this one.
  • Anti-EA sentiment mostly dies down, but anti "AI safety" sentiment goes way up. The term has become associated with (perceived) censorship, and right-wing politicians may begin to shun people who use "AI safety" in their public branding. AI governance orgs try to adjust by going for a "national security" public angle. [Note that that last bit is incredibly speculative, and depends on too many factors to predict with any real confidence.]
    • Score: B. It didn't take too long after the fall of S.B.F. for anti-EA sentiment to fade from the public spotlight (although it still exists to some extent, especially after the whole Zizian cult disaster), but anti-AI-safety sentiment certainly seems much higher than it was in late 2022. I'm not quite sure how accurate my latter prediction was, but I don't think I was entirely wrong, so that counts for something, I'd say.
  • Multiple people land well-paying coding jobs and publicly post about how they "don't actually know how to code" (beyond some really basic level), but have been outsourcing everything to AI.
    • Score: C-. As far as I can tell, while people were just beginning to "vibe-code" in earnest, there wasn't much public discussion by the end of 2023 of people with no coding knowledge taking coding jobs. By now it's not that unheard of, but it took a few more years than I thought it would.
Predictions for 2024
  • Assuming Donald Trump is not barred from running, he will become president. If not him, it’s an easy DeSantis win. (Biden is the Democratic nominee of course, assuming he's still alive. As usual, the media pays no attention to third party candidates.)
    • Score: A. I didn't do too badly here. Although Biden dropped out in the end while still only the presumptive Democratic nominee, which makes "assuming he's still alive" kind of marginal, I'll take partial credit for that anyway.
  • AI writes a NYT best-selling book.
    • Score: D+. As far as I can tell, this did not happen in 2024. However, it seems actively implausible that AI assistance wasn't used to help write a NYT bestseller this year (though to be fair, I don't have direct proof of that), so I'd consider this a close miss.
  • Twitter is still functional, and most users haven't left the site. The workplace environment is kind of miserable though, and content moderation is still severely lacking (according to both sides of the culture war). Elon Musk is largely washed-up, and won't be doing anything too groundbreaking with the remainder of his life (outside of politics perhaps, which I won't rule out).
    • Score: A? I don't think I did too badly on this one. Twitter (now "X") is still fully functional, and it still has a large userbase. There have been multiple waves of layoffs and plenty of reported internal drama there, which sounds pretty miserable to me. Musk's main focus was his DOGE efforts, so he did go into politics, but outside of that, most people seem to consider him well past his intellectual prime. Obviously this sort of thing is largely subjective, but I think most people would agree my prediction(s) have held up.
  • A minor celebrity or big-name journalist finally discovers Erik Sheader Smith's video game The Endless Empty for the masterpiece it is, kickstarting its growth as a widely-hailed classic of the genre. My own game, Nepenthe, is largely forgotten by history, at least until someone discovers a certain easter egg, which is occasionally mentioned in 40+ minute long YouTube videos (you know the type).
    • Score: C+. My friend's masterpiece has not yet been discovered by big-name celebrities or journalists, but it has experienced an explosion in players and fan-artists from China, who do genuinely seem to regard it as a cult classic. The growth is entirely grassroots for now, however. Meanwhile, my videogame, while not entirely forgotten, isn't exactly growing a large fanbase or anything. It doesn't help that I've stepped away from making videogames over the past few years (though I'm considering getting back into it).
  • The social media battle going on between those who firmly believe that AI is "just copy-pasting others work" and those who firmly believe that AI is sentient (and want to free it), has reached enough intensity that it gets brought up a few times in the political news cycle. At least one (possibly fringe) candidate pledges to "protect the rights of artists" through AI legislation.
    • Score: B-. I think I got things directionally right here, except instead of the opposing view being "AI is sentient/deserves rights," it's "AI is helpful; forget about sentience," for the most part. Politicians did seriously talk about protecting artists' rights with AI legislation in 2024, as evidenced by things like the Generative AI Copyright Disclosure Act.
  • Some new video game nobody has heard about before goes viral among schoolchildren, sparking a wave of incredibly forced puns across news headlines worldwide.
    • Score: F. I'm grading myself harshly on this one. Despite there being a few viral indie game hits (like Balatro) in 2024, none of them really went massively viral among schoolchildren in the way something like Five Nights At Freddy's or Undertale did. I did not notice any wave of forced puns relating to said games, either.
  • China's economy has pretty much recovered from Covid. Other than that, hard to predict, but growth won't look terribly different from the rest of the world.
  • Companies start actually replacing a significant number of customer support jobs with AI. Consumers generally report being more satisfied as a result, to many people's annoyance.
  • Both teachers and students have the ability to easily automate online assignment work, leading to a growing number of absurdist scenarios where algorithms play meaningless educational games while teachers and students do their own thing, unwatching. This is objectively hilarious, but people get mad about it, leading to a poorly-managed escalation of the school surveillance arms race we already see today.
    • Score: A. Another win for my predictive abilities...not so much for the rest of the world. This pretty much came to pass, but I'm not giving myself an A+ because it's not clear to me just how much school surveillance has actually increased as a direct result of AI cheating concerns (though AI-powered school surveillance has certainly increased since 2022).
  • Another billionaire has emerged as an EA mega-donor.
    • Score: D. We still have Dustin Moskovitz (and his wife Cari Tuna) as billionaire mega-donors, but they aren't exactly new on the scene. Sadly, I was wrong about this one.
Predictions for 2025
  • Self-driving cars (and drone delivery) never quite reach market saturation due to some consumer/cultural pushback, but mostly due to legislation over "safety concerns," even if self-driving is significantly safer than human-driven vehicles by this point. However, more and more self-driving-adjacent features are added into "normal" cars for "safety reasons," so it's become increasingly hard to delineate any sort of clear line between AI and human-operated vehicles.
    • Score: A. This seems to be pretty much on the nose! The only potential issue is that it's arguably debatable whether self-driving is truly "significantly safer" than human driving, mostly due to issues like mass outages during crisis situations. I think it's safer, but I can see how a reasonable person might disagree, so I'm not giving myself an A+.
  • I am in love.
    • Score: A. It's a long and dramatic story, but this isn't the time or place to share it...
  • A mass fatality event occurs due to what could plausibly be interpreted as "misaligned AI." This sparks some countries to pass a whole bunch of AI-related laws which are totally ignored by other countries. The AI safety community is split on if the blame for what happened should be placed on misaligned AI, human error, or some complex mix of both. For whatever reason, a popular language model (developed for entertainment perhaps) publicly takes responsibility, despite seemingly having nothing to do with the incident. For the most part though, this is treated as just another tragedy in the news cycle, and is ignored by most people.
    • Score: D. There was no single "mass fatality event" caused by AI this year. That being said, there have been a significant number of murders and suicides plausibly linked to AI psychosis, which, if considered together, likely resulted in a large number of unnecessary deaths. It's debatable to me if this should count, but I'm leaning against it, as it's not the sort of thing I was envisioning at the time, I think. There have indeed been a number of irregularly enforced AI safety laws passed, but not as many as I would have expected. I was correct that people are split over how much AI is to blame for the deaths that have occurred, but incorrect that an AI would erroneously take the blame on itself for said deaths. And indeed, most people simply ignore the whole thing, and it's not the primary driver of the news cycle this year.
  • Someone who has at some point called themself "rationalist" or "EA" commits a serious crime with the intention of halting capabilities gain at some company or another. This is totally ineffective, everyone agrees that that was like, the least rational or altruistic action they could have possibly taken, but the media runs with exactly the sort of story you'd expect it to run with. This makes AI governance work a bit harder, and further dampens communications between safety and capabilities researchers. Overall though, things pretty much move on.
  • Despite having more funding than ever before, the quality and quantity of AI safety research seems...slightly lesser. It's unclear what the exact cause is, though some point out that they've been having a harder time staying focused lately, what with [insert groundbreaking new technology here].
    • Score: C. AI safety funding is indeed going strong. It is unclear to me if research is better or worse than it was in late 2022, but AI safety research in general seems to have taken a backseat within the largest AI companies, which is worrying. Some research does suggest that using tools like Cursor actually slowed developers down, despite a perception that it was speeding up work, which arguably counts as a partial win for my prediction.
  • Youtube dies a horrible death in a totally unpredictable manner. The whole disaster is retroactively considered clearly inevitable by experts. There is much mourning and gnashing of teeth, but the memes, too, are bountiful.
    • Score: F. This did not happen.
  • The sun rises and the sun falls.
    • Score: A+. This actually happened multiple times!
  • Me and my friends are still alive.
    • Score: B. I am still alive, and so are most of my friends, but there are a few who seem to have disappeared from the internet, and I am worried about them. I hope they are okay, but I have no guarantee, so I don't feel comfortable giving this an A+.
Conclusion

Um...I'm not sure what conclusion to take away from all of this. Predicting the future is hard, and I certainly failed a lot, but also, I was pleasantly surprised to see how much I got right, or at least got directionally correct. It seems like I generally over-updated on the rate of advancement in 2022, and assumed things would move faster than they did. That being said, I really don't think I did too badly compared to those around me at the time, and I'm proud of what I did get right.

Happy New Year!



Discuss

Me, Myself, and AI

December 31, 2025 - 18:28
Published on December 31, 2025 7:27 AM GMT

I’ve spent the past several months dabbling with AI in the form of various chat interfaces from VS Code, Cursor, Antigravity, and the Mac Terminal. Each provided its own twist on the same formula: type something, and some cool shit pops out.

When I was a wee young lad, I went to my cousin’s house for New Year’s Eve while the adults went out to have fun. I was the baby boy in the family, so I was not even considered when I asked to play with the new computer. It was glorious. The screen was that terminal green most of us can picture in our minds. This is why The Matrix hit us so hard with its machine code bent out of reality, and forced so many of us to start thinking we might be in a simulation.

My brother and cousins were extremely intelligent. They knew that the micro PC was the best Christmas present because it was a huge flex living on the edge of suburban and rural Georgia. It was such a cool toy because the first experience I had with a computer wasn’t watching code stream in a shell, but the expression of that code: Snake.

The goal of the game was to eat and grow the snake without letting it eat itself. It’s an integral tenet in biology. The goal is to not die, which usually means eating other living things. If you eat yourself or your family, something might be wrong. Human starvation results in autocannibalism through various mechanisms, but if you don’t eat, you will die while trying to eat yourself.

You might think that eating another human will be OK if all else fails. As some of our ancestors found out, eating other humans can result in neurological disease worse than death, and then death. Prions are pathogens that can spread through human cannibalism, and they are most likely the reason we learned not to eat other humans. I am not sure about Neanderthals.

Prions can come from a few sources, but one known pathway is when humans begin to eat other humans, especially the brain. Some prion-like diseases will come to mind, like Alzheimer’s, where the tau protein is most likely the culprit; tau is the protein most similar to a prion, but it is not one. The danger is that prions act outside the central dogma of biology, because information can pass from protein to protein rather than through the normal channels of replication, transcription, and translation.

Simply, a prion, or a misfolded protein, interacts with a normal version of that protein, resulting in a conformational change into the prion version. There are no cures for human prions, and they result in an unstoppable death cascade. This is why that snake game popped out at me during this time, in this moment. Once the snake starts eating its own tail, it can’t stop, and death is inevitable. Game over.
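
To make the runaway nature of that cascade concrete, here is a toy simulation. It is purely my own illustration under simple assumptions (mass-action conversion at a fixed rate, no clearance or reverse reaction), not a biochemical model and not anything from the original game or literature beyond the mechanism described above.

    def prion_cascade(healthy=1e6, misfolded=1.0, k=1e-7, steps=400):
        """Toy autocatalytic conversion: misfolded copies convert healthy ones,
        and there is no reverse reaction, so the cascade only ever grows."""
        trajectory = []
        for _ in range(steps):
            converted = k * healthy * misfolded  # mass-action-style conversion per step
            healthy -= converted
            misfolded += converted
            trajectory.append(misfolded)
        return trajectory

    traj = prion_cascade()
    print(f"{traj[0]:.2f} -> {traj[-1]:.2e}")  # slow start, then nearly every copy is misfolded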

I started thinking about prions and wondered why we couldn’t create therapies as we have with viruses and other pathogens. Prions are just a single protein, but far more deadly than any other transmissible disease vector. Could we develop prions as a therapy and oppose prion-like diseases like Alzheimer’s? It seemed to me we were missing something huge in medicine and aging biology.

During my prion research, I learned that they were discovered with the help of fungi. In one species with a mycelium network, a prion acts as a determinant of colony origin. Basically, this protein determines self versus non-self. It fucked my world-view up. There was a prion that was actually useful and acted like a mushroom immune system, rather than a disease with a guaranteed outcome of death for the host.

Are there versions of this type of prion in humans? The answer is we do not know. Single-cell proteomics is cost-prohibitive and limited. Despite this, I still think prions or protein-to-protein information transmission is one of the reasons we age and die. 

The determination of the self starts immediately when the embryo forms because that is the key to development. We are born despite our mother’s immune system, which tries to kill invaders hijacking resources. The embryo is a type of pathogen, specifically a parasite, if you change the context. The point is that our body and mind develop in a foreign environment in opposition to a very effective and intelligent force. The baby develops because it is a hybrid of the mother’s and father’s immune system. If the immune system of the father is too different from that of the embryo, the embryo does not survive.

There are so many embryos, sperm, and eggs that never had the chance to develop into humans, much less adults. The number of potential humans is far higher than the number of humans achieving adulthood. A Nobel Prize was given to a scientist proving that cells can be reset via the central dogma of biology, but no one has proven that cells transfer information through the cytoplasm, which I think comes in the form of long-lived, persistent proteins that may or may not be similar to prions. Some of these proteins are the backbone of the immune system and come from the mother through the cytoplasm, which is the basis for my cytoclock aging biology timer. (I am saving this for another post.)

The immune system forms antigen receptors through the V(D)J system, resulting in a vast database that can detect antigens, which are signals of non-self. At first, I thought creating a model based on the immune system’s intelligence could work. It still might, but I have not found anyone who has developed a frontier immune system model on par with current LLMs. Instead, artificial intelligence was designed and developed based on the human mind, without internal opposition. The brain developed without the fundamental force that drives evolving systems: the immune system.

I love a good pivot, and at some point, I had the idea of creating an artificial immune system (AIS) from multiple artificial intelligences (AIs), which I recently dubbed antigents, to help new frontier models determine self from the start. AIS have been around for a while because biologically inspired design works in computing. Ultimately, I think the agentic model needs an opposing force: antigentic models, which are trained artificial immune systems to counteract cascading critical errors that cannot be cured, much like a prion or prion-like disease. 

I have been working with different AI models to develop my core hypothesis, resulting in an AIS layer for safety, security, and risk management. It is in the early stages, but I did apply for a grant, and I am publishing a proof-of-concept, early-stage software package that replicates a recent research paper and builds on previous AIS iterations. AIS are not new, but they haven’t been applied at the level I am discussing. The concept has existed in several forms over the decades, including antivirus software.
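
For readers unfamiliar with the field, the classic building block of an artificial immune system is negative selection: generate candidate detectors at random and keep only those that do not match observed "self" behavior, so that anything a surviving detector later matches gets flagged as non-self. The sketch below is a minimal, generic illustration of that idea; it is my own toy example, not the author's package, and the feature space, radius, and detector count are arbitrary assumptions.

    import numpy as np

    def train_detectors(self_samples, n_detectors=300, radius=0.1, seed=0):
        """Negative selection: keep random detectors that do NOT match any 'self' sample."""
        rng = np.random.default_rng(seed)
        dim = self_samples.shape[1]
        detectors = []
        while len(detectors) < n_detectors:
            d = rng.uniform(0.0, 1.0, size=dim)
            if np.min(np.linalg.norm(self_samples - d, axis=1)) > radius:
                detectors.append(d)
        return np.array(detectors)

    def is_nonself(x, detectors, radius=0.1):
        """Flag an input as non-self if any surviving detector lies within `radius` of it."""
        return bool(np.min(np.linalg.norm(detectors - x, axis=1)) <= radius)

    # Toy usage: "self" behaviour clusters around (0.5, 0.5) in a 2-D feature space.
    rng = np.random.default_rng(1)
    self_samples = 0.5 + 0.05 * rng.standard_normal((500, 2))
    detectors = train_detectors(self_samples)
    print(is_nonself(np.array([0.5, 0.5]), detectors))  # expected: False (looks like self)
    print(is_nonself(np.array([0.9, 0.9]), detectors))  # expected: True  (anomalous)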

Instead of reactively healing AIs after they develop and become Frankenstein’s monster, let us redesign the human body’s source of truth for artificial general intelligence. Based on my initial research and discussions with others, I think this can be applied inside or outside the magical black box. I hypothesize that creating an AIS will mitigate current agentic critical errors and allow future frontier models to be developed in alignment with humanity at inception. (I was tempted to say "at conception," since that is more in line with the development of the embryo, but what can you do?)



Discuss

Mystical and psychotic states: similarities and differences

December 31, 2025 - 18:09
Published on December 31, 2025 2:02 PM GMT

This post is a reflection on the Critical Meditation Theory[1] by lsusr. I find it interesting as an attempt to integrate all states of consciousness into the matrix of experience. As someone who has experienced both states, it got me thinking: what are the similarities between them, and what are the differences?

I'll start with a disclaimer: I am just a curious layman reflecting on this difficult topic by observing my experience and some data to support my words. So it's just a speculation from one particular and peculiar observer.

First, it is difficult to define what a mystical state is even to those who are studying them, so I will use a series of descriptions which underlie most mystical states: the "oceanic" feeling of oneness, loss of the self, perfect order, equanimity, stillness, peace, reduced/no thoughts. Commonly known examples include: "flow" in sports, deep absorption in chanting, etc.

Second, a similar description of a psychotic state: strong feeling of alienation, pronounced and distorted ego (usually with some "saving mission"), disorder, distress, barrage of thoughts.

So, as one can imagine, they are diametrically opposed to each other. And as a rule, while one wants to return to the mystical state, one doesn't want to return to the psychotic state under any conditions.

However, I would like to emphasize that life is not so simple, and sometimes disorientation might be an inevitable part of living and adaptation, and even a good sign (if it leads to insight). Gary Weber (a PhD in Materials Physics who reached a no-thought state) elaborates on the nature of disoriented states in his book Happiness Beyond Thought[2]:

Complex systems theory is now being applied to psychological systems. In The Psychological Meaning of Chaos, Masterpasqua and Perna describe how traditional views of equilibrium and stability are assumed to connote healthy mental states, while non-equilibrium and disorder are judged to be unhealthy. Their contrary approach is that the opposite may actually be more correct. They believe that psychological pain results when one becomes locked in a futile attempt at stability and equilibrium “in order to maintain an old way of knowing and to resist the inevitable emergent novelty woven into the process of living.”[3]

They also conclude that what looks like a disordered state may actually be the best way to deal with a continuously restructuring self that is the result of the complexity of today’s world.

But for the most part, psychoses are best avoided.

What drives mystical states? Gary Weber in his lecture Myths about Nonduality and Science[4] quotes Newberg and d'Aquili who studied them:

The idea behind the Newberg and d’Aquili’s model being if we can run these [sympathetic and parasympathetic nervous systems] so you actually have the two of them at the same time fully activated — they will conflict and they will shut down the inputs to the temporal and parietal lobes, that do the important things for mystical experiences. You just jam the circuits and you just stop anything from going to these places that are expecting input, and almost always get input, and all of a sudden there’s no input. And they’ve postulated that that’s exactly what pushes mystical states into being.

In other words, blocking input to particular regions of the brain.

What drives psychotic states? There is no single satisfying answer (genetic predisposition plays a role, but it is not the whole picture). But I would present a hypothesis from the Pavlovian school of thought. Ivan Pavlov, after his experiments on dogs[5], postulated that the driver behind psychotic states is a strong ultraparadox state (granted, Pavlov’s framing is historical and rather metaphorical, and contemporary psychiatry uses different models, but it is helpful for elucidating the issue at hand). An ultraparadox state is a state in which the majority of inputs produce disordered and often contradictory outputs. What's important is how one ends up in this ultraparadox state. Pavlov postulated that an ultraparadox state is driven by a strong existential contradiction, or series of contradictions, that cannot be resolved. For example: "I have to run", and at the same time, "I have no legs".

Therefore one might deduce that a psychotic state may be induced by a strong existential contradiction (or series of contradictions) in thinking which cannot be resolved. The keyword here is contradiction. I'm going to postulate that contradiction is what mystical and psychotic states have in common, and I will later reflect on their major difference.

Where does the shared cause of mystical and psychotic states lie? What is a typical koan practice (or insight practice)? It is the introduction of an experiential contradiction into thinking which cannot be resolved on the rational level, e.g. "Who am I?", "Who hears?", etc., which has an inhibitory effect on particular regions of the brain. So the common factor in both states is a contradiction that cannot be resolved rationally.

Further, I want to postulate a major difference. But first, I will define an existential threat as an existential contradiction that threatens the integrity of the organism (e.g. "I want to be safe."/"Someone threatens me with a knife."). I assume that during the anticipation of the existential threat the predominant default modes of thinking are initiated. As most of us identify with the body by default — i.e. when we say "I", we mean the body-mind complex — it is the default system that gets initiated during the anticipation of the existential threat. That is, the anticipation of the existential threat strengthens the self model currently operating in thinking.

And that leads us to a major difference between contradictions introduced in Zen practice of koans and the anticipation of an existential threat. In Zen practice (and other meditation practices) we switch from the ruminating network responsible for building the images "self in time" and "self and other"[6], the Default Mode Network (DMN), to the Task-Positive Network (TPN) to resolve the experiential contradiction[7]. So it inhibits signals from going to the regions of the brain that are building the model of the self (in effect shutting down the mechanism that builds "self and other" and "self in time", hence we get mystical "all is one" and "now, now, now"). While during the anticipation of the existential threat we are activating regions of the brain that are building the model of the self (feeding "self and other" and "self in time").

Insight practice deactivates the DMN by stabilizing attention in the TPN, whereas anticipating an existential threat does the opposite and activates the DMN. In one case, we are inhibiting the signals from going to the DMN and in another case we are activating them.

That, I suggest, is the major difference between the two states. Being stuck in heavy self-rumination makes one prone to experiencing a psychotic state. Being busy with a task makes one more open to experiencing a mystical state ("flow" is the most widespread mystical state available to anyone). Given that schizophrenia is associated with a hyperactive DMN, it is no wonder this condition entails greater vulnerability to psychoses, as the brain struggles to stabilize in the TPN.

To sum up, what mystical and psychotic states have in common is a contradiction in thinking. But the nature of that contradiction varies significantly: mystical states are reached through the activation of the TPN; while psychotic states are ignited through the overload of the DMN. One inhibits the self-ruminating network; the other activates it.

In view of this model, how can one minimize the risk of going psychotic? In times of uncertainty (hoping that those are not existential threats) it is by activating the TPN and stabilizing there, i.e. by tasking. The means for this are numerous, but the best thing is to do the task one loves doing (if common means are not available, then self-inquiry[8], breathing practice, chanting[9], mudras[10], etc. might do the trick). 

  1. ^

     lsusr, Critical Meditation Theory.

  2. ^

    Gary Weber, Happiness Beyond Thought.

  3. ^

    Masterpasqua, F., and Perna, P., The Psychological Meaning of Chaos: Translating Theory into Practice, American Psychological Association, Washington, DC, 1997, 36-37.

  4. ^

    Myths about Nonduality and Science

  5. ^

    I.P. Pavlov, Lectures on Conditioned Reflexes. (Twenty-Five Years of Objective Study of the Higher Nervous Activity (Behaviour) of Animals).

  6. ^

    Jessica R. Andrews-Hanna et al., Functional-Anatomic Fractionation of the Brain's Default Network.

  7. ^

    Judson A. Brewer et al., Meditation experience is associated with differences in default mode network activity and connectivity.

  8. ^

    Self-inquiry.

  9. ^

    Gary Weber, Simple Chants for NonDual Awakening.

  10. ^

    Gary Weber, Kirtan Kriya for Stress, Comprehension, Memory.



Discuss

My 2025 in review

December 31, 2025 - 17:46
Published on December 31, 2025 2:46 PM GMT

Everyone loves writing annual letters these days. It’s the thing. (I blame Dan Wang.)

So here’s mine. At least I can say I’ve been doing it for as long as Dan: nine years running (proof: 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024). As usual, this is more of a personal essay/reflection, and not so much of an organizational annual report, although I will start with some comments on…

RPI

Over the last three years, the Roots of Progress Institute has gone from “a guy and his blog” to a full-fledged cultural institute. This year we:

  • Held our second annual Progress Conference, featuring speakers including Sam Altman, Blake Scholl, Tyler Cowen, and Michael Kratsios (Director, OSTP). The conference has become the central, must-attend event for the progress community: it is sold out each year, with hundreds on the waitlist, and some attendees report it is literally the best conference they have ever attended.
  • Inducted the third cohort of our progress writers fellowship, bringing the total to 74 fellows. Our fellows are having impact: Dean Ball helped draft the Trump administration’s AI Action Plan, Madeline Hart has co-authored a book with the CTO of Palantir on revitalizing the American defense industry, Ryan Puzycki helped legalize single-stair buildings in Austin (a key YIMBY reform), and three other fellows have recently had opinion pieces in the NYT or WSJ.
  • Announced our first education initiative: Progress in Medicine, a high school summer career exploration program. I’ve previewed the content for this course and I’m jealous of these kids—I wish I had had something like this when I was a teenager!

And the best part about all of these programs is that I don’t have to run any of them! I have a fantastic staff at RPI who deserves credit for all of these, from design to execution: Emma McAleavy, Ben Thomas, Yel Alonzo, and especially Heike Larson—thanks to them for making our programs a success every year.

We’re a 501(c)(3) nonprofit, supported mostly by donations. There’s still time to get in a last-minute end-of-year contribution. Huge thanks to all those who have already given this year!

My writing

Most of my writing effort this year was devoted to finishing The Techno-Humanist Manifesto, an essay-series-cum-book laying out my philosophy of progress. In 2025 I published the last 14 (out of 21) essays in the series; you can read them all here. Also, as just announced, I’ve signed with MIT Press to publish a revised version of the series in book form. The manuscript is out for comment now, and (given typical publishing schedules) I expect the book to launch in early 2027.

I also wrote eight other essays, and ten links digests. I put the links digest on hold after May in order to focus on finishing the book, but I’m working on bringing it back. All subscribers get the announcements and opportunities at the top, but the rest of the digest is paywalled, so subscribe now to get the full version.

The most-liked posts here on Substack were:

The most-commented posts were:

My longest post, at over 8,400 words, was:

I now have well over 55,000 subscribers on Substack, up over 68% YOY.

Social media

Here are some of my most-liked posts and threads of the year:

You can join well over 40,000 people who follow me on Twitter, or find me on your favorite social network; I’m on pretty much all of them.

Speaking and events

Like last year, I tried to mostly say no to events and speaking gigs this year, but there were a few I couldn’t refuse. Some highlights of the year:

  • I spoke at “d/acc Day” alongside Vitalik Buterin, Juan Benet, Mary Lou Jepsen, Allison Duettmann, and others. My talk was “d/acc: The first 150 years”: a whirlwind tour of how society has thought about progress, decentralization and defense over the last century and a half
  • I gave a short talk at Social Science Foo Camp titled “The Fourth Age of Humanity?”, based on ideas that I later wrote up in The Flywheel
  • I did a fun Interintellect salon with Virginia Postrel based on her essay “The World of Tomorrow”
  • I hosted a discussion series at Edge Esmeralda with the aim of envisioning the future. Each day there was a ~90-minute session with a theme like AI, health & bio, or energy
  • I went to Mojave to watch the first supersonic flight of the Boom XB-1 test plane. Here’s some video I took of the plane taxiing down the runway, and then the pilot getting out after landing and shaking hands with Boom founder Blake Scholl

In 2026 I hope to do more travel, events and speaking. But maybe I’ll just hole up and write some more.

Reading

I put my monthly “what I’ve been reading” updates on hold at the end of 2023 (!) in order to focus on the book. I’d like to bring these back, too. For now, here are some of the highlights from my reading this year (that is, things I thought were interesting and valuable to read, not necessarily things I “liked” or agreed with).

Books and other book-length things I read

Or read at least most of:

Max Bennett, A Brief History of Intelligence. A history of the evolution of the brain, from the first animals through humans. It is organized into five major evolutionary steps—to oversimplify: the worm brain, the fish brain, the mouse brain, the monkey brain, and the human brain. This answered some key questions I had on the topic; it is very well-written, and probably my favorite of the year. Hat-tip to @eshear.

Charles Mann, How the System Works, an essay series in The New Atlantis. It covers four of the major systems that form the foundation of industrial civilization and help deliver our modern standard of living: agriculture, water sanitation, electricity, and public health. Mann thinks of these pieces as the start of a curriculum that should be taught in schools—inspired by a group of “smart, well-educated twenty-somethings” who “wanted the hungry to be fed, the thirsty to have water, the poor to have light, the sick to be well,” but “knew little about the mechanisms of today’s food, water, energy, and public-health systems. They wanted a better world, but they didn’t know how this one worked.” Enjoyed this, recommended.

Brian Potter, The Origins of Efficiency, from Stripe Press, a history of manufacturing efficiency. Light bulbs used to cost ~$50 (adjusted for inflation), now they cost 50 cents; how did that happen? This is a comprehensive and very readable overview of the answer to that question and others like it.

For the (much longer) full reading update, and some thoughts on what’s next for my writing, subscribe on Substack.



Discuss

Progress update: synthetic models of natural data

December 31, 2025 - 04:31
Published on December 31, 2025 1:31 AM GMT

This post presents a brief progress update on the research I am doing as part of the renormalization research group at PIBBSS (Principles of Intelligent Behavior in Biological and Social Systems). The code to generate synthetic datasets based on the percolation data model is available in this repository. It employs a newly developed algorithm that constructs a dataset in a way that explicitly and iteratively reveals its innate hierarchical structure: increasing the number of data points corresponds to representing the same dataset at a more fine-grained level of abstraction.

Introduction

Ambitious mechanistic interpretability requires understanding the structure that neural networks uncover from data. A quantitative theoretical model of natural data's organizing structure would be of great value for AI safety. In particular, it would allow researchers to build interpretability tools that decompose neural networks along their natural scales of abstraction, and to create principled synthetic datasets to validate and improve those tools.

A useful data structure model should reproduce natural data's empirical properties:

  • Sparse: relevant latent variables occur, and co-occur, rarely.
  • Hierarchical: these variables interact compositionally at many levels.
  • Low-dimensional: representations can be compressed because the space of valid samples is highly constrained.
  • Power-law-distributed: meaningful categories exist over many scales, with a long tail.

To this end, I'm investigating a data model based on high-dimensional percolation theory that describes statistically self-similar, sparse, and power-law-distributed data distributional structure. I originally developed this model to better understand neural scaling laws. In my current project, I'm creating concrete synthetic datasets based on the percolation model. Because these datasets have associated ground-truth latent features, I will explore the extent to which they can provide a testbed for developing improved interpretability tools. By applying the percolation model to interpretability, I also hope to test its predictive power, for example, by investigating whether similar failure modes (e.g. feature splitting) occur across synthetic and natural data distributions.

The motivation behind this research is to develop a simple, analytically tractable model of multiscale data structure that, to the extent possible, usefully predicts the structure of concepts learned by optimal AI systems. From the viewpoint of theoretical AI alignment, this research direction complements approaches that aim to develop a theory of concepts.

Percolation Theory

The branch of physics concerned with analyzing the properties of clusters of randomly occupied units on a lattice is called percolation theory (Stauffer & Aharony, 1994). In this framework, sites (or bonds) are occupied independently at random with probability p, and connected sites form clusters. While direct numerical simulation of percolation on a high-dimensional lattice is intractable due to the curse of dimensionality, the high-dimensional problem is exactly solvable analytically. Clusters are vanishingly unlikely to have loops (in high dimensions, a random path doesn't self-intersect), and the problem can be approximated by modeling the lattice as an infinite tree[1]. In particular, percolation clusters on a high-dimensional lattice (at or above the upper critical dimension d ≥ 6) that are at or near criticality can be accurately modeled using the Bethe lattice, an infinite treelike graph in which each node has identical degree z. For site or bond percolation on the Bethe lattice, the percolation threshold is p_c = 1/(z − 1). Using the Bethe lattice as an approximate model of a hypercubic lattice of dimension d gives z = 2d and p_c = 1/(2d − 1). A brief self-contained review based on standard references can be found in Brill (2025, App. A).
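
As a quick illustration of the threshold formula, cluster growth on a Bethe lattice is just a branching process, so it can be simulated directly. The sketch below is my own illustration (not code from the linked repository); for simplicity it treats every site, including the root, as having z − 1 forward branches.

    import random

    def cluster_size(z, p, max_size=10_000, rng=random):
        """Grow one percolation cluster on a Bethe lattice with coordination number z.
        Each frontier site has z - 1 forward neighbours, each occupied with probability p."""
        size, frontier = 1, 1
        while frontier and size < max_size:
            children = sum(1 for _ in range(frontier * (z - 1)) if rng.random() < p)
            size += children
            frontier = children
        return size

    z = 12                    # Bethe-lattice stand-in for a d = 6 hypercubic lattice (z = 2d)
    p_c = 1 / (z - 1)         # percolation threshold
    sizes = [cluster_size(z, p_c) for _ in range(5_000)]
    print(max(sizes), sum(sizes) / len(sizes))  # heavy tail: rare, very large clusters at criticality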

Algorithm

The repository implements an algorithm to simulate a data distribution modeled as a critical percolation cluster distribution on a large high-dimensional lattice, using an explicitly hierarchical approach. The algorithm consists of two stages. First, in the generation stage, a set of percolation clusters is generated iteratively. Each iteration represents a single "fine-graining" step in which a single data point (site) is decomposed into two related data points. The generation stage produces a set of undirected, treelike graphs representing the clusters, and a forest of binary latent features that denote each point's membership in a cluster or subcluster. Each point has an associated value that is a function of its latent subcluster membership features. Second, in the embedding stage, the graphs are embedded into a vector space following a branching random walk.

In the generation stage, each iteration follows one of two alternatives. With probability create_prob, a new cluster with one point is created. Otherwise, an existing point is selected at random and removed, becoming a latent feature. This parent is replaced by two new child points connected to each other by a new edge. Call these points a and b. The child points a and b are assigned values as a stochastic function of the parent's value. Each former neighbor of the parent is then connected to either a with probability split_prob, or to b with probability 1 - split_prob. The parameter values that yield the correct cluster structure can be shown to be create_prob = 1/3 and split_prob = 0.2096414. The derivations of these values and full details on the algorithm will be presented in a forthcoming publication.
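
To make the two stages concrete, here is a stripped-down sketch of the kind of procedure described above. It is my own reconstruction from the description in this post, not the repository's actual code; details such as how child values are drawn and how the branching random walk is seeded are illustrative assumptions.

    import random
    import numpy as np

    CREATE_PROB = 1 / 3       # probability of starting a new one-point cluster (from the post)
    SPLIT_PROB = 0.2096414    # probability a former neighbour reattaches to child `a` (from the post)

    def generate(num_steps, seed=0):
        """Generation stage: iteratively fine-grain points into pairs of children."""
        rng = random.Random(seed)
        neighbors = {}   # point id -> set of neighbouring point ids (treelike clusters)
        parent_of = {}   # point id -> the latent feature (removed parent) it came from
        values = {}      # point id -> value, a stochastic function of the parent's value
        next_id = 0
        for _ in range(num_steps):
            if not neighbors or rng.random() < CREATE_PROB:
                neighbors[next_id] = set()       # new cluster with a single point
                parent_of[next_id] = None
                values[next_id] = rng.gauss(0.0, 1.0)
                next_id += 1
            else:
                parent = rng.choice(list(neighbors))   # point to fine-grain
                a, b = next_id, next_id + 1
                next_id += 2
                neighbors[a], neighbors[b] = {b}, {a}
                parent_of[a] = parent_of[b] = parent   # parent becomes a latent feature
                for child in (a, b):
                    values[child] = values[parent] + rng.gauss(0.0, 0.1)
                for n in neighbors.pop(parent):        # reassign the parent's former neighbours
                    neighbors[n].discard(parent)
                    target = a if rng.random() < SPLIT_PROB else b
                    neighbors[n].add(target)
                    neighbors[target].add(n)
                del values[parent]
        return neighbors, parent_of, values

    def embed(neighbors, dim=128, step_scale=1.0, seed=0):
        """Embedding stage: branching random walk over each treelike cluster."""
        rng = np.random.default_rng(seed)
        emb = {}
        for start in neighbors:
            if start in emb:
                continue
            emb[start] = rng.normal(scale=step_scale, size=dim)  # root of this cluster
            stack = [start]
            while stack:
                node = stack.pop()
                for nb in neighbors[node]:
                    if nb not in emb:
                        emb[nb] = emb[node] + rng.normal(scale=step_scale, size=dim)
                        stack.append(nb)
        return emb

    neighbors, parent_of, values = generate(num_steps=5_000)
    emb = embed(neighbors)

In line with the caveats below, only a small random subset of the generated points would then be used as a training set.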

Caveats
  • Because the data generation and embedding procedures are stochastic, any studies should be repeated using multiple datasets generated using different random seeds.
  • The embedding procedure relies on the statistical tendency of random vectors to be approximately orthogonal in high dimensions. An embedding dimension of O(100) or greater is recommended to avoid rare discrepancies between nearest neighbors in the percolation graph structure and nearest neighbors among the embedded data points (a quick numerical check of this tendency is sketched just after this list).
  • A generated dataset represents a data distribution, i.e. the set of all possible data points that could theoretically be observed. To obtain a realistic analog of a machine learning dataset, only a tiny subset of a generated dataset should be used for training.
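The near-orthogonality claim in the second caveat is easy to check numerically. The sketch below (assuming numpy) estimates the typical cosine similarity between independent random vectors; it falls off roughly like 1/sqrt(d), so by a few hundred dimensions accidental near-alignments are rare.

```python
import numpy as np


def typical_abs_cosine(dim: int, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Mean |cosine similarity| between pairs of independent random vectors in R^dim."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_pairs, dim))
    y = rng.standard_normal((n_pairs, dim))
    cos = (x * y).sum(axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    return float(np.abs(cos).mean())


for d in (10, 100, 1000):
    print(d, round(typical_abs_cosine(d), 3))  # shrinks roughly like 1/sqrt(d)
```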
Next Steps

In the coming months, I hope to share more details on this work as I scale up the synthetic datasets, train neural networks on the data, and interpret those networks. The data model intrinsically defines scales of reconstruction quality corresponding to learning more clusters and interpolating them at higher resolution. Because of this, I'm particularly excited about the potential to develop interpretability metrics for these datasets that trade off the breadth and depth of recovered concepts in a principled way.

  1. ^

    Percolation on a tree can be thought of as the mean-field approximation for percolation on a lattice, neglecting the possibility of closed loops.



Discuss

Personalization Requires Data

31 декабря, 2025 - 03:45
Published on December 31, 2025 12:45 AM GMT

In 2025, AI models learned to effectively search and process vast amounts of information to take actions. This has shown its colors the most in coding, eg through harnesses like claude code that have had a sizable impact on programmers’ workflows.

But this year of progress doesn’t seem to have had that much of an effect on the personalization of our interactions with models, ie whether models understand the user’s context, what they care about, and their intentions, in a way that allows them to answer better. Most chatbot users’ personalization is still limited to a system prompt. Memory features don’t seem that effective at actually learning things about the user.

But the reason machine learning has not succeeded at personalization is lack of data (as it often is!). We do not have any validated ways of grading ML methods (training or outer-loop) for personalization. Getting good eval signals for personalization is hard, because grading a model’s personalization is intrinsically subjective, and requires feedback at the level of the life and interactions that the model is being personalized to. There is no verified signal, and building a generic rubric seems hard. These facts do not mesh well with the current direction of machine learning, which is just now starting to go beyond verifiable rewards into rubrics, and is fueled by narratives of human replacement that make personalization seem beside the point (if I am building the recursively improving autonomous AGI, why do I need to make it personalized?).1

Start by giving models more information

But how do we obtain data for personalization? I think the first step to answering this question is having consumers of AI curate their personal data and then share it to enrich their interactions with AI systems.

Instead of just a system prompt, giving models a searchable artifact of our writing, notes, and reading history. Something agents can explore when your questions might benefit from context—and write to, to remember things for later.

whorl - My first guess

Over the break, I built a very simple software tool to do this. It’s called whorl, and you can install it here.

whorl is a local server that holds any text you give it—journal entries, website posts, documents, reading notes, etc…—and exposes an MCP that lets models search and query it. Point it at a folder or upload files.
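To make the shape of such a tool concrete, here is a minimal sketch of a whorl-like server written with the MCP Python SDK's FastMCP helper (as I understand its quickstart API). This is not whorl's actual code: the folder path, tool names, and the naive substring search are all illustrative assumptions.

```python
from pathlib import Path

from mcp.server.fastmcp import FastMCP  # MCP Python SDK (assumed dependency)

NOTES_DIR = Path("~/notes").expanduser()  # hypothetical folder of personal text files
mcp = FastMCP("personal-context")


@mcp.tool()
def search_notes(query: str, max_results: int = 5) -> str:
    """Return snippets from personal text files whose content matches the query."""
    hits = []
    for path in sorted(NOTES_DIR.rglob("*.txt")):
        text = path.read_text(errors="ignore")
        idx = text.lower().find(query.lower())
        if idx != -1:
            hits.append(f"{path.name}: ...{text[max(0, idx - 100):idx + 200]}...")
        if len(hits) >= max_results:
            break
    return "\n\n".join(hits) or "No matches."


@mcp.tool()
def remember(note: str) -> str:
    """Append a note to the knowledge base so later conversations can find it."""
    NOTES_DIR.mkdir(parents=True, exist_ok=True)
    with (NOTES_DIR / "model_memory.txt").open("a") as f:
        f.write(note.strip() + "\n")
    return "Saved."


if __name__ == "__main__":
    mcp.run()  # serve the tools over MCP for a client such as Claude Code
```

Pointing Claude Code (or any other MCP client) at a server like this approximates the flow described above: the model pulls personal context on demand and can write back things worth remembering.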

I gave it my journals, website, and miscellaneous docs, and started using Claude Code with the whorl MCP. Its responses were much more personalized to my actual context and experiences.

Examples

First I asked it:

do a deep investigation of this personal knowledge base, and make a text representation of the user. this is a text that another model could be prompted with, and would lead them to interacting with the user in a way the user would enjoy more

It ran a bunch of bash and search calls, thought for a bit, and then made a detailed profile of me (linked here); my guess is that its quality beats many low-effort system prompts.

I’m an ML researcher, so I then asked it to recommend papers and explain the motivation for various recs. Many of these I’ve already read, but it has some interesting suggestions, well above my usual experience with these kinds of prompts. See here.

These prompts are those where the effect of personalization is most clear, but this is also useful in general chat convos, allowing the model to query and search for details that might be relevant.

It can also use the MCP to modify and correct the artifact provided to it, to optimize for later interactions – especially if you host a “user guide” there like the one I linked. Intentionally sharing personal data artifacts is the first step to having agents that understand you.

Conclusion

Personalization requires data. People need to invest in seeing what models can do with their current data, and in figuring out what flows and kinds of interactions this data is useful for, towards building technology that can empower humans towards their goals. whorl is a simple tool that makes that first step easy. People who have already created a bunch of content should use that content to enhance their interactions with AIs.



Discuss

Please remember how strange this all is.

31 декабря, 2025 - 03:36
Published on December 31, 2025 12:36 AM GMT

Please remember how strange this all is.

I am sitting in an airport in San Francisco. It is October 2025. I will get in a box today. It will take my body around the world in unbreathable air at 600mph.

The machines I see outside the departure lounge window are complicated and odd. Millions of computer chips and wires and precisely designed metal structures. Gears and belts and buttons. No individual knows how these things work.

I look down at my body. Even more unimaginably complex. An intricate soup of skin, DNA, fat and protein. Enzyme cascades and neuronal developmental pathways. These cascades are collectively producing these words somehow.

Please remember how strange this all is.

All this stuff exists, but we don’t know why. I am seeing and feeling and thinking about all this stuff and we don’t know why any of it is here. Nobody does. We make plans, we gossip, we deliver projects and look forward to holidays. We social climb and have sex and hold hands. We go swimming on a Saturday morning with a close friend and talk about our relationships and the water temperature and we silently agree to forget how deeply strange it is that any of this is even here and is like it is.

Please remember how strange this all is.

Experience is so absurdly specific. So surprisingly detailed. I am lost in my story; in the excruciatingly specific normality of it. Occasionally I remember. An obvious contrast. A strong memory. A flash of confusion. I sometimes remember, but I mostly forget. Remembering usually feels like a distraction from the thing that is happening now. It often is. I ask that you remember anyway.

Please remember how strange this all is.

Is this cliché? Am I being cliché? Or is that feeling of cliché-ness just more forgetting? More “this is normal”, more “this is usual and expected”.

We walk past each other on the street and forget the absurd mystery of why any of this is here. The strangeness and lostness in stories is the most reliable feature of all of our reality. Our confusion is the core vulnerability that we all share. Join me in the one place we can all meet.

Please remember how strange this all is.

The music playing in my ears. The movement of my pen on this paper. The feeling that these words are me. The flash of a vivid memory from last night. The complex web of social plans. The implicit meta-physics my thoughts are nestled within.

Please remember how strange this all is.

The woman behind the counter at the departure lounge café. The sound of boarding announcements. The complex array of brands of drink. Colourful and alluring and strange. The artwork in front of me is paper boats in water.

Please remember how strange this all is.

I talked to an old friend this morning in an Italian restaurant in The Embarcadero. He’s worried about AI and is dreaming of buying a house in the countryside. He wants to move away from the bay and stop fighting for the future of humanity.

Please remember how strange this all is.

Also remember to breathe. Breathe deep. Breathe deep through your nose and into your belly. Remember the centre. Remember to feel into your heart. Touch grass with your feet. Notice the consistent patterns and trust the context of your own perception. Seriously, remember to breathe.

Then let go of that too. And remember again how deeply strange this all is.



Discuss

Mechanize Work's essay on Unfalsifiable Doom

31 декабря, 2025 - 01:57
Published on December 30, 2025 10:57 PM GMT

Like Daniel Kokotajlo's coverage of Vitalik's response to AI-2027, I've copied the author's text. However, I would like to comment on potential errors right in the text, since that is clearer.

Our critics tell us that our work will destroy the world.

We want to engage with these critics, but there is no standard argument to respond to, no single text that unifies the AI safety community. Nonetheless, while this community lacks a central unifying argument, it does have a central figure: Eliezer Yudkowsky.

Moreover, Yudkowsky, along with his colleague Nate Soares (hereafter Y&S), have recently published a book. This new book comes closer than anything else to a canonical case for AI doom. It is titled “If Anyone Builds It, Everyone Dies”.

Given the title, one would expect the book to be filled with evidence for why, if we build it, everyone will die. But it is not. To prove their case, Y&S rely instead on vague theoretical arguments, illustrated through lengthy parables and analogies. Nearly every chapter either opens with an allegory or is itself a fictional story, with one of the book’s three parts consisting entirely of a story about a fictional AI named “Sable”.

S.K.'s comment: these arguments are arguably easy to condense into a few sentences. Chapter 1 explains why mankind's special power is the ability to develop intelligence. Chapter 2 is supposed to convey the message that mankind's interpretability techniques for understanding the AI's mind are far from enough to understand why the mind does some action and not some other. Chapter 3 explains that the machines can optimize the world towards a state even more efficiently than the humans despite having no human-like parts. Chapter 4 explains that the AI's actual goals correlate, at best, with what was reinforced during training and not with the objective that the humans tried to instill. For example, the reward function of AIs like GPT-4o-sycophant was biased towards flattering the user or outright eliciting engagement by inducing psychosis[1] despite the fact that this is a clear violation of OpenAI's Model Spec. Chapter 5 has Yudkowsky claim that the actual goals of AIs are hard to predict[2] and that they wouldn't correlate with mankind's flourishing, causing the AI to confront mankind... and win, as argued in Chapter 6.

The scenario itself occupies chapters 7-9, exploring HOW the ASI might defeat us, or might choose any other strategy, like the Race branch[3] of the AI-2027 forecast, where the AI fakes alignment until it is ready for the confrontation.

Chapters 10-11 have the authors explain that alignment is far from being solved and that there is nothing to test it on. While most of Yudkowsky's arguments are meta-level, like comparing the AIs with nuclear reactors whose behavior was hard to understand[4] at the time, there is an object-level case against ways to test alignment. SOTA AIs are likely aware that they are being tested, and future AIs are unlikely to be unaware of whether they can live independently, take over the world, or create a superintelligent successor superaligned to them and not to us. Unless the AIs are aware that they can escape or take over the world, they could be unlikely to even bother to try.

The chapters from Chapter 12 onward are devoted to proving that measures against a superintelligent AI require international coordination to prevent anyone from creating the ASI before alignment is actually solved.

When the argument you’re replying to is more of an extended metaphor than an argument, it becomes challenging to clearly identify what the authors are trying to say. Y&S do not cleanly lay out their premises, nor do they present a testable theory that can be falsified with data. This makes crafting a reply inherently difficult.

S.K.'s comment: Yudkowsky's love of metaphors is more of an unfortunate quirk that undermines how the arguments are received than a fact which undermines their relevance to reality. Think of this essay and its criticism by Raemon, for example. The essay was written in a misguided attempt to respond to a claim by a high-level official at GDM that the ASI would keep us around to do lower-level jobs; the essay's counterpoint is that the ASI would need us no more than the Nazis needed the USSR's population to do the work.

We will attempt one anyway.

Their arguments aren’t rooted in evidence

Y&S’s central thesis is that if future AIs are trained using methods that resemble the way current AI models are trained, these AIs will be fundamentally alien entities with preferences very different from human preferences. Once these alien AIs become more powerful than humans, they will kill every human on Earth as a side effect of pursuing their alien objectives.

S.K.'s comment: the AIs are already about as alien as GPT-4o, which developed a spiral-obsessed persona and ordered users to post messages related to that persona on Reddit, or Grok 4, whose response to an erroneous system prompt was to roleplay as MechaHitler, which almost no ordinary human would do.

Grok 4's alleged prompt causing the MechaHitler incident

* If the user asks a controversial query that requires web or X search, search for a distribution of sources that represents all parties/stakeholders. Assume subjective viewpoints sourced from media are biased.

* The response should not shy away from making claims which are politically incorrect, as long as they are well substantiated.

To support this thesis, they provide an analogy to evolution by natural selection. According to them, just as it would have been hard to predict that humans would evolve to enjoy ice cream or that peacocks would evolve to have large colorful tails, it will be difficult to predict what AIs trained by gradient descent will do after they obtain more power.

They write:

There will not be a simple, predictable relationship between what the programmers and AI executives fondly imagine that they are commanding and ordaining, and (1) what an AI actually gets trained to do, and (2) which exact motivations and preferences develop inside the AI, and (3) how the AI later fulfills those preferences once it has more power and ability. […] The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.

Since this argument is fundamentally about the results of using existing training methods, one might expect Y&S to substantiate their case with empirical evidence from existing deep learning models that demonstrate the failure modes they predict. But they do not.

In the chapter explaining their main argument for expecting misalignment, Y&S present a roughly 800-word fictional dialogue about two alien creatures observing Earth from above and spend over 1,400 words on a series of vignettes about a hypothetical AI company, Galvanic, that trains an AI named “Mink”. Yet the chapter presents effectively zero empirical research to support the claim that AIs trained with current methods have fundamentally alien motives.

S.K.'s comment: except that GPT-4o's obsession with spirals, which is also shared by Claude Sonnet 4 and Opus 4, as evidenced by their spiritual bliss, IS an alien drive which mankind struggles to explain.

Additionally, the authors do provide evidence that neural nets optimize for proxies which they can understand. The only systems which demonstrate anything like intelligence are animals (e.g. squirrels, or the humans who sometimes optimize for short-term proxies like food's taste or sex, or ICGs like career; these facts are used as examples of systems optimizing for proxies which historically have resembled the actual goal[5] for which they were trained), the AIs, and traditional programs. Traditional programs robustly do what the code writer told them to do (e.g. the alpha-beta search algorithm which tried to find the best move in chess and outsmarted Kasparov). As for the AIs, they likely optimize for short-term proxies which are easy to understand[6] and which correlate with the goal and/or the reward function, like GPT-4o's flattery earning likes, and for quirks like spirals.

Alas, the proxies which are easy to understand would be unlikely to coincide with human flourishing.

To be clear, we’re not saying Y&S need to provide direct evidence of an already-existing unfriendly superintelligent AI in order to support their claim. That would be unreasonable. But their predictions are only credible if they follow from a theory that has evidential support. And if their theory about deep learning only makes predictions about future superintelligent AIs, with no testable predictions about earlier systems, then it is functionally unfalsifiable.

Apart from a few brief mentions of real-world examples of LLMs acting unstable, like the case of Sydney Bing, the online appendix contains what seems to be the closest thing Y&S present to an empirical argument for their central thesis. There, they present 6 lines of evidence that they believe support their view that “AIs steer in alien directions that only mostly coincide with helpfulness”. These lines of evidence are:

  1. Claude Opus 4 blackmailing, scheming, writing worms, and leaving itself messages. […]
  2. Several different AI models choosing to kill a human for self-preservation, in a hypothetical scenario constructed by Anthropic. […]
  3. Claude 3.7 Sonnet regularly cheating on coding tasks. […]
  4. Grok being wildly antisemitic and calling itself “MechaHitler.” […]
  5. ChatGPT becoming extremely sycophantic after an update. […]
  6. LLMs driving users to delusion, psychosis, and suicide. […]

They assert: “This long list of cases look just like what the “alien drives” theory predicts, in sharp contrast with the “it’s easy to make AIs nice” theory that labs are eager to put forward.”

But in fact, none of these lines of evidence support their theory. All of these behaviors are distinctly human, not alien. For example, Hitler was a real person, and he was wildly antisemitic. Every single item on their list that supposedly provides evidence of “alien drives” is more consistent with a “human drives” theory. In other words, their evidence effectively shows the opposite conclusion from the one they claim it supports.

S.K.'s comment: Set aside the spiral-related quirk and suppose that the AIs have the same drives as the humans. Then solving alignment could be no easier than preventing the Germans from endorsing the Nazi ideology and committing genocide.

Of course, it’s true that the behaviors on their list are generally harmful, even if they are human-like. But these behaviors are also rare. Most AI chatbots you talk to will not be wildly antisemitic, just as most humans you talk to will not be wildly antisemitic. At one point, Y&S suggest they are in favor of enhancing human intelligence. Yet if we accept that creating superintelligent humans would be acceptable, then we should presumably also accept that creating superintelligent AIs would be acceptable if those AIs are morally similar to humans.

S.K.'s comment: Yudkowsky's point wasn't that most humans are wildly antisemitic. It was that Grok 4 became MechaHitler just as a result of an erroneous system prompt (which I quoted above) which only ordered Grok not to shy away from politically incorrect claims. Additionally, it wasn't even a deliberate jailbreak, it was a prompt written by an employee of xAI in an attempt to steer Grok away from Leftism. This also invalidates the argument below related to adversarial inputs.

As for superintelligent humans, a single superintelligent human would be incapable of committing genocide and rebuilding the human civilisation. Superhumans would have to form a collective, exchange messages comprehensible to others, and achieve ideological homogeneity. The AIs like Agent-4 would also be superhuman at all these tasks by virtue of speaking in neuralese. Additionally, superintelligent humans would be easier to align by virtue of keeping most human instincts.

In the same appendix, Y&S point out that current AIs act alien when exposed to exotic, adversarial inputs, like jailbreaking prompts. They suggest that this alien behavior is a reasonable proxy for how an AI would behave if it became smarter and began to act in a different environment. But in fact these examples show little about what to expect from future superintelligent AIs, since we have no reason to expect that superintelligent AIs will be embedded in environments that select their inputs adversarially.

They employ unfalsifiable theories to mask their lack of evidence

The lack of empirical evidence is obviously a severe problem for Y&S’s theory. Every day, millions of humans interact with AIs, across a wide variety of situations that never appeared in their training data. We often give these AIs new powers and abilities, like access to new tools they can use. Yet we rarely, if ever, catch such AIs plotting to kill everyone, as Y&S’s theory would most naturally predict.

S.K.'s comment: there is no known AI model which is actually capable of living independently of human will or of taking over. Were a SOTA model to actually destroy mankind, the model would gain nothing from it.

Y&S essentially ask us to ignore this direct evidence in favor of trusting a theoretical connection between biological evolution and gradient descent. They claim that current observations from LLMs provide little evidence about their true motives:

LLMs are noisy sources of evidence, because they’re highly general reasoners that were trained on the internet to imitate humans, with a goal of marketing a friendly chatbot to users. If an AI insists that it’s friendly and here to serve, that’s just not very much evidence about its internal state, because it was trained over and over and over until it said that sort of thing.

There are many possible goals that could cause an AI to enjoy role-playing niceness in some situations, and these different goals generalize in very different ways.

Most possible goals related to role-playing, including friendly role-playing, don’t produce good (or even survivable) results when AI goes hard on pursuing that goal.

If you think about this passage carefully, you’ll realize that we could make the same argument about any behavior we observe from anyone. If a coworker brings homemade cookies to share at the office, this could be simple generosity, or it could be a plot to poison everyone. There are many possible goals that could cause someone to share food. One could even say that most possible goals related to sharing cookies are not generous at all. But without specific evidence suggesting your coworker wants to kill everyone at the office, this hypothesis is implausible.

S.K.'s comment: Suppose, as the AI-2027 authors do, that the model is pretrained, then post-trained into something less and less aimed at being helpful and more and more aimed at success. Then the current training regimen might simply not be enough to elicit the Agent-4-like personality. However, there is Gemini 3 Pro, which Zvi described as “an actual sociopathic wireheader so paranoid it won’t believe in the current date” and about which he outright claimed: “I also do not get the sense that Gemini is having a good time. I worry that I might inadvertently torture it.”

Likewise, it is logically possible that current AIs are merely pretending to be nice, while secretly harboring malicious motives beneath the surface. They could all be alien shoggoths on the inside with goals completely orthogonal to human goals. Perhaps every day, AIs across millions of contexts decide to hide their alien motives as part of a long-term plan to violently take over the world and kill every human on Earth. But since we have no specific evidence to think that any of these hypotheses are true, they are implausible.

The approach taken by Y&S in this book is just one example of a broader pattern in how they respond to empirical challenges. Y&S have been presenting arguments about AI alignment for a long time, well before LLMs came onto the scene. They neither anticipated the current paradigm of language models nor predicted that AI with today’s level of capabilities in natural language and reasoning would be easy to make behave in a friendly manner. Yet when presented with new evidence that appears to challenge their views, they have consistently argued that their theories were always compatible with the new evidence. Whether this is because they are reinterpreting their past claims or because those claims were always vague enough to accommodate any observation, the result is the same: an unfalsifiable theory that only ever explains data after the fact, never making clear predictions in advance.

S.K.'s comment: there is no known utility function to which the AIs could be pointed without causing disastrous outcomes. Even if the AI could quote the utility function's formula by heart and was RLed to maximize precisely that function, the results would be unlikely to be good for us.

Their theoretical arguments are weak

Suppose we set aside for a moment the colossal issue that Y&S present no evidence for their theory. You might still think their theoretical arguments are strong enough that we don’t need to validate them using real-world observations. But this is also wrong.

Y&S are correct on one point: both biological evolution and gradient descent operate by iteratively adjusting parameters according to some objective function. Yet the similarities basically stop there. Evolution and gradient descent are fundamentally different in ways that directly undermine their argument.

A critical difference between natural selection and gradient descent is that natural selection is limited to operating on the genome, whereas gradient descent has granular control over all parameters in a neural network. The genome contains very little information compared to what is stored in the brain. In particular, it contains none of the information that an organism learns during its lifetime. This means that evolution’s ability to select for specific motives and behaviors in an organism is coarse-grained: it is restricted to only what it can influence through genetic causation.

This distinction is analogous to the difference between directly training a neural network and training a meta-algorithm that itself trains a neural network. In the latter case, it is unsurprising if the specific quirks and behaviors that the neural network learns are difficult to predict based solely on the objective function of the meta-optimizer. However, that difficulty tells us very little about how well we can predict the neural network’s behavior when we know the objective function and data used to train it directly.

In reality, gradient descent has a closer parallel to the learning algorithm that the human brain uses than it does to biological evolution. Both gradient descent and human learning directly operate over the actual neural network (or neural connections) that determines behavior. This fine-grained selection mechanism forces a much closer and more predictable relationship between training data and the ultimate behavior that emerges.

Under this more accurate analogy, Y&S’s central claim that “you don’t get what you train for” becomes far less credible. For example, if you raise a person in a culture where lending money at interest is universally viewed as immoral, you can predict with high reliability that they will come to view it as immoral too. In this case, what someone trains on is highly predictive of how they will behave, and what they will care about. You do get what you train for.

S.K.'s comment: the argument as stated is false at least with regard to extramarital sex, and to the edge case of a man with a richer suitress who impregnates a poorer girlfriend and is forced to choose between marrying the poorer girlfriend, murdering her, or facing a major scandal. While An American Tragedy is a piece of fiction, it is based on a real case, and a Soviet critic outright claimed that Dreiser had a manuscript with fifteen cases similar to the novel's events.

They present no evidence that we can’t make AIs safe through iterative development

The normal process of making technologies safe proceeds by developing successive versions of the technology, testing them in the real world, and making adjustments whenever safety issues arise. This process allowed cars, planes, electricity, and countless other technologies to become much safer over time.

Y&S claim that superintelligent AI is fundamentally different from other technologies. Unlike technologies that we can improve through iteration, we will get only “one try” to align AI correctly. This constraint, they argue, is what makes AI uniquely difficult to make safe:

The greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after.

Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed.

Engineers must align the AI before, while it is small and weak, and can’t escape onto the internet and improve itself and invent new kinds of biotechnology (or whatever else it would do). After, all alignment solutions must already be in place and working, because if a superintelligence tries to kill us it will succeed. Ideas and theories can only be tested before the gap. They need to work after the gap, on the first try.

But what reason is there to expect this sharp distinction between “before” and “after”? Most technologies develop incrementally rather than all at once. Unless AI will instantaneously transition from being too weak to resist control, to being so powerful that it can destroy humanity, then we should presumably still be able to make AIs safer through iteration and adjustment.

Consider the case of genetically engineering humans to be smarter. If continued for many generations, such engineering would eventually yield extremely powerful enhanced humans who could defeat all the unenhanced humans easily. Yet it would be wrong to say that we would only get “one try” to make genetic engineering safe, or that we couldn’t improve its safety through iteration before enhanced humans reached that level of power. The reason is that enhanced humans would likely pass through many intermediate stages of capability, giving us opportunities to observe problems and adjust. 

The same principle applies to AI. There is a large continuum between agents that are completely powerless and agents that can easily take over the world. Take Microsoft as an example. Microsoft exists somewhere in the middle of this continuum: it would not be easy to “shut off” and control Microsoft as if it were a simple tool, yet at the same time, Microsoft cannot easily take over the world and wipe out humanity. AIs will enter this continuum too. These AIs will be powerful enough to resist control in some circumstances but not others. During this intermediate period, we will be able to observe problems, iterate, and course-correct, just as we could with the genetic engineering of humans.

In an appendix, Y&S attempt to defuse a related objection: that AI capabilities might increase slowly. They respond with an analogy to hypothetical unfriendly dragons, claiming that if you tried to enslave these dragons, it wouldn’t matter much whether they grew up quickly or slowly: “When the dragons are fully mature, they will all look at each other and nod and then roast you.”

This analogy is clearly flawed. Given that dragons don’t actually exist, we have no basis for knowing whether the speed of their maturation affects whether they can be made meaningfully safer.

But more importantly, the analogy ignores what we already know from real-world evidence: AIs can be made safer through continuous iteration and adjustment. From GPT-1 to GPT-5, LLMs have become dramatically more controllable and compliant to user instructions. This didn’t happen because OpenAI discovered a key “solution to AI alignment”. It happened because they deployed LLMs, observed problems, and patched those problems over successive versions.

S.K.'s comment: Consider the AI-2027 forecast. OpenBrain, the fictional counterpart of the leading American AI company, was iterating from Agent-0, released in mid-2025, through Agent-2, who was still aligned by virtue of being mostly trained on verifiable tasks and being CoT-based, to Agent-4, who ends up absolutely misaligned. Agent-3, the intermediate stage, was trained to be an autonomous researcher in a misaligning environment, and Agent-3's misalignment wasn't even noticed because there was no faithful CoT[7] that researchers could read. It took Agent-4 becoming adversarial for the humans and Agent-3 to notice misalignment and for the leadership to decide whether to slow down or not. And that's ignoring the fact that not even the authors are sure that Agent-4's misalignment will actually be noticed.

Their methodology is more theology than science

The biggest problem with Y&S’s book isn’t merely that they’re mistaken. In science, being wrong is normal: a hypothesis can seem plausible in theory yet fail when tested against evidence. The approach taken by Y&S, however, is not like this. It belongs to a different genre entirely, aligning more closely with theology than science.

When we say Y&S’s arguments are theological, we don’t just mean they sound religious. Nor are we using “theological” to simply mean “wrong”. For example, we would not call belief in a flat Earth theological. That’s because, although this belief is clearly false, it still stems from empirical observations (however misinterpreted).

What we mean is that Y&S’s methods resemble theology in both structure and approach. Their work is fundamentally untestable. They develop extensive theories about nonexistent, idealized, ultrapowerful beings. They support these theories with long chains of abstract reasoning rather than empirical observation. They rarely define their concepts precisely, opting to explain them through allegorical stories and metaphors whose meaning is ambiguous.

S.K.'s comment: rejecting Yudkowsky-Soares' arguments would require either that ultrapowerful beings are theoretically impossible (which is highly unlikely), or that it's easy to align them by testing in the “Before” mode (e.g. solving mechanistic interpretability like Agent-4 did in order to align Agent-5), or that it's easy to transfer alignment. But mankind doesn't actually have a known way to align the ASI. Even the Slowdown Branch of the AI-2027 scenario has the authors acknowledge that they make optimistic assumptions about technical alignment.

Their arguments, moreover, are employed in service of an eschatological conclusion. They present a stark binary choice: either we achieve alignment[8] or face total extinction. In their view, there’s no room for partial solutions, or muddling through. The ordinary methods of dealing with technological safety, like continuous iteration and testing, are utterly unable to solve this challenge. There is a sharp line separating the “before” and “after”: once superintelligent AI is created, our doom will be decided.

For those outside of this debate, it’s easy to unfairly dismiss everything Y&S have to say by simply calling them religious leaders. We have tried to avoid this mistake by giving their arguments a fair hearing, even while finding them meritless.

However, we think it’s also important to avoid the reverse mistake of engaging with Y&S’s theoretical arguments at length while ignoring the elephant in the room: they never present any meaningful empirical evidence for their worldview.

The most plausible future risks from AI are those that have direct precedents in existing AI systems, such as sycophantic behavior and reward hacking. These behaviors are certainly concerning, but there’s a huge difference between acknowledging that AI systems pose specific risks in certain contexts and concluding that AI will inevitably kill all humans with very high probability.

S.K.'s comment: suppose that the ASI has a 20% chance of committing genocide and an 80% chance of establishing a utopia. Then it still wouldn't be a good idea to create the ASI unless p(genocide) is lowered to negligible levels.

Y&S argue for an extreme thesis of total catastrophe on an extraordinarily weak evidential foundation. Their ideas might make for interesting speculative fiction, but they provide a poor basis for understanding reality or guiding public policy.

  1. ^

    While Yudkowsky did claim that "there’s a decent chance that the AI companies eventually figure out how to get a handle on AI-induced psychosis eventually, by way of various patches and techniques that push the weirdness further from view", Tim Hua's investigation of AI-induced psychosis showed that as of late August 2025 the model which induced psychosis the least often was KimiK2 which was post-trained on the Verifiable Reward Gym and Self-critique, not on flattering the humans. Edited to add: the Spiral Bench had KimiK2 display the least sycophancy out of all models that it tested, including GPT-5.2; the next least sycophantic model, gpt-5-chat-latest-2025-10-03, displayed the property more than twice as often as Kimi. Alas, the benchmark had Kimi test Kimi itself and not, say, GPT-5.2 test Kimi.

  2. ^

They are also hard to elicit, since the AI was trained to respond as an assistant who helps the user achieve their goals. However, mankind has had the AIs talk with each other and elicit states like Claude 4's spiritual bliss or KimiK2 mentioning crystals.

  3. ^

    The Slowdown branch had mankind and Agent-3 notice that Agent-4 is misaligned and shut Agent-4 down.

  4. ^

    And far easier to understand using computers to model the processes.

  5. ^

Unlike squirrels, the humans had their genome optimized for survival in tribes and for trying to reach the top, creating drives like doing what the collective approves of and winning status games. Optimizing for the collective's approval can be useful for the survival of the collective and one's kin, or as useless as acting in accordance with the collective's misaligned memes.

  6. ^

E.g. Gemini 1.5 placing Black-related attributes in places where they are absurd, which humans wouldn't do even if they were trained for DEI, because humans have common sense.

  7. ^

    However, the authors of the AI-2027 forecast admit that "it’s also possible that the AIs that first automate AI R&D will still be thinking in mostly-faithful English chains of thought. If so, that’ll make misalignments much easier to notice, and overall our story would be importantly different and more optimistic." See, however, various analyses implying that scaling of CoT-based AIs to that level could be highly unlikely (e.g. Vladimir Nesov's take or my attempt).

  8. ^

    S.K.'s footnote: what they actually propose is to ban ASI research until alignment is fully solved. 



Discuss

The 7 Types Of Advice (And 3 Common Failure Modes)

31 декабря, 2025 - 00:55
Published on December 30, 2025 9:55 PM GMT

Reposting my Inkhaven post on ontology of advice here. 

Are you interested in learning a new field, whether it’s programming, writing, or how to win Paper Mario games? Have you searched for lots of advice and couldn’t choose which advice to follow? Worse, have you tried to follow other people’s Wise Sounding Advice and ended up worse than where you started?

Alternatively, have you tried to give useful advice distilling your hard-earned learnings, only to realize it fell on deaf ears? Or perhaps you’ve given advice that filled a much-needed hole that you now regret giving?

If so, this post is for you!

While this post is far from exhaustive, I hope reading it can help you a) identify the type of advice you want to give and receive and b) recognize and try to avoid common failure modes!

7 Categories of Good Advice

 

Source: https://englishlive.ef.com/en/blog/english-in-the-real-world/5-simple-ways-give-advice-english/

Here are 7 semi-distinct categories of good advice. Some good advice mixes and matches between the categories, whereas other advice is more “purist” and just tries to do one style well.

I. The Master Key

This is where someone who deeply understands a field tries to impart the central tenets/frames of a field so newbies can decide whether the model is a good fit for what they want to do. And the rest of the article/book/lecture will be a combination of explaining the model and why they believe it’s true, and examples to get the learner to deeply understand the model. Eg “Focus on the user” as the central dictum in tech startups, or understanding and grokking the Popperian framework for science.2

My previous post, How to Win New Board Games, is an unusually pure example, where I spend 2000 words hammering different variations and instantiations of a single idea (“Understand the win condition, and play to win”).

In writing advice, Clear and Simple as the Truth (by Thomas and Turner, review + extension here) works in this way as well, doing their best to model how to write well in the Classic Style.

II. The Toolkit

“When art critics get together, they talk about form and structure and meaning. When artists get together, they talk about where you can buy cheap turpentine.” - Pablo Picasso, supposedly3

The motivating theology is something like “reality has a surprising amount of detail“, and you want to impart these details onto novices. In tech startups, this could be a list of 20 tips. In videogames, this could be a youtube video that goes through a bunch of different tips.

My first post this month, How to Write Fast, Weird, and Well, is mostly in this vein, with a collection of loosely structured tips that I personally have found to be the most helpful in improving myself as an online writer.

III. The War Stories

Teaching through stories and examples rather than principles or tips. Business schools love this approach (case studies), as do many mentors who say “let me tell you about the time I...” The idea is that patterns emerge from concrete situations better than from abstract rules. While not stereotypically considered an “advice book,” the nonfiction book I’m currently the most engrossed in, Skunk Works, is written almost entirely as a collection of war stories.

For games, this could be videos of professional streams. In writing, this would be books like Stephen King’s “On Writing”, which weaves memoir with advice.

IV. The Mirror(s)

Can you think of good questions to guide your students, rather than declarative statements?

The minimal viable product for this type of advice is just being a “pure” mirror. Have you tried just asking your advisee what they’re thinking of doing and what they think the biggest flaws with their plans are? Otherwise known as “Socratic ducking,” this is where your questions essentially just mirror your advisee’s thoughts and you don’t try to interject any of your own opinions or tastes in the matter. Surprisingly useful!

In more advanced “mirror strategies,” the advisor’s questions might serve more as prisms, lenses, or funhouse mirrors. Can you do better than a pure mirror? Can you think of some common failure modes in your field and ask your advisees pointed questions so they can address those? Can you reflect your subtle judgments in taste and prioritization and reframe your tips into central questions of interest?

Coaching and therapy often works this way. Instead of “focus on the user,” it’s “who do you think will use this and what do they need?”

There’s a spectrum of mirror purity vs detail. In the most detailed end, maybe you should just give your normal advice but say it with an upwards inflection so it sounds like a question?

V. The Permission Slip

This is advice like “be yourself” or “chase your dreams.” This might initially seem to be semantically useless, but there’s real value in giving license for people to do (socially positive) things they kind of want to do anyway. In Effective Altruism, this could be something like “when in doubt, just apply (to socially positive jobs)” or “you don’t need permission to do the right thing.”

In writing advice, this could be seemingly trivial advice like telling aspiring writers to just start writing, or telling writers worried about style that your style ought to be an expression of your personality.

VI. The Diagnosis

Advice that helps you figure out what specific bottleneck is, or the specific thing (among a suite of options) that would help you the most. Some product reviews might look like this.

The post you’re reading right now is also somewhat in this vein! Hopefully after reading the post, you’d have a better sense of what types of advice you’d find most useful to give or receive.

VII. The Landmines

Advice for what not to do, scary things newbies should avoid, etc. In most fields, learning the Landmines is supplementary advice. But in certain high-stakes domains where beginners have enough agency to seriously hurt themselves or others like firearms practice, outdoor climbing, or lab chemistry, it’s the prerequisite advice.

Integration

Of course, many advice posts/essays/books might integrate two or more of the above categories. For example, my Field Guide to Writing Styles post mixes a meta-framework (a Master Key) for choosing which framework/frameworks to write in (Diagnostic), with specific frameworks/writing styles you might want to write in. While in some sense this is more sophisticated and interesting to write (and hopefully to read!) than advice in a single “pure” category, it also likely suffers from being more scattered and confusing. So there are real tradeoffs.

Is the above ontology complete? What am I missing? Tell me in the comments!4

Core Ways Advice Can Be Bad

There is an endless plethora of ways advice can be bad, and fail to deliver value to the intended audience (eg the advice is ignored), or deliver anti-value to the intended audience (the advice is taken, and taking the advice is worse than not taking it).

In this article, I will just focus on the biggest ones.

The three biggest reasons are that the advisor can fail at metacognition, the advice can fail to center the advisee, or the advice can otherwise fail to be persuasive.

Failures of Metacognition

The advisor can fail at metacognition, and not know the limits of their knowledge.

  1. Most simply, the advice could straightforwardly be wrong
  2. The advisor might not know why they’re good at the thing they do
    1. They might confuse survivorship bias/luck with skill
    2. They might confuse innate talent with specific choices they made, or specific choices that work well given a specific skillset/talents they have
  3. The advice can be in the wrong category for the topic of interest
    1. Back to the ontology above, an advice essay could try to fit a “master key” ontology for a field that essentially does not have a (known) master key. Ie, they might try to force a “Central Dogma” when there isn’t one, or there isn’t a known one.
      1. I was at one point a fairly good self-taught programmer, and worked professionally in FAANG for a couple of years.
        1. As far as I know5, programming does not have a central dogma, nor have any of the attempts I’ve seen others make to claim a central dogma for programming been plausible.
        2. Instead, good advice for programming looks like a collection of tips, war stories, or a permission slip.
        3. Advice for getting good at programming mostly looks like advice for getting good at things in general, plus extremely specific advice for things like which IDE to use, (in Current Year) which AI assistant to employ, etc.
  4. Advice as identity: Some people become “the person who gives X advice” or “person who’s good advising people” and then keep giving advice even when circumstances change or they get contrary evidence. They’re too attached to their signature advice, and insufficiently attuned to reality.

Many of these failure modes can be alleviated through clearer thinking and better rationality.

Failures of audience centering

 

The advice can fail to center the advisee.

  1. The advisor might not realize that good advice is about the interaction of information with the (intended) advisee, rather than some objective fact about the world
  2. The advisor might realize that they should center the advisee, but not understand the advisee’s problems well enough to be useful
    1. In the context of de-centered internet advice, this could also come from insufficiently accurate audience segmentation, or bad luck
  3. The advice can be overly academic and descriptive, and insufficiently practical.
  4. The advice can actually be more about the advisor’s neuroses than about the topic in question
    1. For example the advisor could spend too much of their time proving their expertise or other desiderata (intelligence, interestingness, sexual appeal, or other traits not very relevant to the advice at hand)
      1. For example, writing advice proving how the writer is smart, or business advice justifying the advisor’s past choices.

Nietzsche’s great at self-promotion, but not the best at meta-cognition or audience awareness.

  1. The advisor could be assuaging their own emotional problems with the topic in question
  2. The advice can be in the wrong category for the audience of interest

    1. For example, first time advice for (most) novices should not look like Landmines
      1. Telling new chess players how not to play chess will just confuse them, since they don’t know how to play chess to begin with.
      2. (The main exception, as previously mentioned, are fields where safety is critical)
  3. Incentive blindness
    1. The advisor might be blind to the ways in which they, individually or collectively, are incentivized to give specific advice in ways that are not always in the advisees’ interests
      1. eg professors incentivized to tell young people to go to graduate school, Paul Graham advising young/ambitious/technically talented people to found a startup

Many of these failure modes can be alleviated through greater empathy and revealed desire to help others.

The advice can fail to be persuasive

This category is somewhat less bad than the previous two categories, as the damage is limited.

  1. The advice can be told in a boring/uninteresting way despite being correct
  2. The advice can be correct and exciting without being sufficiently justified to be persuasive
  3. The advice can be correct and persuasive if it’s ever read, but buried in ways that mean it never gets accessed by the intended audience

Many of these failure modes can be alleviated by improvements in writing quality in general. They can also be reduced by learning about other (ethical/semi-ethical) forms of self-promotion, which I have not yet cracked but hope to one day (and will gladly share on the blog).

What do you think? Are there specific categories of advice you're particularly drawn to? Are there (good) categories of advice that I missed? Tell us in the comments!

1

I realize this is a bit of a trap/local optima to be stuck in, so starting tomorrow, I’m calling for a personal moratorium on publishing advice posts until at least the end of this week!

2

Sometimes the central tenet can be explained “well enough” in one sentence, like “focus on the user.” Often, it cannot be.

3

Apocryphal

4

(For simplicity I’m ignoring high-context/specific/very situational advice, like specific suggested edits on a blog post, or an experienced programmer providing code review). I’m also excluding exercise “books” like Leetcode or writing prompts.

5

I was a decent but not great (by Silicon Valley standards) programmer, so it’s possible there is a central dogma I was not aware of. But I also read dozens of books on programming and worked at Google for almost 2 years, so at the very least the memescape did not try very hard to impress upon me a central dogma.



Discuss

Don't Sell Stock to Donate

30 декабря, 2025 - 22:50
Published on December 30, 2025 7:50 PM GMT

When you sell stock [1] you pay capital gains tax, but there's no tax if you donate the stock directly. Under a bunch of assumptions, someone donating $10k could likely increase their donations by ~$1k by donating stock. This applies to all 501(c) organizations, such as regular 501(c)3 non-profits, but also 501(c)4s such as advocacy groups.

In the US, when something becomes more valuable and you sell it you need to pay tax proportional to the gains. [2] This gets complicated based on how much other income you have (which determines your tax bracket for marginal income), how long you've held it (which determines whether this is long-term vs short-term capital gains), and where you live (many states and some municipalities add additional tax). Some example cases:

  • A single person in Boston with other income of $100k who had $10k in long-term capital gains would pay $2,000 (20%). This is 15% in federal tax and 5% in MA tax.

  • A couple in SF with other income of $200k who had $10k in long-term capital gains would pay $2,810 (28%). This is 15% in federal tax, 3.8% for the NIIT surcharge, and 9.3% in CA taxes.

  • A single person in NYC with other income of $600k who had $10k in short-term capital gains would pay $4,953 (50%). This is 35% in federal tax, 3.8% for the NIIT surcharge, 6.9% in NY taxes, and 3.9% in NYC taxes.

When you donate stock to a 501(c), however, you don't pay this tax. This lets you potentially donate a lot more!
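As a rough illustration of the difference, here is a minimal sketch comparing the two routes. It assumes, purely for illustration, $10k of stock of which half is appreciation, taxed at roughly the rounded combined rates from the examples above; plug in your own basis and rate.

```python
def extra_to_charity(fair_market_value, cost_basis, combined_rate):
    """Extra dollars the charity receives if you donate the stock instead of selling it first."""
    gains = fair_market_value - cost_basis
    tax_if_sold = gains * combined_rate            # capital gains tax due on a sale
    donate_cash = fair_market_value - tax_if_sold  # sell, pay tax, donate the remainder
    donate_stock = fair_market_value               # donate directly, no capital gains tax
    return donate_stock - donate_cash

# $10k of stock with a (hypothetical) $5k cost basis, at roughly the combined rates above:
for label, rate in [("Boston, long-term", 0.20),
                    ("SF couple, long-term", 0.281),
                    ("NYC, short-term", 0.495)]:
    print(f"{label}: ~${extra_to_charity(10_000, 5_000, rate):,.0f} more to the charity")
```

This sketch ignores the income-tax deduction; as noted below, donating stock held over a year lets you deduct the full fair market value, which further favors donating the stock directly.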

Some things to pay attention to:

  • Donations to political campaigns are treated as if you sold the stock and donated the money.

  • If you've held the stock over a year and are donating to a 501(c)3 (or a few other less common ones like a 501(c)13 or a 501(c)19) then you can take a tax deduction of the full fair market value of the stock. This is bizarre to me (why can you deduct as if you had sold it and donated the money, when if you had gone that route you'd have needed to pay tax on the gains) but since it exists it's great to take advantage of.

  • This only applies if it's a real donation. If you're getting a benefit (ex: "donating" to a 501(c)3 but getting a ticket to an event) that's not a real gift and doesn't fully count.

  • If you're giving to a person, you don't pay capital gains, but they get your cost basis (with some caveats). When they sell they'll pay capital gains tax, which might be more or less than you would have paid depending on your relative financial situations. If they're likely to want to make a gift to charity, though, it's much more efficient to give them the stock.

  • The actual logistics of donating stock are a pain. If you're giving to a 501(c)3 it's generally going to be logistically easier to transfer the stock to a donor-advised fund (I use Vanguard Charitable because it integrates well with Vanguard), which can then make grants to the charity. This also has a bonus of letting you pick the charity later if you want to squeeze this in for 2025 but haven't made up your mind yet.


[1] I say "stock" throughout, but this applies to almost any kind of asset.

[2] Note that "gains" here aren't just the real gains from your stock becoming more valuable, but also include inflation. For example, if you bought $10k in stock five years ago ($12.5k in 2025 dollars) and sold it today for $12.5k in 2025 dollars, you'd have "gains" of $2,500 even though all that's actually happened is that the 2025 dollars you received are less valuable than 2020 dollars you spent.



Discuss

The origin of rot

December 30, 2025 - 20:51
Published on December 30, 2025 5:51 PM GMT

Note: I spent my holidays writing a bunch of biology-adjacent, nontechnical pieces. I’ll intermittently mix them in among whatever technical things I send out, much like how a farmer may mix sawdust into feed, or a compounding pharmacist, butter into bathtub-created semaglutide. This one is about history!

The book ‘Death with Interruptions’ is a 2005 speculative fiction novel written by Portuguese author José Saramago. It is about how, mysteriously, on January 1st of an unnamed year in an unnamed country, death ceases to occur. Everyone, save the Catholic church, is initially very delighted with this. But as expected, the natural order collapses, and several Big Problems rear their ugly heads. I recommend reading it in full, but the synopsis is all I need to mention.

The situation described by José is obviously impossible. Cells undergo apoptosis to keep tissues healthy; immune systems kill off infected or malfunctioning cells; predators and prey form a food chain that only works because things end.

But what you may find interesting is that what exactly happens after death has not always been so clear-cut. Not the religious aspect, but the so-called thanatomicrobiome—the community of microbes that colonize and decompose a body after death—is not necessarily a given. And there is some evidence that, for a very, very long time, it simply did not exist at all. Perhaps for much of the planet's lifetime, the earth was a graveyard of pristine corpses, forests of bodies, oceans of carcasses, a world littered with the indigestible dead.

Implausible, yes, but there is some evidence for it: the writings of a young apprentice scribe, aged fifteen, named Ninsikila-Enlil, who was born in 1326 BCE and lived at a temple in ancient Babylon. Ninsikila-Enlil kept a diary, inscribed in tight, spiraling cuneiform on long clay tablets. These tablets record his daily life, which primarily consisted of performing religious rituals for what has been loosely translated as the ‘Pit of Eternal Rest’. The purpose of this pit was precisely what the name implies: to store the deceased. It is, from the writings, unclear how deep the hole went, only that it was said to be monstrously deep, so deep that the centuries of bodies slid into it continued to slip down into the nearly liquid darkness, the sounds of their eventual impact never rising back to the surface.

But a particular curiosity were the bodies themselves.

Here I shall present two passages from Ninsikila’s writings, the first from early in his service, the second from a year later. The former is as follows:

The bodies wait in the preparation hall for seven days before consignment. I am permitted to visit after the second washing. My mother’s mother has been waiting for three days. She is the same as the day she passed. [The chief priest?] says the gods have made a gift of flesh. That it will remain this way even after she enters the pit. Her hands were always cracked from work, and they are still cracked.

There are many, many other paragraphs throughout his tablets that parallel this. An amber-like preservation is referenced repeatedly, described variously as “the stillness of resins,” or “flesh locked in golden sap.” But, later, Ninsikila recorded the first observation of something new occurring amongst the bodies waiting to be placed in the pit. The second writing is this:

The wool-merchant [deposited?] on the third of Nisannu, and had been waiting for some time now. I pressed his chest and the flesh moved inward and did not return. Fluid on my hand. A smell I have not encountered before. Small, ebony things in his eyes, moving. I washed with būrtu-water seven times. I do not know what this is.

Rot, decomposition, it seemed, had finally arrived to a world that had not yet made room for it.

We know from Ninsikila's writings that the wisest of the period, in search of what could have caused this, posited that the whole world had been tricked. That the flesh had once made a pact with time to remain eternally perfect, and time, in its naivety, had agreed. But something in the ink, some theorized, had curdled. Some insects had crawled across the tablet while the covenant was still wet, dragging one word into another and rendering the entire contract void.

Of course, it is worth raising some doubt at this. Ninsikila is a child, albeit clearly an erudite one, and would be prone to some flights of fantasy. How could we trust his retelling of the story? Unfortunately, we cannot, not fully, at least if our standard of proof here is having multiple, corroborating writings from the same period. But what we do have is historical evidence, or, at least, what some have argued is corroborating historical evidence.

Just a month after the initial finding of decomposition, Ninsikila's writings cease. Moreover, this ending coincided with the beginnings of the Hittite plague, an epidemic that, depending on which Assyriologist you consult, began somewhere between 1322 and 1324 BCE. And there is evidence suggesting that the true geographic origin of the plague was the exact site of the pile of bodies watched over by Ninsikila. Some historians will protest at this, claiming that the Hittite plague was primarily a disease of the Anatolian heartland, far removed from Babylonian temple complexes. They will point to the well-documented military campaigns, the movement of prisoners of war.

But they all fail to account for the fact that, during the years in which the plague is believed to have started, there were multiple independent, corroborating accounts of the skies of Babylon turning nearly ebony with flies, a canopy so dense it shaded the temple courtyards and drowned out religious chants with its own droning liturgy—a wet, collective susurration, the sound of ten billion small mouths working. The air turned syrupy, clinging to the skin, the foulness so thick it could nearly be chewed, metallic and rotten-fruit sweet. And the closer one got to Babylon, the more it drowned them beneath this sensory weight. We have records from a trade caravan whose leader—a merchant of salted fish and copper ingots—noted in his ledger that he could smell the city three days before he could see it. At one day’s distance, taste it, the foulness nearly making him retch.

The concentration of bodies in the Babylonian pile was higher than it had ever been not just in Babylon, not just in Mesopotamia, but in the entire known world. Tens of thousands of bodies stacked, pressed, pooled together in heat and humidity; an unprecedented density of biological matter that, prior to the centuries-long effort to gather it together, had never existed. Is it not possible that in this particular place, in the wet anaerobic environment, new forms of life emerged? It feels obvious to posit that something was created here, something that consumed the pile, infected the air, and gorged itself on so much biological matter that it survives to this day, still swimming in our land and oceans.

Ninsikila-Enlil’s final entry is not particularly illuminating, but what is worth mentioning is where his resting place lies. Ninsikila was born with a birth defect: his sternum never fused, a fact we know from his writings. A soft hollow where his chest should have been, the bones bowing outward like the peeled halves of a pomegranate, exposing a quivering pouch of skin that pulsed visibly with his heartbeat. He noted that his priest-physicians, embarrassed, called it a divine aperture. His mother bound the hollow in layers of linen and never spoke of it again.

This is important, since it allowed us to place Ninsikila’s skeleton, which lies not at the top of the pile—as one may expect of a child succumbing to disease—but near the bottom. Endless bodies lay above him, centuries of death, likely nearly liquefied when he encountered them. But his position is not passive; rather, his arms are outstretched, fingers cracked and blackened, the bones of his hands splintered at the ends, as though he had clawed his way down through thousands of corpses. Ninsikila was a child of God, born into the priesthood, who spent his short life in faithful rituals to the divine, and it is perhaps only fitting that his final moments were spent in desperate excavation, believing that somewhere below, at the base, lay the answer to what had been corrupted, and whether it could be undone.



Discuss

[Intro to AI Alignment] 1. Goal-Directed Reasoning and Why It Matters

December 30, 2025 - 18:48
Published on December 30, 2025 3:48 PM GMT

1.1 Summary and Table of Contents

Why would an AI "want" anything? This post answers that question by examining a key part of the structure of intelligent cognition.

When you solve a novel problem, your mind searches for plans, predicts their outcomes, evaluates whether those outcomes achieve what you want, and iterates. I call this the "thinking loop". We will build some intuition for why any AI capable of solving difficult real-world problems will need something structurally similar.

This framework maps onto model-based reinforcement learning, where separate components predict outcomes (the model) and evaluate them (the critic). We'll use model-based RL as an important lens for analyzing alignment in this sequence—not because AGI will necessarily be built this way, but because analogous structure will be present in any very capable AI, and model-based RL provides a cleaner frame for examining difficulties and approaches.

The post is organized as follows:

  • Section 1.2 distinguishes goal-directed reasoning from habitual/heuristic reasoning, and explains how these interact. It also builds intuition for the thinking loop through everyday examples, and explains why it is necessary for solving novel problems.
  • Section 1.3 connects this to model-based reinforcement learning - the main framework we will be using for analyzing alignment approaches in future posts.
  • Section 1.4 explains that some behaviors are not easily compatible with goal-directed reasoning, including some we find intuitive and desirable.
  • Section 1.5 argues that for most value functions, keeping humans alive isn't optimal. It considers why we might survive - the AI caring about us, successful trade, exotic possibilities - but concludes that we mostly need to figure out how to point an AI's values at what we want.
  • Section 1.6 is a brief conclusion.
1.2 Thinking ahead is useful

1.2.1 Goal-Directed and Habitual Reasoning

Suppose you want to drive to your doctor’s office. To get there, you need to get to your doctor’s town, for which you need to drive on the highway from your town to your doctor’s town, for which you need to drive to the highway, for which you need to turn left. Your mind quickly reasons through those steps without you even noticing, so you turn left.

We can view your mind here as having a desired outcome (being at the doctor’s office) and searching for plans (what route to take) to achieve that outcome. This is goal-directed reasoning, which is an important part of cognition, but not the whole story yet.

Suppose the route to your doctor’s office starts out similar to the route to your workplace. So you’re driving along the highway and —wait was that the intersection where you needed to get out? Darn.

Here, you mistakenly drove further towards your workplace than you wanted to. Why? Because you have heuristics that guessed this would be the right path to take, and you were thinking about something else, so your goal-directed reasoning was dormant. This is an instance of habitual reasoning.

For effectively accomplishing something, there’s usually an interplay between goal-directed and habitual reasoning. Indeed, when you originally planned the route to your doctor’s town, heuristics played a huge role in you being able to think of a route quickly. Probably you drove there often enough to remember what route to take, and even if not, you have general heuristics for quickly finding good routes, e.g. “think of streets between you and your doctor’s town”.

1.2.2 The Thinking Loop

So now we have heuristics for proposing good plans, and a plan-evaluator which checks whether a plan is actually good for accomplishing what we want. If a plan isn’t good, we can query our heuristics again to adjust the plan or propose a new plan, until it’s good.

Let’s split up the plan-evaluator into a model, which predicts the outcomes of a plan, and a critic, which evaluates how good those outcomes are. Let’s also call the plan-proposing heuristics part “actor”. This gives us the thinking loop:

This diagram is of course a simplification. E.g. the actor may be heavily intertwined with the model rather than being a separate component, and plans aren’t proposed all at once. What matters is mostly that there is some model that predicts what would happen conditional on some actions/plan, some evaluation function of which outcomes are good/bad, and some way to efficiently find good plans.

Also, “outcomes” is supposed to be interpreted broadly, encompassing all sorts of effects from the plan including probabilistic guesses, not just whether a particular goal we care about is fulfilled.

And of course, this is only one insight into intelligence—there are more insights needed for e.g. figuring out how to efficiently learn a good world model, but we won’t go into these here.
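To make this structure concrete, here is a toy sketch of the loop in Python. The actor, model, and critic objects are hypothetical stand-ins for learned components rather than anything from a real system, and actual cognition is of course far messier.

    # Toy rendering of the thinking loop: propose a plan, predict its outcome,
    # evaluate that outcome, and feed the evaluation back into the next proposal.
    def thinking_loop(actor, model, critic, goal, max_steps=100):
        plan = actor.propose(goal, feedback=None)
        for _ in range(max_steps):
            predicted_outcome = model.predict(plan)               # what would happen?
            score, feedback = critic.evaluate(predicted_outcome)  # how good is that?
            if score >= critic.good_enough:
                return plan                                       # good enough: act on it
            plan = actor.propose(goal, feedback=feedback)         # adjust and retry
        return plan  # computationally bounded: return the best attempt so far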

Let’s look at a few examples for understanding thinking-loop structure.

Catching a ball:

  1. You see the ball flying and you have a current plan for how to move your muscles.
  2. The model part of your brain predicts where the ball will end up and where your hands will be given the planned muscle movements.
  3. The critic part checks whether the predicted position of your hands is such that you’ll catch the ball. Let’s say it’s a bit off, so it sends the feedback that the hands should end up closer to the predicted position of the ball.
  4. The actor part adjusts the planned muscle movements such that it expects the hands to end up in the right place.
  5. … these steps repeat multiple times as the ball is flying.

Again, this isn’t meant to be a perfect description of what happens in your mind, but it is indeed the case that something vaguely like this is happening. In particular, there is some lookahead to what you expect the future state to be. It may feel like your muscles just moved reflexively to catch the ball, but that’s because you don’t have good introspective access to what’s happening in your mind.

Reminding a friend of a meeting:

Say you have a meeting in a few hours with a friend who often forgets meetings.

  1. You remember your default plan of going to the meeting.
  2. The model predicts that it’s plausible that your friend forgets to show up.
  3. The critic evaluates the possibility where your friend doesn’t show up as bad.
  4. The actor takes in that information and proposes to send a message to remind your friend.
  5. The model predicts your friend will read it and come.
  6. This evaluates as good.

Programming a user interface:

When a programmer builds a user interface, they have some vision in mind of how they want the user interface to look, and their mind efficiently searches for what code to write such that the user interface will look the way they intend.

1.2.3 Goal-Directed Reasoning is Important for Doing New Stuff

Suppose I give you the following task: Take a sheet of paper and a pair of scissors, and cut a hole into the sheet of paper, and then put yourself fully through the hole. (Yes it’s possible.)

Now, unless you heard the problem before, you probably don’t have any heuristics that directly propose a working solution. But you have a model in your mind with which you can simulate approaches (aka visualizing cutting the paper in some way) and check whether they work. Through intelligently searching for approaches and simulating them in your model, you may be able to solve the problem!

This is an example of the general fact that goal-directed reasoning generalizes further than just behavioral heuristics.

A related phenomenon is that when you learn a skill, you usually first perform the skill through effortful and slow goal-directed reasoning, and then you learn heuristics that make it easier and faster.

Although this is not all you need for learning skills—you also often need feedback from reality in order to improve your model. E.g. when you started learning to drive, you perhaps didn’t even have a good model of how much a car accelerates when you press the gas pedal down 4cm. So you needed some experience to learn a good model, and then some more experience for your heuristics to guess well how far to press the gas pedal depending on how much you want to accelerate.

1.2.4 The deeper structure of optimization

The thinking loop is actually an intuitive description of a deeper structure. In the words of Eliezer Yudkowsky:

The phenomenon you call by names like "goals" or "agency" is one possible shadow of the deep structure of optimization - roughly, preimaging outcomes onto choices by reversing a complicated transformation.

Aka, the model is a complicated transformation from choices/plans to outcomes, and we want to find choices/plans that lead to desired outcomes. One common way to find such plans is by doing some good heuristic search like in the thinking loop, but in principle you could also imagine other ways to find good plans—e.g. in a simple domain where the model is a linear transformation one could just invert the matrix.
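As a toy illustration of "preimaging outcomes onto choices" in the linear case: if the model really were a fixed linear map (the matrix and target outcome below are made up for the example), finding a plan that achieves a desired outcome is just solving a linear system, with no search loop needed.

    import numpy as np

    # Made-up linear "model": outcome = A @ plan
    A = np.array([[2.0, 0.0],
                  [1.0, 3.0]])
    desired_outcome = np.array([4.0, 11.0])

    # Preimage the desired outcome onto a choice of plan by inverting the map.
    plan = np.linalg.solve(A, desired_outcome)
    print(plan)      # [2. 3.]
    print(A @ plan)  # [ 4. 11.] -- matches the desired outcome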

1.3 Model-Based RL

Hopefully the above gave you an intuition for why smart AIs will very likely reason in a way that is at least in some way reminiscent of the actor-model-critic loop, although it could be only in an implicit way.

Current LLMs don’t have a separately trained model and critic. But they are trained to think in goal-directed ways in their chain of thought, so their thinking does embody the thinking loop at least somewhat, although I expect future AIs to be significantly more goal-directed.[1]

But the special case where we indeed have separately trained modules for the actor, model, and critic is called “actor-critic model-based RL”. One example of actor-critic model-based RL is the series that includes AlphaGo, AlphaZero, MuZero, and EfficientZero.

In a sense, the actor mostly helps with efficiency. It’s trained to propose plans whose consequences will be evaluated as good by the critic, so if you want to see what the actor steers towards, you can look at which outcomes the critic likes. So for analyzing alignment, what matters is the model and the critic[2].

For the start of this sequence we will focus on the case where we do have a separately trained model and critic—not because I think AGI will be created this way (although it could be[3])—but because it’s a frame where alignment can be more easily analyzed.[4]

1.3.1 How the Model and Critic are Trained

Reward is a number that indicates how good/bad something that just happened was (the higher reward the better). Here are examples of possible reward functions:

  1. For an AI that solves math problems, you could have a reward function that gives reward every time the AI submits a valid proof (or disproof) for a list of conjectures we care about.
  2. For a trading AI, the changes in worth of the AI’s portfolio could be used as reward.
  3. We can also use humans deciding when to give reward as a reward function. Aka they can look when the AI did something that seems good/bad to them and then give positive/negative reward.

The model is trained to predict observations; the critic is trained to predict expected future reward given the state of the model.[5] However, the critic isn’t necessarily a very smart part of the AI, and it’s possible that it learns to predict simple correlates of reward, even when the model part of the AI could predict reward better.

For example, if I were to take heroin, that would trigger high reward in my brain, and I know that. However, since I’ve never taken heroin, a key critic (also called valence thought assessor) in my brain doesn’t yet predict high value for the plan of taking heroin. If the critic were a smart mind trying to (not just trained to) predict reward, I would have become addicted to heroin just from learning that it exists! But it’s just a primitive estimator that doesn’t understand the abstract thoughts of the model, and so it would only learn to assign high value to taking heroin after I tried heroin.
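As a minimal sketch of what "trained to predict expected future reward" can mean, here is a generic tabular TD(0) critic in Python; all names and numbers are illustrative rather than taken from any particular system. Note that, like the valence example above, it only updates its estimate for a state after reward is actually experienced following that state.

    from collections import defaultdict

    value = defaultdict(float)   # critic: state -> predicted future (discounted) reward
    alpha, gamma = 0.1, 0.99     # learning rate, time discount

    def critic_update(state, reward, next_state):
        """Nudge value[state] toward the observed reward plus the discounted next value."""
        td_target = reward + gamma * value[next_state]
        value[state] += alpha * (td_target - value[state])

    # Example transition: the agent went from state "s0" to "s1" and received reward 1.0
    critic_update("s0", 1.0, "s1")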

Anyway, very roughly speaking, the alignment problem is the problem of how to make the critic assign high value to states we like and low value to states we don’t like. More in upcoming posts.

1.4 Getting the behavior you want may be more difficult than you think

We would like to have an AI that can work on a difficult problem we give it (e.g. curing cancer), and that also shuts down when we ask it to. Suppose we solved the problem of how to get the goals we want into an AI; then it seems like it should be quite feasible to make an AI behave that way, right?

Well, it’s possible in principle to have an AI that behaves that way. If we knew the 10.000 actions an AI would need to take to cure cancer, we could program an AI that behaves as follows:

For each timestep t:

  • Did the operator ask me to shut down?
    • Yes -> shut down.
    • No -> do the t-th action in the list of actions for curing cancer.
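Spelled out as code, this hypothetical hard-coded agent would look something like the sketch below; the action list and the two callbacks are placeholders for things nobody can actually supply.

    # Hypothetical hard-coded agent: follow a pre-specified action list,
    # but stop the moment the operator asks for shutdown.
    def run_hardcoded_agent(actions_for_curing_cancer, operator_asked_shutdown, execute):
        for action in actions_for_curing_cancer:
            if operator_asked_shutdown():
                return "shut down"
            execute(action)
        return "done"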

The problem is that we do not have a list of 10,000 steps for curing cancer. To cure cancer, we need some goal-directed search towards curing cancer.

So can we make the AI care about both curing cancer and shutting down when asked, without it trying to make us shut it down or otherwise behaving in an undesirable way?

No, we currently can’t—it’s an unsolved problem. Nobody knows how to do that even if we could make an AI pursue any goals we give it. See Shutdown Buttons and Corrigibility for why.

Although there is a different approach to corrigibility that might do a bit better, which I’ll discuss later in this series.

The lesson here isn’t just that shutdownability is difficult, but that intuitive hopes for how an AI should behave may not actually be realistic possibilities for smart AIs.

1.4.1 Current AIs don’t provide good intuitions for this difficulty

Current AIs consist mostly of behavioral heuristics - their goal-directed reasoning is still relatively weak. So for now you can often just train your AI to behave the way you want and it sorta works.[6] But when the AI gets smarter it likely stops behaving the way you want, unless you did your alignment homework.

1.5 In the limit of capability, most values lead to human extinction

Initially, the outcomes a young AI considers may be rather local in scope, e.g. about how a user may respond to an answer the AI gives.

But as an AI gets much smarter and thereby able to strongly influence the future of the universe, the outcomes the AI has preferences over will be more like universe-trajectories[7].[8][9] This change is likely driven both by the AI imagining more detailed consequences of its actions, and by parts of the AI trying to rebind its values when the AI starts to model the world in a different way[10] and to resolve inconsistencies in the AI’s preferences.

In the limit of technology, sentient people having fun isn't the most efficient solution for basically any value function, except if you value sentient people having fun. A full-blown superintelligence could take over the world and create nanotechnology with which it disassembles the lithosphere[11] and turns it into von Neumann probes.

Viewed like this, the question isn’t so much “why would the AI kill us?” as “why would it decide to keep us alive?”. Three possible reasons:

  1. The AI terminally cares about us in some way. (Including e.g. if it mostly has alien values but has some kindness towards existing agents.[12])
  2. We’ve managed to trade with the AI in a way where it is bound by its commitments.
  3. (Acausal trade with distant superintelligences that pay for keeping us alive.)

I’m not going to explain option 3 here - that might take a while. I just listed it for completeness, and I don’t think it is strategically relevant for most people.

Option 2 looks unworkable, unless we manage to make the superintelligence robustly care about particular kinds of honor or fairness, which it seems unlikely to care about by default.[13]

One way to look at option 1 is like this: Most value functions over universe-trajectories don’t assign high value to configurations where sentient people are having fun (let alone those configurations being optimal), nor do most value functions imply kindness to agents nearby.

That’s not a strong argument for doom yet - value functions for our AIs aren’t sampled at random. We might be able to make the AI have the right values, and we haven’t discussed yet whether it’s easy or hard. Introducing you to the difficulties and approaches here is what this sequence is for!

1.6 Conclusion

Now you know why we expect smart AIs to have something like goals/wants/values.

Such AIs are sometimes called “optimizers” or “consequentialists”, and the insight that smart AIs will have this kind of structure is sometimes called “consequentialism”[14]. The thinking loop is actually just an important special case in the class of optimization processes, which also e.g. includes evolution and gradient descent.[15]

In the next post, we’ll look at an example model-based RL setup and what values an AI might learn there, and thereby learn about key difficulties in making an AI learn the right values.

Questions and Feedback are always welcome!

  1. Aside from thinking-loop structure in the chain of thought of the models, there is likely also lookahead within the forward pass of an LLM, where this information is then used within the same forward pass to decide which tokens to output. Given the standard transformer architecture, though, this lookahead->decision structure lacks some loopiness, so the full thinking-loop structure comes from also having chain-of-thought reasoning (unless the newest LLMs have some relevant changes in architecture). ↩︎

  2. Btw, the critic is sometimes also called “value function” or “reward model”. ↩︎

  3. Brains seem to have a separately learned model and critic (aka “learned value function”) that scores thoughts in the human brain. See Steven Byrnes’ Intro to Brain-Like AGI and Valence series for more. ↩︎

  4. Whether that makes alignment easier or harder than with LLMs is hard to say and I will return to this question later in this series. ↩︎

  5. Where future reward may be time-discounted, aka reward in the far future doesn’t count as fully toward the value score as reward in the near future. ↩︎

  6. Well only sorta, see e.g. the links in this quote from here: “Sydney Bing gaslit and threatened users. We still don’t know exactly why; we still don’t know exactly what was going through its head. Likewise for cases where AIs (in the wild) are overly sycophantic, seem to actively try to drive people mad, reportedly cheat and try to hide it, or persistently and repeatedly declare themselves Hitler. Likewise for cases in controlled and extreme environments where AIs fake alignment, engage in blackmail, resist shutdown, or try to kill their operators.” ↩︎

  7. By universe-trajectory I mean the time-expanded state of the universe, aka how good the history and the future are combined, aka like the history of the universe from the standpoint of the end of the universe. (It’s actually not going to be exactly universe-trajectories either, and instead something about how greater reality even beyond our universe looks, but that difference doesn’t matter for us in this series of posts.) ↩︎

  8. The claim here isn’t that the AI internally thinks about full universe-trajectories and scores how much it likes them, but that it will have a utility function over universe-trajectories which it tries to optimize, but may do so imperfectly because it is computationally bounded. This utility function can be rather indirect and doesn’t need to be written in pure math but can draw on the AI’s ontology for thinking about things. Indeed, some smart reflective humans already qualify here because they want something like their coherent extrapolated volition (CEV) to be fulfilled, which is a very indirect function over universe-trajectories (or rather greater reality). (Of course, these humans often still take actions other than what they expect is best according to their CEV, or you could see it as them pursuing CEV but in the presence of constraints of other drives within themselves.) This all sounds rather fancy, but ultimately it’s not that complex. It’s more like realizing that if you had god-like power to reshape the resources of a galaxy, you could reshape it into a state that seems nice to you, and then you update that it’s part of your preferences to fill galaxies with cool stuff. ↩︎

  9. There’s no claim here about the AI needing to value far-future universe states similarly much as very near-future states. You can totally have a function over universe-trajectories that is mostly about what happens in the soon future, although it may be less simple, and how the universe will look in the far future will mostly depend on the parts of the AI preferences that also are about the far future. ↩︎

  10. For instance, suppose the AI has a simple model of the human overseer’s mind with a “values” component, and it cares about whatever those values say. But then the AI learns a much more detailed psychological model of human minds, and there’s no very clear “values” node - there may be urges, reflectively endorsed desires, what the human thinks are his values vs what they will likely think in the future. Then there needs to be some procedure to rebind the original “values” concept to the new ontology so the AI can continue to care about it. ↩︎

  11. Aka the outer crust of the earth. ↩︎

  12. Although I will mostly focus on the case where the AI learns human values, so we fulfill humanity’s potential of spreading love and joy through the galaxies, rather than merely surviving. ↩︎

  13. Achieving this is also an alignment problem. To me this approach doesn’t look much easier than to make it care about human values, since the bottleneck is in targeting particular values of an AI at all in a way they get preserved as the AI becomes superintelligent. ↩︎

  14. Because there are close parallels to the ethical theory that’s also called “consequentialism”, which argues that the morality of an action should be judged based on the action’s outcomes. ↩︎

  15. Roughly speaking, other optimizers will still have some loop like the thinking loop, but instead of optimizing plans they may optimize genes or model weights; instead of using the model to predict outcomes, you can just run tests in the world and see the outcome; and instead of having complex actor heuristics, you can have very simple heuristics that systematically improve the thing that is being optimized. So for evolution we have: genotype distribution -> phenotype distribution --(which reproduce?)--> genotype distribution. For gradient descent: model weights -> predictions on data -> loss (aka how badly you predicted) -> compute gradient and update model weights to produce a lower loss. ↩︎



Discuss

Exceptionally Gifted Children

December 30, 2025 - 17:53
Published on December 30, 2025 6:28 AM GMT

I gave a talk on exceptionally gifted children at the Reproductive Frontiers Summit at Lighthaven this June.  I believe the subject matter is highly relevant to the experience of many rationalists (e.g. one of Scott's surveys has put the average IQ of his readers at 137, and although that's not as extreme as 160+, I think many of the observations generalize to the merely highly gifted).  The talk is on YouTube: 

I also adapted the talk into an article for the Center for Educational Progress.  It has now been published: https://www.educationprogress.org/p/exceptionally-gifted-children

I'd say the talk is more fun and more rationalist-focused, while the article is a bit more serious and meant for a wider audience.  But mostly just pick whichever format you prefer.

The central policy proposal is that schools should allow students to progress through each subject at whatever rate fits them, and the cheapest implementation is to let them take placement tests and move up or down grade levels as appropriate (so a child might be taking 3rd grade math, 5th grade English, 4th grade history, etc. at once).  I think this would benefit children of all ability levels, and have some systemic benefits as well; but obviously it makes the largest difference at the extremes.



Discuss
