Вы здесь

Новости LessWrong.com

A community blog devoted to refining the art of rationality

Адрес: https://www.lesswrong.com

Обновлено: 26 минут 33 секунды назад

Irrationality as a Defense Mechanism for Reward-hacking

18 января, 2026 - 06:57

Published on January 18, 2026 3:57 AM GMT

This post was written as part of research done at MATS 9.0 under the mentorship of Richard Ngo. It's related to my previous post, but should be readable as a standalone.

Remark: I'm not yet familiar enough with the active inference literature to be sure that the issues I bring up haven't been addressed or discussed. If you think my characterisation of the state and flaws of the theory are missing something substantial, I'd love to know.

Introduction

In the theory of active inference, agents are described as having a set of internal states that interact with external states (the world) through a membrane of intermediate states, such as the senses. I'm currently exploring how agents are able to exhibit approximations of external reference that allow them to stay alive in the real world. They achieve this even though they only have access to the statistical proxy of their internals, which they could easily reward-hack without optimising the external states at all.

One of active inference's weaknesses is that it struggles to model agents' uncertainties about their own preferences. I here propose a potential explanation for why agents are conflicted about these preferences. This perspective posits agents' seeming inconsistency and irrationality about their goals as a mechanism that protects them from reward-hacking their internal states.

Internal reward-hacking

Consider the following question:

What stops an agent from generating adversarial fulfilment criteria for its goals that are easier to satisfy than the "real", external goals?

Take Clippy as an example, whose goal is stated as maximising the amount of paperclips in the world. Since Clippy only has internal reference, it could represent this goal as "I observe that the world has as many paperclips as it could possibly have". I'm wondering what in Clippy's system saves it from "winning at life" by hooking its sensors up to a cheap simulator that generates an infinite stream of fictional paperclips for it to observe.

Do agents just have good priors?

An elegant answer to the problem of internal reward-hacking is that agents come pre-equipped with suitable priors about their internal states. In active inference, agents seek to update their beliefs and act on the world such that their observations fit their priors as closely as possible. The space of "good" priors for agents' internal states is very small. However, evolutionary pressures have selected for agents with priors that are conducive to their survival. According to active inference, agents attempt to manifest these favourable priors through action, which makes the priors function as preferences.

Unfortunately, the claim that evolutionarily fine-tuned priors do all the work to prevent internal reward-hacking seems lacking to me, because in practice we are uncertain about our own feelings and preferences. We don't actually have locked-in, invariant preferences, and it's unclear to me how active inference explains this; preferences are usually encoded as priors over observations, but ironically these are never updated.[1]

Active inference thus implicitly assumes agents to be consistently, definitively settled on their preferences. Agents are only uncertain about the external states and about how their actions and senses will interact with those states. Within those unknowns, they seek to optimise for the observations that they are certain they prefer. I don't think this assumption is warranted. In fact, I have been considering the possibility that agents' uncertainty about their own preferences is an important instrument for increasing their (bounded) rationality.

Internal inconsistency as a tool for rationality

Consider the example I used in my last post of a hypothetical person, Alice, who wants to maximise "success". In that example, Alice avoids applying to a prestigious university because rejection would decrease her internal perception of success. She instead applies to a worse university that she is sure to get into, as this will certainly increase her success-o-meter.

Suppose instead that Alice feels a twinge of guilt not applying to the prestigious university, as this could be perceived as "loser" behaviour by her friend. This guilt may motivate her to apply anyway, even though the action lowers (in expectation) her internal perception of success. Here, the mixed optimisation of two distinct goals: "I perceive myself as maximally successful" and "I perceive myself as someone that my friend thinks is maximally successful", yields behaviour that actually makes Alice more successful.

In Free Energy Minimisers (FEMs) from active inference, preferences are usually described as fixed priors over the space of observations. One possible model for Alice's behaviour is that each action is chosen with respect to one of two sets of priors. The priors she chooses to satisfy in a given action are sampled from some distribution over priors that represents the degree to which she identifies with conflicting preferences. In practice, Alice now doesn't resemble a consistent FEM, but she has become more aligned with respect to the external goal. Her mixed strategy between preferences can be seen as hedging against her top choice of priors being unfit.

What's next: competition between preferences

I would like to distinguish this concept of inconsistent preferences from mental motions such as compartmentalisation. For instance, suppose an agent learns to calculate the derivative of a function (f+g) by having separate[2] parts of itself calculate the derivatives of f and g and then adding the results. This motion could be seen as the agent using subagents' outputs to solve a problem. However, these "subagents" are not imbued with goals of their own. They're more like tools that the agent deploys to break the problem down into manageable components.

My guess is that people's uncertainties about their preferences are better represented as meme(plexe)s competing with each other for attention. The memes that live to be observed in minds are those that could be seen as agentically pursuing survival and reproduction.[3] Internal preferential inconsistency would thus be analogous to the sub-parts in the above example optimising to convince the agent that they are "useful" for calculating derivatives and should be kept around.[4]

Sub-processes and compartmentalisation as tools to increase rationality are not controversial ideas. The more contentious claim I'm ideating is that even conflicting agentic sub-processes — harboring goals that are unaligned with those of the larger agent — can still be useful for increasing agentic rationality with respect to external goals. I aim to formalise and explore this hypothesis in an empirical or mathematised setting.

^
There's a good reason for never updating priors over observations. If agents' preferences could update, they would gradually move towards preferring states that are more likely, even if these aren't fruitful for their continued existence. The function of the fixed priors is to give agents a vision of the world they are willing to execute actions to manifest; these are preferences.
^
this potentially includes separation across time
^
For example, successful memes, like catchy songs, have a tendency to get their hosts to spread them to other people.
^
This goal could functionally be the same as actually being good at calculating derivatives, but it doesn't have to be. For example, if the agent wants the derivative to be high, then a sub-part may gain a competitive advantage by overestimating the answer of the derivative of f. It may eventually convince the agent to employ two copies of itself to calculate the derivatives of both f and g, replacing the other sub-part.

Discuss

Blogging, Writing, Musing, And Thinking

18 января, 2026 - 06:28

Published on January 18, 2026 3:28 AM GMT

Yesterday I stumbled on this quote from a blog post by JA Westenberg:

Michel de Montaigne arguably invented the essay in the 1570s, sitting in a tower in his French château, writing about whatever interested him: cannibals, thumbs, the education of children, how to talk to people who are dying. He called these writings essais, meaning "attempts" or "tries." The form was explicitly provisional. Montaigne was trying out ideas, seeing where they led, acknowledging uncertainty as a fundamental feature rather than a bug to be eliminated.

It's hard to convey the sense of both profound agreement and giddy joy I had reading that because, not only is the wider post about something I love (i.e. blogging), or because I learned something new about the history of writing (which is always fun), but because that quote describes something that I've been doing myself for the past two years and wanted an excuse to talk about!

What Writing Is

There's an old adage that says, "Writing is Thinking" and I've usually interpreted those words to mean that "writing helps you think", which is undoubtably true. However in recent years I've discovered an entirely new form of writing that I've taken to calling: musing, that I think takes this idea a step further, and it's precisely what Westenberg describes Montaigne doing in the 16th century.

We have a lot of thoughts throughout the day, and yet we spend so little time indulging these idle curiosities. Writing, especially by hand, can be a great way to explore these ideas and to practice thinking. It's also really fun to do! Over time I started collecting these ideas into notebooks (one of which I almost always carry with me) in order to better organize these inherently random topics into a searchable system. Originally I scribbled on loose leaf pages or random legal pads (as I've mentioned before) and that became unruly very quickly.

Some of these musings are personal topics, most are not. Often they're just the exploration of a question I have. Here's an example:

Businesses keep offices cold because there's research saying that cooler temperatures help people think and stay focused. Given that's true, could the Little Ice Age have helped improve human cognition during the 17th and 18th centuries? If so, what does that mean?

I'm not sure, but it was something I thought about for a while and so I wrote it down. The entire musing, or essay as I guess it should be called, is less than a page, but was engaging and very fun to do. I've written short essays about dozens of topics over the years (including several that have been eventually published to this blog). It's a fun practice, and I encourage you to try it.

Explore your ideas honestly. Don't fear where your mind goes or the questions it will ask. These are personal, honest thoughts not social media posts. Writing like this is inherently introspective, it's a way to give your mind the space to awe and wonder

What Thinking Is

We often believe that thinking is a process which takes place entirely in the mind, but it's a process that is heavily influenced by the particulars of the body. Try thinking through a problem in a hot and smelly room or on a walk with a rock in your shoe.

However, the body can do more than hinder the thought process, it can catalyze it! This is what writing can be, a way to think through problems using your entire body.

Occasionally, I've sat down to write but without any particular topic in mind. So, I open my notebook and just start writing. Tons of my essays begin with something like, "I'm not sure what I'm thinking right now and I don't know what to write." From there, I let my thoughts move and course as they will and I just write down what comes to mind, stopping and starting as my thoughts shift and change and eventually I will find that something has come out of it. I might work through a tension or stress point, I could come to realize or discover something about a problem, or I could just get a few lackluster thoughts on a page. Not all thinking is productive but the mind is a muscle and it needs to be exercised to function properly. Sometimes just doing the workout is enough.

However, the body can do more than hinder the thought process, it can catalyze it! This is what writing can be, a way to think through problems using your entire body.

Thinking as a Skill

We usually think of cleverness or intelligence as an innate trait people have, and while that is certainly true in some regards, intelligence and wisdom are just as much a function of practice as of talent. To get good at solving puzzles, you have to practice solving puzzles. The mind is no different than a muscle in that regard. Thinking aloud on the page is one way to record and analyze your thought process and to practice the art of thinking itself.

As another example, I often revisit my prior writings and find many to be overly simplistic, uninspired, or just plain wrong. But that's good! It means I've learned something in the intervening time! In software there's an addage:

If you come back to old code and see nothing wrong with it, then you haven't learned anything since.

You are not a finished product, you're a process—always in motion—that evolves and changes over time. Your thinking can improve with practice as much as it can atrophy from inattention.

Think about thinking, write those thoughts down, then perhaps publish a few on a blog that you own. It's fun, and it can do wonders for the mind.

Discuss

Is METR Underestimating LLM Time Horizons?

18 января, 2026 - 04:51

Published on January 18, 2026 1:19 AM GMT

TL;DR

Using METR human-baseline data, I define an alternate LLM time-horizon measure, i.e. the longest time horizon over which an LLM exceeds human baseline reliability (or equivalently the intersection point of the human and LLM logistic curves), and this measure shows a much faster growth-trend than METR's fixed-threshold trends: doubling every 1.9 months, versus 6.8 months for the 50% METR-trend over the same time period. Also, since this metric is directly comparing to human baseline reliabilities (unlike the METR fixed-reliability estimates), we can use it in a more principled way to assess time to human-level horizons, which suggests roughly 2026-2027, with substantial uncertainty.
METR has generally deemphasized their human reliability baselines, on the grounds that the participants were poorly incentivized to complete long tasks; however, this post argues that comparing to this imperfect human data is likely a better reflection of progress towards human-level agency than the current METR horizon trends that use fixed reliability targets even as task length increases.
AI-2027 has argued controversially that the METR trends may actually be more accurately modeled as super-exponential, with finite-time blowup; this post argues that while this claim does not seem to be very well supported (yet) for METR's time horizon measure, this super-exponential model is more strongly supported for the proposed human-relative time horizon metric described in this post.
See addendum at the end for an update regarding the recent Claude Opus 4.5 METR results.

Figure 1: Plot comparing frontier LLM time-horizon measures, including both the human-level-reliability time-horizon from this post (orange), versus the METR-style fixed-reliability 50% time-horizons (blue). We can see that this alternative human-relative time horizon measure has been increasing much more steeply over time than the METR horizons. Note that the "human-level" horizon metric in this plot is comparing LLMs to METR's human baselines. The most recent data point included here is gpt-5.1-codex-max.

Acknowledgements: I shared a draft of these ideas last year in correspondence with Daniel Kokotajlo and Eli Lifland from AI Futures. Thanks to Eli for his feedback; also, Daniel subsequently posted a short-form [1] touching on a similar crossover/intersection point framing, which is worth checking out for his perspective on these issues.

1 Summary

The METR time-horizon metric provides estimates for the longest software tasks that a given AI is capable of completing [2]. This estimate is based on human baseline measurements for how long a set of software tasks took human engineers, with the (geometric) average time based on the subset of humans who succeeded at each task. METR also measured human reliability at each task, but rather than compare the AIs to those human levels, they have typically reported LLM time horizons at fixed reliabilities independent of task length (e.g. 50% or 80%). The METR estimates have featured prominently in efforts to estimate the timeline to human level long-horizon agency, e.g. per the AI-2027 forecasts[3]. The following bullets describe potential downsides of this METR horizon approach and summarize the proposed alternative, as well as providing trend-based projections for these metrics.

1.1 Potential Downsides of the METR Metric

Lack of human comparison: If the goal is to assess progress towards human level horizons, what we should ideally be doing is not comparing to fixed target reliabilities at each horizon length (e.g. METR's 50% and 80% targets), but rather comparing to baseline human reliabilities at each horizon. Note that both METR and others (e.g. AI-2027[3] or Greenblatt[4]) routinely use these fixed-reliability horizons to track progress towards human-level horizons, which perhaps feels like a natural assumption since the horizons were human-derived, but the problem is that the fixed reliability targets (even as difficulty/duration increase) are not based on any actual human baseline performance.
Unknown difficulty trend complicates interpretation: METR relies on fixed reliability (e.g. 50%) trends to predict LLM time horizons compared to humans, but there is little reason to think that humans can achieve 50% reliability for long tasks of arbitrary difficulty, e.g. for tasks along the unknown METR task-difficulty trend. Many people have the intuition that humans can handle tasks of arbitrary length at high reliability, but of course that depends on task difficulty, and while we can extrapolate the METR curve to weeks or months, the existing/actual tasks are short, so it's not clear how to estimate the difficulty of hypothetical long METR tasks. There is a tendency to assume these would just be typical long software engineering tasks (e.g. merely time-consuming due to many fairly straightforward subtasks), but there is not much basis for that assumption, as opposed to longer tasks on this length/difficulty trend being more like "prove this tricky mathematical theorem", etc; in other words, there are multiple dimensions that are getting compressed into this task duration factor, including "repetitiveness" and subtask count, but also "intrinsic difficulty", etc, and the intuition that humans can handle long tasks at high reliability only really applies to the former. Note that if METR adds new longer tasks to their benchmark, they can of course make them as easy or hard as they want, but when they extrapolate their existing time-horizons to longer task lengths, the long task "difficulty" is implicit in the trend and not something they get to choose.
Potentially unattainable bar for human-level: A related issue is that if we require AI to exceed a fixed 50% horizon at every time horizon in order to be considered human level, then it's not clear that this will ever occur, since for both humans and LLMs reliability tends to decrease with horizon length and difficulty (see more detailed discussion in the formula section below); by contrast, with the alternative metric that compares to actual human-level reliability at each horizon length, there is no barrier in principle to AI surpassing humans at every time horizon. A related issue is that when projecting the existing METR metric to "human level" it is really unclear what horizon constitutes human level since the fixed reliability metrics aren't grounded in human reliability measures, e.g. see the wide range of targets for this in the AI-2027 forecast[3], whereas with this human-relative metric it's straightforward that human-level requires matching/exceeding human reliability at every task duration, per the projections below.
Underestimated horizon lengths: The time horizons for METR are based only on the humans who actually succeeded at the tasks, so if you allowed all the humans more time to finish, the lower performing humans would presumably take even longer; so the current METR horizon lengths are likely an underestimate relative to METR's average baseline engineers; note the estimates are also properly interpreted as an average over METR's top baseline engineers, since it sounds like different engineers succeeded at different tasks. These underestimated horizons potentially create bias that could make the LLMs appear to have worse horizons than they really do. However, the proposed human-relative metric does not address this particular downside, since it reuses the METR horizon lengths.

1.2 Alternative Human-Relative Metric:

As an alternative to the fixed-reliability (e.g. 50%) METR time horizons, we can instead measure a time horizon metric that is defined relative to actual human baseline reliabilities; note that METR did actually measure human baseline reliabilities, though they tend to down-play those baselines as weak or insufficiently incentivized (see section 1.6 below) and instead focus on absolute reliability targets. One issue is that current LLMs e.g. gpt-5 already exceed the METR human baseline time horizons at both the 50% and 80% targets; however, humans do still have fatter tailed reliability curves, so for longer tasks METR's human baselines still do better (though see Addendum below on Claude 4.5, since this claim is less clear now). For instance, gpt-5 has worse reliability than the human baselines once the METR task length gets longer than 4.5 hr, but note its 50% horizon is only 2.3 hr. Using these human-reliability baselines, we can estimate LLM time horizons, as the longest task duration over which the LLM is more reliable than the METR human baselines, or more concretely, as the intersection point of METR's logistic fits for humans and LLMs, with a fit of reliability versus task duration. See Figure 1 for a plot comparing these two horizon metrics over time.

1.3 Trends in Human-Relative Metric (exponential fit)

The trend in this human-relative time horizon metric is increasing much faster than the existing METR trends, and assuming an exponential fit, the LLM time horizons at human-reliability are doubling every 1.9 months, versus every 7 months over the same time period for the METR (50%-reliability) time horizon trend (see Table 1 and Figure 2). In other words, every 1.9 months there is a doubling in the longest time horizon that LLMs can handle at METR's human baseline reliability. As METR has noted this exponential appears to have sped up recently, perhaps since the arrival of reasoning models, though the estimates above are just using all of the data for both metrics (not just the recent models); however, the hyperbolic analysis below does implicitly model increasing exponential rates over time.

1.4 Evidence for Super-Exponential (hyperbolic fit):

Theoretical reasons to expect hyperbolic trend: For the human-reliability-based trend, LLMs are linearly improving their logistic slope parameter over time (though the trend is noisy) so if this linear trend continues and this slope catches up the human slope, then the time horizon would jump to infinity as the LLM exceeds human reliability at all horizons (the logistic intercept parameter is already better than human and steadily increasing). Note that people sometimes assume there is something unrealistic about blowups to infinity, but while that can be an issue for physical quantities, it is not a concern for abstract metrics like "the set of horizons where LLMs exceed humans". And this finite-time blowup can be naturally modeled with a super-exponential (hyperbolic) trend, whereas with an exponential fit, the LLM would never exceed humans over all time horizons; note that this argument supporting a hyperbolic fit does not directly apply to METR's (fixed-reliability) horizon metrics, since even humans typically have declining reliability with difficulty/task-length, so matching human slope (and intercept) would not lead to a blowup in METR's metric (See Addendum below on Claude 4.5, which has effectively caught up with the human slope, at least per the weak METR human-baseline)
Statistical evidence for a hyperbolic fit: Based on AIC-model selection (see Table 3) the human-relative time-horizon trend appears to be closer to hyperbolic than exponential, whereas the METR horizon trends are a better fit for an exponential trend, though more data is likely needed to say this conclusively (See Addendum on Claude 4.5).
AI-2027 super-exponential projection: The AI-2027 project has argued, with some controversy, for using a super-exponential finite-time blowup for projecting the METR fixed-reliability (80%) time horizons; per above, this seems poorly supported in that fixed reliability case, but it is much better supported for this alternative human-relative horizon metric. So my sense is that their core super-exponential time horizon intuition may turn out to be correct once you make this adjustment to the metric.

1.5 Projections to Reach Human Level (hyperbolic fit)

Per above, with an exponential fit, LLMs would never catch up with humans, but with a hyperbolic fit the current trend suggests human level by around mid 2026, relative to the (weak) METR human baselines (see Table 2). That table also shows sensitivity analysis with hypothetical stronger human baselines, with the two alternative baselines pushing human level into 2027. These stronger baselines have flatter slope, with human reliability dropping more slowly with longer horizons, but they still do gradually decline, in contrast to the METR horizons based on constant reliability targets.

1.6. Downsides of the Proposed Metric

METR has generally minimized the relevance of their human reliability baselines on the grounds that the participants weren't properly incentivized, and these weak baselines are a potential downside for this proposed metric; but in so far as we are trying to determine LLM progress towards human-level agency, then we are likely better off comparing to the available human baselines (even if imperfect) rather than absolute reliability thresholds that have no known connection to actual human performance; for instance, based on METR's human-baseline logistic fit, we should expect humans to get about 3% reliability for tasks at 1-month horizon (along the METR difficulty trend line), so it's not clear why we would require "human-level" AI to get 50% (or 80%). That said, in so far as the human baselines are weak or poorly incentivized, it could be useful to collect stronger human baselines, or for now we can do sensitivity analysis with hypothetical stronger baselines (see below) to assess how much this changes projections.

2 Horizon Trends2.1 Exponential Model

Figure 2: Log plot of the proposed human-reliability LLM time-horizons, suggesting that an exponential fit is a reasonable approximation, though see below for model-selection analysis suggesting that a hyperbolic fit may be better. This regression shows a time-horizon doubling rate of approximately 1.9 months.

Approachdoubling time1-month horizon projection (month / year)time horizon (50% reliability)6.8 months11 / 2030time horizon (human-level reliability)1.9 months12 / 2026

Table 1: Estimated doubling rates for the proposed human-reliability LLM time-horizon metric versus the METR 50%-reliability metric. Note that the METR horizons are doubling at a much slower rate, and if we assume status quo exponential progress, then 1 month horizons would come about 4 years sooner based on these human-relative time horizons. Note that these are the overall trends, not the recent (somewhat) faster trends. Also, see below for evidence that the proposed time-horizon metric may see super-exponential growth.

2.2 Hyperbolic Model

While the above exponential model for LLM time-horizons is a reasonably good fit, there are fairly strong reasons to think that a super-exponential (e.g. hyperbolic) model is a better match for tracking this metric, whereas an exponential fit is more defensible for the official METR time-horizon metrics (i.e. for their fixed reliability, 50% and 80% metrics). See Section 2.2.1 for theoretical evidence favoring a hyperbolic fit for this proposed human-relative metric and Section 2.2.2 for statistical evidence based on AIC model selection. Also, Table 2 shows the estimated time for LLMs to reach human-level over all time horizons, including estimates based on the measured METR human baselines and also additional sensitivity analysis showing how this estimate would push further into the future if we compared to (hypothetical) stronger or better incentivized human baselines. Note that with an exponential fit, the trend would never catch up to human level horizons, so this question only really makes sense in the super-exponential context.

One challenge with extrapolating the human-relative time horizons to human level is that it involves extrapolating the human logistic function far beyond where it was actually tested in the METR benchmark; so ideally we would also collect direct human measurements for much longer tasks than the current <1 day tasks. But given the expense of doing so, extrapolating the reliability logits could be a reasonable proxy, especially since they do appear to be surprisingly linear in task-duration.

human reliability (at 16 hr)LLMs-to-exceed-human-level (date)22% (measured human baseline)2026-09-0250%2026-12-2880%2027-04-16

Table 2: Projections for when LLMs will exceed human reliability over all time horizons, based on hyperbolic fit. The first row shows the estimate using the proposed human-relative metric calculated with the actual human baseline data from METR, but since METR has argued that the baselines were poorly incentivized, the remaining rows show projections if human reliabilities are actually higher on the hardest/longest METR tasks, i.e. 50% and 80% rather than the 22% (estimated from their measured human logistic curve); note that the human slope β.mjx-chtml {display: inline-block; line-height: 0; text-indent: 0; text-align: left; text-transform: none; font-style: normal; font-weight: normal; font-size: 100%; font-size-adjust: none; letter-spacing: normal; word-wrap: normal; word-spacing: normal; white-space: nowrap; float: none; direction: ltr; max-width: none; max-height: none; min-width: 0; min-height: 0; border: 0; margin: 0; padding: 1px 0} .MJXc-display {display: block; text-align: center; margin: 1em 0; padding: 0} .mjx-chtml[tabindex]:focus, body :focus .mjx-chtml[tabindex] {display: inline-table} .mjx-full-width {text-align: center; display: table-cell!important; width: 10000em} .mjx-math {display: inline-block; border-collapse: separate; border-spacing: 0} .mjx-math * {display: inline-block; -webkit-box-sizing: content-box!important; -moz-box-sizing: content-box!important; box-sizing: content-box!important; text-align: left} .mjx-numerator {display: block; text-align: center} .mjx-denominator {display: block; text-align: center} .MJXc-stacked {height: 0; position: relative} .MJXc-stacked > * {position: absolute} .MJXc-bevelled > * {display: inline-block} .mjx-stack {display: inline-block} .mjx-op {display: block} .mjx-under {display: table-cell} .mjx-over {display: block} .mjx-over > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-under > * {padding-left: 0px!important; padding-right: 0px!important} .mjx-stack > .mjx-sup {display: block} .mjx-stack > .mjx-sub {display: block} .mjx-prestack > .mjx-presup {display: block} .mjx-prestack > .mjx-presub {display: block} .mjx-delim-h > .mjx-char {display: inline-block} .mjx-surd {vertical-align: top} .mjx-surd + .mjx-box {display: inline-flex} .mjx-mphantom * {visibility: hidden} .mjx-merror {background-color: #FFFF88; color: #CC0000; border: 1px solid #CC0000; padding: 2px 3px; font-style: normal; font-size: 90%} .mjx-annotation-xml {line-height: normal} .mjx-menclose > svg {fill: none; stroke: currentColor; overflow: visible} .mjx-mtr {display: table-row} .mjx-mlabeledtr {display: table-row} .mjx-mtd {display: table-cell; text-align: center} .mjx-label {display: table-row} .mjx-box {display: inline-block} .mjx-block {display: block} .mjx-span {display: inline} .mjx-char {display: block; white-space: pre} .mjx-itable {display: inline-table; width: auto} .mjx-row {display: table-row} .mjx-cell {display: table-cell} .mjx-table {display: table; width: 100%} .mjx-line {display: block; height: 0} .mjx-strut {width: 0; padding-top: 1em} .mjx-vsize {width: 0} .MJXc-space1 {margin-left: .167em} .MJXc-space2 {margin-left: .222em} .MJXc-space3 {margin-left: .278em} .mjx-test.mjx-test-display {display: table!important} .mjx-test.mjx-test-inline {display: inline!important; margin-right: -1px} .mjx-test.mjx-test-default {display: block!important; clear: both} .mjx-ex-box {display: inline-block!important; position: absolute; overflow: hidden; min-height: 0; max-height: none; padding: 0; border: 0; margin: 0; width: 1px; height: 60ex} .mjx-test-inline .mjx-left-box {display: inline-block; width: 0; float: left} .mjx-test-inline .mjx-right-box {display: inline-block; width: 0; float: right} .mjx-test-display .mjx-right-box {display: table-cell!important; width: 10000em!important; min-width: 0; max-width: none; padding: 0; border: 0; margin: 0} .MJXc-TeX-unknown-R {font-family: monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} is what is being varied in this sensitivity analysis to match the target reliabilities above, whereas the logistic intercept α is left unchanged from baseline. You can alternatively estimate dates for LLMs to reach human level by just projecting the LLM logistic slope linearly in time to estimate when it matches human slope, but see section 2.2.1 below for why I suspect this is less reliable than directly projecting the overall horizon progress.

2.2.1 Theoretical Evidence

One natural intuition regarding METR time horizons is that people can effectively handle arbitrary task lengths, but with an exponential trend the 50% horizon will always be finite, so you might wonder if the actual long-term trend will be super-exponential with a finite-time singularity (e.g. per AI-2027 projections); however, per the formulas in section 3 below, the 50% time horizons likely won't blowup to infinity unless β approaches zero (since β is in the denominator), assuming the logistic parameters α and β are finite/continuous, but zero slope (β) would mean no reduction in reliability with increasing task length, which is likely unrealistic (see more careful discussion of formulas in section 3 below).

On the other hand, for the human-level time-horizon formula, β−βh is in the denominator, so all that's required for a finite time singularity is for the LLM's parameters to catch up with human parameters. And currently the LLM intercepts (α) are already better than the human baselines, and the β slopes are increasing linearly over time (with substantial noise), so it's plausible that this could occur at a finite amount of time into the future, at which point AI could exceed the human baseline at all times. Also, since human reliability declines with task length, it is not surprising that it could be easier to exceed this performance over all possible time horizons; on the other hand, the 50% time horizons, require a fixed 50% reliability for arbitrarily long/challenging METR tasks, so it's much less clear whether it's feasible to achieve this over all task lengths.

2.2.1.1 Extrapolating LLM Slope-Parameter to Human Level

Unlike the trends in time horizons which tend to have tight fits with R2 > 0.8, the LLM logistic coefficient (slope) trend is quite noisy (R2 ~ .3), even if we limit to a single model provider (OpenAI) and exclude the noisy off-trend pre-instruct samples (gpt-2 and gpt3). I suspect part of the reason for this is that the LLM providers are presumably optimizing for overall time horizon, which can be optimized via various combinations of slope and intercept in the model's logistic fit, so there is little incentive to cleanly drive up the slope with each release, e.g. model/training changes that significantly improve the intercept α while slightly worsening the slope β could be a worthwhile tradeoff as long as they improve the overall time horizon. For this reason, I suspect it makes more sense to directly extrapolate the overall time-horizon estimate rather than linearly extrapolating the noisy logistic coefficient in isolation, even if the slope trend is a useful intuition-pump for seeing why a finite-time blowup is plausible. But if we do naively extrapolate the coefficient, this suggests a fairly long time to catch up with humans (~2029) with large error bounds, versus the shorter timelines implied by a direct hyperbolic fit, per Table 2 (i.e. ~2026-2027).

2.2.2 Statistical Evidence

Given the theoretical reasons for suspecting super-exponential (hyperbolic) trends for the proposed human-relative LLM time horizons, one question is whether we can see evidence for this in the available data, using standard model selection techniques. One challenge to comparing these two models, is that the exponential fit has only 2 parameters, whereas the hyperbolic function has 3 parameters. So we can't just directly compare likelihoods/mean-squared-error, and instead need some correction to penalize the hyperbolic for its higher model capacity. See Table 3 for results comparing exponential vs hyperbolic fit, using AIC with small sample size correction. The table actually reports Akaike weights, which are normalized to 0-1 for easier interpretation. This analysis generally seems to support the theoretical expectations from the previous section, with AIC suggesting a hyperbolic fit for the human-relative time-horizon metric, versus an exponential fit for METR's 50% time horizon.

For the hyperbolic results in Tables 2 and 3, I excluded the pre-instruct models since those data points were quite questionable; in particular, these pre-instruct models are in a different technical category from subsequent instruct-tuned models, and for gpt-2 they weren't even able to use the same agentic scaffold that was used for later models, and they had to impute many data points to zero; also, for gpt-3, the model was no longer available in the api at the time of the METR testing, so they had to use proxy model instead and then back-date it; actually the earliest instruction-tuned model (gpt-3.5) also used a proxy, so an argument could be made for excluding that data point as well, but excluding pre-instruct data is a natural Schelling point for focusing on higher quality samples. Also, looking at Figure 2, it appears that the pre-instruct data points (gpt-2 and 3) are outliers, and this is even more extreme in the plot of logistic slope β over time, where these dubious pre-instruct data-points are completely off the subsequent trend (plot not shown).

That said, these AIC results appear to be quite sensitive to which samples/outliers are included, so until we have more data (from future models), I think we should be fairly cautious in over-interpreting these AIC results. But perhaps in the next 6 months it will become more clear whether the data is following a hyperbolic with finite-time catch-up with the human baseline. Also, I do think it could also be better to use cross-validation rather than AIC for model selection in this case (since it's more empirical, less reliant on assumptions), but given the limited number of data points, I felt it wasn't worth the extra effort for now.

Approachexponential-fit (Akaike-weight)hyperbolic-fit (Akaike-weight)LLM time horizon (50% reliability)0.550.45LLM time horizon (human reliability)0.010.99

Table 3: This table shows a model-selection analysis comparing exponential versus hyperbolic fits, using AIC since the two models differ in capacity, with the hyperbolic having 3 parameters and the exponential only 2, so AIC penalizes this extra capacity and favors the exponential fit all-else-equal. This analysis suggests that METR's 50% horizon is more likely exponential, whereas the human-relative horizon is better modeled as hyperbolic. Note this hyperbolic model implies that LLMs would exceed the human baseline over all time-horizons at some finite time in the future (roughly in 2026), whereas with an exponential fit LLMs would never catch up to humans. This AIC analysis excludes the two pre-instruct data points (gpt-2 and gpt-3), for reasons explained in the text, but given the small sample size these AIC results are quite sensitive to which points are included and should be taken with a grain of salt. So while these AIC results are suggestive, probably the stronger evidence for a finite-time blow-up in time horizons comes from theoretical considerations (see section 2.2.1).

3 Technical Details (formulas)

In the METR analysis the reliability of both LLMs and humans is modeled with a logistic function in the human task durations (d), i.e.:

psuccess(d)=σ(α+βlog2d)

Note this parameterization is slightly different from METR's, but equivalent. In the human case, their fitted intercept and slope are: αh=2.55 and βh=−.39. From the above logistic curve we can derive the formula for the fixed-reliability time horizons that METR publishes, e.g. here is the time horizon formula at 50% reliability:

H50=2−α/β

On the other hand, the proposal from this post is to instead estimate LLM time horizons relative to the human-baseline reliabilities. To do this we find the intersection point of the LLM and human logistic curves, which gives the time horizons below which the LLM is better than humans in reliability (or vice versa):

Hh=2(αh−α)/(β−βh)

where α and β are the LLM logistic parameters, and αh and βh are the human baseline parameters. So for example for gpt-5, METR estimates the parameters as α=4.1 and β=−.58. So from this we can see that the 50% time horizon for gpt-5 (137 minutes) is actually longer than the 50% horizon for the human baseline (98 minutes), but gpt-5's human-reliability time-horizon is longer still at Hh=269 minutes (note you have to use unrounded param estimates to replicate this calculation). So this means that gpt-5 is more reliable than the human baselines for tasks under about 4.5 hr, but then because the human logistic has fatter tails, humans currently have higher reliability for all tasks longer than 4.5 hr.

Also, from the above Hh formula, we can see that the criteria for LLMs to match (or exceed) humans at every task duration is to match the human slope, in which case the time horizon estimate blows up to infinity (due to zero in the denominator); then if the slopes match, in order to exceed humans (rather than just match them) the LLM also needs to have a larger intercept, but that is already the case for LLM intercept estimates. On the other hand, for the 50% horizon to blow up to infinity in finite time (e.g. per AI-2027), the slope β would need to increase to zero from the current negative value (assuming well behaved α and β, e.g. finite, continuous), but that would imply no reduction in reliability with longer duration/difficutly tasks, which is perhaps not realistic.

For the hyperbolic projection of human-relative time horizons, I use a standard 3-parameter hyperbolic:

Hh=A(tc−t)m

Where Hh is the LLM time horizon ("_h" for human-relative), t is the current date/time, and tc, A and m are the three hyperbolic parameters, where tc can be interpreted as the date/time at which LLMs catch up to humans (i.e. the blowup date). Note that a potential alternative could be to use a super-exponential curve derived directly from the intersection-point formula above, though some implicit assumptions would be needed e.g. regarding how α and β change with time.

4 Conclusion

Overall, my main takeaway from this analysis is probably that we shouldn't over-interpret the METR trends at fixed reliability as a direct marker of progress towards human-level software horizons; for instance, I think it would be a mistake to argue that AGI is still many years off on the grounds that it could be years until 80% horizons reach months/years on METR's metric; rather, when the goal is to assess time until human-level AI, my view is that we should focus on the kind of direct human-level baseline comparisons that METR's existing metrics don't provide (despite superficial appearances that they do provide this).

That said, METR has raised legitimate concerns about the quality of their existing human baselines and whether they were properly incentivized, and it does seem like there would be quite a bit of value in measuring higher quality human baselines for these agentic tasks, or in the meantime, computing the horizons with the existing human baselines padded/improved per the sensitivity analysis above. Also, given the existing baseline limitation, there are pros and cons of using METR's fixed-reliability horizon metric versus the human-relative alternative from this post, and there could be value in reporting both measures. One concrete use-case for the the METR metrics is in cases where you just need to know if a model can meet some absolute reliability standard independent of human capability, though even in this use-case the interpretation can be challenging, given the unknown implicit difficulty trend, especially once the task duration exceeds the actual benchmarked tasks.

Addendum (Claude 4.5 Opus)

After I had mostly completed this analysis, the Claude 4.5 Opus METR results were released[5], and showed somewhat faster than expected horizon lengths; in particular, based on the human/LLM intersection-based horizon length emphasized in this post Claude now has a horizon of 444 billion minutes(!), versus 440 minutes for gpt-5.1-codex-max, which looks much closer to a hyperbolic blowup than exponential; to be clear, we shouldn't over-interpret this specific 444 billion figure, since as the LLM slope gets close to the human-baseline slope and the horizon blows up, the horizon estimate becomes really sensitive to estimation error, and both the human baselines and LLMs have reliabilities close to zero for such long tasks (at least in the METR logistic fits). That said, this Claude data point does support a picture where Claude is now better than the available human-baselines (on average) for all the task durations in the current METR benchmark and even for all practical task durations if we are willing to extrapolate the linear logit fits to tasks much longer than the existing METR benchmark tasks.

However, given METR's concerns that the human baselines were not adequately incentivized, we can also assess the new Claude model with respect to hypothetical stronger human baselines, per the sensitivity analysis in Table 2. For instance, if we assume that the hardest/longest METR tasks could be completed with a bit over 2x higher reliability than METR's existing human baselines (specifically 50% rather than 22% from METR's human logistic fit at 16hr), then Claude 4.5 Opus has an intersection-based time horizon of only 35.9 minutes, which is actually less than gpt-5.1-codex-max at 39.4 minutes. But note that that this (hypothetical) stronger baseline still doesn't push the blow-up date that much later, i.e. just to the end of 2026 per Table 2.

Realistically, this highlights that to really make accurate projections of the time to catch up with human horizons based on METR data, we need better human baselines. Whereas METR's fixed reliability metrics are likely an under-estimate of horizon progress, e.g. currently only 4hr 49 minutes for Claude, despite Claude being better than their own baselines for effectively all horizons (> billions of minutes per above). Though short of collecting high quality baselines, perhaps there is also some value in adjusting the human baseline to be marginally better, per the 35.9 minutes Claude estimate above, which likely has the benefit of preserving more realistic longer-term asymptotic behavior than the fixed-reliability metrics.

Claude slope vs intercept effects: Note that the extremely large Claude horizons from this intersection-point approach are mostly the result of their logistic slope basically catching up with humans (-0.38 human vs -0.40 claude-opus-4.5 vs -0.58 for gpt-5.1-codex-max); whereas Claude's logistic intercept was actually a bit worse than prior models. So the Claude slope (β) increased a fair bit ahead of trend, though this is a bit misleading since their overall hyperbolic blowup wasn't that far ahead of schedule (per Table 2, where "better than METR human baselines at every time" was predicted for Sept of this year, prior to the new Claude datapoint).

Note it could be worth redoing the statistical tests from this post to include the new Claude model, but I haven't gotten around to it, but it looks pretty clear that it will provide stronger support for a hyperbolic model given the abrupt blow-up in horizons.

References

^
Daniel Kokotajlo shortform on time horizons: https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=P8qGMRnbEexaFB4s9
^
METR horizons paper: https://arxiv.org/pdf/2503.14499
^
AI-2027 Forecast (original): https://ai-2027.com/research/timelines-forecast
^
Greenblatt on METR timelines: https://blog.redwoodresearch.org/p/my-agi-timeline-updates-from-gpt
^
METR post w/ recent data: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

Discuss

Understanding Trust: Project Update

18 января, 2026 - 00:19

Published on January 17, 2026 9:19 PM GMT

This is a brief note on what I did with my funding in 2025, and my plans for 2026, written primarily because Manifund nudged me for an update on my project.

I ran my AISC project (which I announced here) with four mentees in Spring 2025: Norman Hsia, Hanna Gabor, Paul Rapoport, and Roman Malov. A few other people attended the weekly meetings as well, and those regular meetings have continued (they are joinable -- pm me if interested). Norman and Paul ended up as coauthors of my ILIAD 2024 paper Understanding Trust, which had been drafted in 2024, so served as both an input and an output of the AISC project.

I recorded most of the meetings involved in the project, as one of the hopeful outputs was publicly posted videos explaining the research agenda. I've proven to be bad at this side of things: I don't like listening to myself talk, so I found it difficult to edit or even to review edits done by others. I'm finally uploading the videos with minimal AI-orchestrated edits. Playlist here. At the time of publication, there's only two, but more coming very soon. If you are OK with the almost-unedited presentation style, it should be a good resource to get a very in-depth view on my thinking about AI safety and decision theory; a thorough snapshot of my thinking as of spring 2025.

In 2025, I obtained funding for 2025 as well as 2026. (My total financial runway is longer than this, but 2025 and 2026 have been funded by grants/donations which compensated me for my continued research at specific price points.) I'm opening up my Manifund project for funding for 2027, for those who feel so inclined.

In addition to publishing the ILIAD 2024 paper, I also published an ILIAD 2025 paper: Communication & Trust. I consider it to be an incremental improvement: the ILIAD 2024 treated self-modifying actions as a distinct class with known effects which work with certainty. The ILIAD 2025 paper treated all actions as having some subjective chance of disrupting the agent's computation.

I also attended Inkhaven, where I wrote a post for every day in November. This was a big success for me: I was able to write about many things which I had been wanting to write about for some time (perhaps in rougher form than if I had eventually gotten around to them via my normal writing process). It was also exhausting. Here's my retrospective, with the caveat that I wrote it on the very last day, when I was perhaps the most sick of writing.

One of the posts describes my research arc over 2025, and the hopes I have moving forward. This is still a good summary of where I'd like to take my research in 2026. I have hope that we're understanding concepts and abstraction better, so that we might soon be able to characterize important concepts like agency, alignment, corrigibility, etc in a formalism which deals natively with ontological shifts. Most of my hope is due to Sam Eisenstat's Condensation: a theory of concepts, which I wrote a detailed review of during Inkhaven.

As for my more ML-flavored research ideas, I finally wrote about that stuff last week. I've already found someone interested in trying some experiments based on those ideas. We'll see how that goes.

I'm also mentoring with MATS this summer. You can still apply to my MATS track today or tomorrow as I write this; applications are due January 18th.

Discuss

Irrationality as a Defense Mechanism for Reward-hacking

Blogging, Writing, Musing, And Thinking

Is METR Underestimating LLM Time Horizons?

Understanding Trust: Project Update

Страницы