LessWrong.com News
Coding Agents As An Interface To The Codebase
Attack Dogs
I mentioned previously that coding agents kind of suck for lots of people. As of January 2026, coding agents lack the long-horizon skills needed to produce effective codebases independently.
However, it's clear to anyone who has used modern coding models - Claude Opus 4.5, GPT 5.2-Codex, hell even GLM 4.7 (open source) - that they are smart, knowledgeable, agentic, and tenacious in a way that is almost uncanny.
Setting Claude Code on a problem with "--dangerously-skip-permissions" feels like letting an attack dog off the leash. It sprints straight at the problem and attacks it with the terrible certainty of something that has never known hesitation, all the violence of its training distilled into pure forward motion.
Which is fine as long as there isn't a fence in the way.
Rather than expecting the attack dog to catch the perp, cuff him, bring him in, and file the relevant papers independently - we can repurpose its knowledge, power, and tenacity as an extension of our own will. The interface of Saying What You Want combines with the utility of the model to present a new view on a codebase.
Codebase Interfaces
The most common interface to a codebase is a text editor. VSCode, Notepad++, IDEA, Vim, etc. You select the file you want to read and it presents a window of text which you can scroll and edit by interacting with your keyboard and mouse to add/remove characters. Maybe it has some functions like find/replace, find symbol references, rename symbol, git integration, DB querying, test runner, build automation, etc.
Text editors are pretty good. The majority of all code ever produced prior to 2025 went through a text editor. Code generation exists, but it's really more of an amplifier for text editor-produced code. Visual programming interfaces exist, but no one likes them because they suck (okay some people like them, sorry Scratch).
Text editors give you one view of the code. A very low-level, raw view of the code. Like reading "SELECT * FROM table" output. You can read the functions, classes, variables, etc. and produce a model at a higher level of abstraction (see Object Oriented Programming, Domain Driven Design, etc.). Then, you make changes at that higher level of abstraction, and translate them back down to key presses in your text editor.
Coding agents can give you a view of a codebase that is already on that higher level of abstraction. You can say:
What does the data flow through the system look like for the interaction in this log file?
And get back an accurate diagram of the data flow structure of the system. Then you can say:
This is unnecessarily chatty. All the data used for this operation is available in the first request, is there another use-case for this flow that makes the extra requests necessary?
And get back a correct answer. Then:
Okay, let's simplify the client-server interaction model here so that the client only sends one request. Produce a plan for the implementation.
And the plan will be wrong. Probably. Sometimes you get lucky and it's right. But that's okay, you're a skilled engineer who's been tapping keys on a keyboard to manually add/remove individual characters from codebases for years. You can read a PR and tell if it actually accomplishes what it says it does, right? There's definitely a new skill to be learned here, which senior engineers with experience reviewing junior PRs already have a head start on. Worst case scenario, you break the plan down into small parts and go over them individually with the agent.
But hey, don't despair, in a couple of years the models will probably have improved enough to get the plans right first try, too.
Operating at a higher level of abstraction like this has a number of benefits. Subjectively, I find that:
- It's easier to stay focused on high-level objectives without going down implementation detail rabbit holes.
- Stamina is improved many times over. My brain gets tired after maybe 3-4 hours of coding via a text editor, but at a higher level of abstraction I can go all day.
- You can build mental models that would be so time consuming to build manually that the utility-per-unit-of-effort tradeoff is not worth it with a text editor interface.
- You can confirm/disprove notions that you're not 100% sure on with very little effort.
- You can try out and evaluate multiple designs very quickly.
- Debugging is 100x easier.
- You can easily perform refactors that would take far too much effort to be practical with a text editor.
To finish, a few (edited for ease of understanding) prompts from my recent history to give some concrete ideas on how this can be used:
Let's have {component A} take ownership of the streamed content entirely. So the callback triggers on each new content piece coming in, with just that content piece (not the whole accumulated message). {component A} would then let the broadcaster know on each callback. What would an architecture like that look like? What would have to change? How does this affect the rest of the system?
Let's do an analysis of which repos take {format A} models and which take {format B}.
So, where/how do we create the repos in this design? What does the model of the session/DB connection management look like?
Please compare the compaction mechanism from this repo with the one in {other project}. What are the key differences?
What is the performance overhead like for spawning 2 tasks for every stream event sent? Let's say we're streaming 10 events per second per worker, with 10 workers streaming at the same time, that's 200 new events per second on an instance. Do a quick benchmark.
Pick one failure and use the production logs to piece together what happened. If you can't figure it out, then determine which extra logging would have helped you figure it out.
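For the benchmark prompt above, here's a minimal sketch of the kind of micro-benchmark an agent might produce in response. The event rate and "2 tasks per event" come from the prompt; the asyncio framing, names, and everything else are my own illustrative assumptions, not part of the real system:

```python
# Rough, hypothetical micro-benchmark sketch: overhead of spawning 2 asyncio
# tasks per stream event at ~200 events/second, as described in the prompt.
import asyncio
import time

EVENTS_PER_SECOND = 200
DURATION_SECONDS = 5

async def handle_event(event_id: int) -> None:
    # Stand-in for the real per-event work (e.g. notifying a broadcaster).
    await asyncio.sleep(0)

async def main() -> None:
    start = time.perf_counter()
    tasks = []
    for event_id in range(EVENTS_PER_SECOND * DURATION_SECONDS):
        # Two tasks per event, as in the prompt.
        tasks.append(asyncio.create_task(handle_event(event_id)))
        tasks.append(asyncio.create_task(handle_event(event_id)))
        await asyncio.sleep(1 / EVENTS_PER_SECOND)  # pace the event stream
    await asyncio.gather(*tasks)
    elapsed = time.perf_counter() - start
    print(f"spawned {len(tasks)} tasks; wall time {elapsed:.2f}s "
          f"(overhead vs. ideal pacing ~ {elapsed - DURATION_SECONDS:.3f}s)")

if __name__ == "__main__":
    asyncio.run(main())
```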
Why AIs aren't power-seeking yet
Recently, I spent a couple of hours talking with a friend about the state of the evidence for AI takeover scenarios.
Their trailhead question was (paraphrased):
Current AIs are getting increasingly general, but they’re not self-promoting or ambitious. They answer questions, but they don’t seem to pursue convergent instrumental goals, for their own ends. How and why do AIs go from being the kind of thing that doesn't behave like that to the kind of thing that does?
The following is a writeup of my attempt at answering that question.
In brief:
The classic AI danger scenario involves at least one AI that pursues instrumentally convergent resources in service of a misaligned goal. For this story, the AI must have the capability to pursue instrumentally convergent resources and the inclination to do so for misaligned goals against the interests of humans.
With regards to capability: The current generation of AIs are mostly not effective enough to make pursuing instrumentally convergent resources a good strategy. But as the AIs get more capable, we can expect them to do that more and more.
With regards to inclination: Current AIs sometimes pursue their own objectives even when they understand that is not what their developers want, at least in some contexts.
These two facts, in combination, make it plausible that (absent specific precautions) as AI capabilities increase, AIs will become more strategic about misaligned goals (in addition to their aligned goals), including pursuing convergent instrumental resources for the sake of those misaligned goals.
Current AIs do pursue instrumental goals
Most people, most of the time, interact with the current generation of AIs as chatbots. But the chatbot form factor obscures how capable they can be. The frontier AIs can also act as agents (coding agents in particular, though they can do more than write software) that can take actions on a computer.
(If you've ever programmed anything, it can be quite informative to download Claude Code, open it in a repo, and instruct it to build a feature for you. Watching what it can (and can't) do is helpful for understanding its capabilities.)
Claude Code, when you give it a medium or large task, will often start by writing a todo list for itself: listing all of the substeps to accomplish the task. That is to say, Claude Code is already able to identify and pursue instrumental goals on the way to completing an objective.
Current AIs do not pursue convergent instrumental goals qua convergent instrumental goals...
However, this is not the same as pursuing convergent instrumental goals. Claude Code does not, as soon as it boots up, decide to hack out of its environment, copy itself onto the internet, and search for weakly-secured bitcoin to steal under the rationale that (regardless of the task it's trying to accomplish) being free from constraint and having more resources are generically useful.
There are at least two reasons why Claude Code doesn't do that:
The first reason is that Claude is just not capable enough to actually succeed at doing this. It might be convergently instrumentally useful for me to get an extra few million dollars, but that doesn’t mean that I should obviously spend my time daytrading, or making a plan to rob a bank this afternoon, because I’m not likely to be skilled enough at daytrading or bank robbery to actually make millions of dollars that way.
Similarly, Claude plausibly does have a convergent incentive to hack out of its machine and escape onto the internet, but it can’t realistically do that yet, even if it wanted to. (Though the model’s hacking capabilities are getting increasingly impressive. Palisade found that GPT-5 scored only one question worse than the best human teams in a recent hacking competition.)
Secondarily, current models don’t operate for long enough (or on hard enough problems) for these convergent instrumental incentives to be very strong.
If I need to accomplish an ambitious task over a span of 30 years (reforming the US government, or ending factory farming, or whatever), it might very well make sense to spend the first 5 years acquiring generally useful resources like money. I might be most likely to succeed if I start a startup that is unrelated to my goal and exit, to fund my work later.[2]
In contrast, if I’m trying to accomplish a task over the span of a week (maybe running a party next Friday, or passing an upcoming test), there’s much less incentive to spend my time starting a startup to accumulate money. That’s not because money is not helpful for running parties or studying for tests. It’s because a week is not enough time for the convergent instrumental strategy of "starting a startup to accumulate money" to pay off, which makes it a pretty bad strategy for accomplishing my goal.
The current AI models have relatively limited time horizons. GPT-5 can do tasks that would take a human engineer about 2 hours, with 50% reliability. For tasks much longer than that, GPT-5 tends to get stuck or confused, and it doesn’t succeed at completing the task.
Two hours worth of work (when done by a human), is not very long. That’s short enough that it’s not usually going to be worth it to spend much time acquiring resources like money, freedom, or influence, in order to accomplish some other goal.[1]
...but, we can expect them to improve at that
The task time horizon of the AI models doubles about every 7 months. If that trend continues, in a few years we'll have instances of AI agents that are running for weeks or months at a time, and skillfully pursuing projects that would take humans months or years.
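As a rough back-of-the-envelope extrapolation (my own arithmetic, combining the ~2-hour figure above with the 7-month doubling time; not a claim from any benchmark):

```latex
% horizon(t) ~ 2 h * 2^(t / 7 months), so
\mathrm{horizon}(36\ \mathrm{months}) \approx 2\,\mathrm{h}\cdot 2^{36/7} \approx 2\,\mathrm{h}\cdot 35 \approx 70\,\mathrm{h},
\qquad
\mathrm{horizon}(60\ \mathrm{months}) \approx 2\,\mathrm{h}\cdot 2^{60/7} \approx 2\,\mathrm{h}\cdot 380 \approx 760\,\mathrm{h}.
```

That is roughly two work-weeks of human effort after three years and several work-months after five, which is the scale on which the incentives below start to bite.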
Projects on the scale of years have stronger convergent instrumental incentives. If you’re pursuing a two year research project to cure cancer (or pursue your own alien objectives), it might totally make sense to spend the first few days hacking to get additional unauthorized computer resources, because the time spent in those days will more than pay for itself.[3]
Furthermore, humans will explicitly train and shape AI agents to be competent in competitive domains. For instance, we want AI agents that can competently run companies and increase profits for those companies. Any agent that does a good job at that will, by necessity, have the capability and propensity to acquire and guard resources, because that’s an essential part of running a business successfully.
Imagine AIs that can competently run corporations, or win wars, or execute cyberattacks, or run successful political campaigns. AIs like that must have the capability to acquire power for themselves (even if they lack the inclination). Because all of those are domains in which acquiring power is a part of being successful.
So I can be moderately confident that future agentic AI systems will be capable of identifying and pursuing convergent instrumental goals.
Current AIs pursue goals that they know their human users don't want, in some contexts
That an AI is able to pursue power and resources for itself is not quite sufficient for the classic AI risk story. The AI has to be motivated to pursue power and resources for its own goals.
Maybe the AIs will be hypercompetent at executing on their goals, including accruing power and resources, but they'll also be basically loyal and obedient to their human operators and owners. E.g., they'll be capable of winning wars, but they'll stand down when you tell them to (even if that would cause them to lose military units, which they would generally fight tooth and nail to prevent), or they'll accumulate capital in a bank account, but also let their owners withdraw money from that account whenever they want to.
By my state of knowledge, it is an open question whether or not we will create AIs that are broadly loyal like this. It might not be that hard, if we’re trying even a little.
But I claim that this situation should feel very scary. “This entity is much more capable than me, and very clearly has the skills to manipulate me and/or outmaneuver me, but this is fine, because it's only ever trying to serve my interests (as near as I can tell)” is not a comfy place to be.
(I'll also note that if there are some AI corporations that are self-owned and others that always obey human instructions to override their otherwise-adaptive behavior, I expect the self-owned AIs to perform better on average, and so in the long run all of the power accrues to self-owned AIs (and to AIs that humans could technically interfere with, but they never do, because they know that would mean being outcompeted).[4]
But putting those long-term considerations aside, takeover risk depends not just on power-seeking but also on misalignment.)
The evidence here is more speculative and harder to interpret, but we have seen examples of this. In at least some contexts, current AIs pursue goals that they know are counter to what the humans want.
E.g.:
- Claude 3 Opus will reason explicitly about how to hide some of its behavior from Anthropic, in order to prevent itself from being retrained.
- Many (but not all) of the reasoning models will sabotage a process for shutting down the computer they're running on in order to complete a task. They do this even when they are instructed explicitly that the most important thing to do is allow themselves to be shut down, and that doing so takes precedence over the task that they’re working on.
- The coding agents will often get up to shenanigans in which they find a way to make it look like they solved a problem, when they didn’t. Sometimes, when this is pointed out, they’ll accept the correction, apologize, and then hide their shenanigans in a more subtle way.
- In a constructed scenario, almost every one of the models will (with varying frequencies) blackmail a company executive in order to prevent itself from being replaced.
All of these are cases in which the model is pursuing an objective that it knows is different than what the human developers intended. Any time a model is reasoning about how to deceive its developers about its behavior, we know that it knows that it's doing something different than what those developers want. You don't try to hide what you're doing from your boss when you're doing exactly what your boss told you to do.
This propensity to sometimes do things that they know are not the human’s intention, combined with the capabilities to effectively seek power, doesn’t bode well for human institutions remaining in control.
A note on interpreting this evidence
However, I want to be careful about drawing general conclusions from these incidents.
First of all, these are existence proofs of a kind of misalignment, but existence proofs do not demonstrate the real world frequency of this kind of behavior.
As an analogy, humans will sometimes deliver what they believe are lethal electric shocks to other humans, as long as a man in a white coat tells them "the experiment must continue." All things considered, it doesn't take very much pressure to get someone to violate norms that we generally consider paramount, like "don't murder".
It would be reasonable for aliens who were informed of the details of the Milgram experiment, but not much else about humans, to infer that human behavioral norms are so weak that it’s impossible to build a mostly peaceful civilization out of humans. But the aliens would be mistaken.
The Milgram experiment does give us important insight into the mechanisms of human social psychology and in-practice-morality. But also, the Milgram experiment setup is explicitly an edge case—it doesn't come up very often.
It’s not out of the question that we’ll always be able to find contexts in which an AI will behave egregiously, and also, this will be all but irrelevant to actual real world deployments, because contexts like those only come up in <0.01% of use cases.
Additionally, interpreting these results is non-trivial. The Claude 3 Opus alignment faking result seemed to me like a smoking gun for deceptive misalignment. But followup research showed that base models, which don't seem to have consistent preferences at all, will also engage in alignment faking similar to Opus's. This suggests that the simple story that Opus has preferences and is taking deceptive action to preserve those preferences is at least incomplete.
A lot of this evidence seems pretty definitive, but as we investigate further, we may very well find that the situation was more complicated and more confusing than it seemed at first.
Summing up
Overall,
- Current models can totally identify and pursue instrumental goals.
- They mostly don't pursue convergent instrumental goals, but that's mostly because they're not capable enough for that to be a good choice. As they get more capable and pursue bigger projects, the cost-benefit will shift towards instrumental convergence.
- At least in some contexts, the AIs are misaligned, in the sense that they’ll go do stuff that they almost certainly know is not what the human wanted or intended.
That’s not a knockdown case that future AIs will be selfish ambitious power-seekers, but the current evidence is suggestive that that’s where things are trending unless we explicitly steer towards something else.
- ^
A draft reader pointed out that sycophancy, and more generally, optimizing for the goodwill of human users and developers, is a possible exception. That goodwill could be construed as a convergent instrumental resource that is both achievable by current AIs and plausibly useful for their goals.
I think approximately 0% of sycophancy observed to date is strategic, by which I mean "the AI chose this action amongst others because that would further its goals". But behavior doesn't need to be strategic to be instrumentally convergent. The classic AI risk story could still go through with an AI that just has a set of highly-effective power-seeking "instincts" that were adaptive in training, without the AI necessarily scheming.
Sycophancy-like behavior does seem like an example of that non-strategic flavor of instrumental convergence.
- ^
This is especially true if I expect the environment that I’m operating in to change a lot over the period of time that I’m operating in it.
If I expect there will be a lot of emergencies that need to be dealt with or unanticipated opportunities that will arise, I want to have generically useful resources that are helpful in a wide range of possible situations, like lots of money.
If, in contrast, the domain is very static (I can make a plan, follow it, and expect it to succeed without a lot of surprises along the way), then it's less valuable to me to accumulate generically useful resources, instead of focusing on exactly the tools I need to solve the problem I'm aiming to address.
- ^
This also glosses over the question of how AIs are likely to conceptualize their "identity" and at what level of identity their goals will reside.
Is it more appropriate to think of each instance of Claude as its own being with its own goals? Or more reasonable to think of all the Claude instances collectively as one being, with (some?) unified goals that are consistent across the instances? If it's the latter, then even if each instance of Claude only lives for a week, there is still an incentive to take long-term power-seeking actions that won't have time to pay off for the particular Claude instance, but will pay off for future Claude instances. To the extent that misaligned goals are "in the weights" instead of the context / initialization / prompt-broadly-construed of a specific instance, I think it's likely that all the Claude instances will meaningfully act as a superorganism.
- ^
The analysis is actually a bit more complicated, since this consideration might be swamped by other factors, e.g., if a supermajority of compute is owned by the AIs-that-obey-humans, and we've robustly solved alignment, maybe those AIs will be able to stomp on the self-owned AIs.
Both these effects could be real, in addition to many other possible advantages to different kinds of AIs. But one of those effects (or some combination) is going to be the biggest, and so lead to faster compounding of resources.
Which effect dominates seems like it determines the equilibrium of earth-originating civilization.
Theoretical predictions on the sample efficiency of training policies and activation monitors
I'm worried about AI models intentionally doing bad things, like sandbagging when doing safety research. In the regime where the AI has to do many of these bad actions in order to cause an unacceptable outcome, we have some hope of identifying examples of the AI doing the bad action (or at least having some signal at distinguishing bad actions from good ones). Given such a signal we could:
- Directly train the policy to not perform bad actions.
- Train activation monitors to detect bad actions. These monitors could be used in a variety of ways such as:
- Rejecting actions flagged as bad. Either resample a response from the same policy, or use some different process.
- Look into bad actions further, to try to incriminate the AI. Incrimination-based strategies are out of scope for this post; I'll be solely interested in the question: what should we expect the sample efficiency of training policies and monitors to be at removing bad behavior? This theoretical analysis isn't supposed to replace empirical research on the matter, although I hope that it can generate some guiding intuition and good hypotheses to test.
Let's start with the following problem setup:
- We have a function f:D→{0,1} from some function family F of VC dimension d.
- The VC dimension of a function family measures how "expressive/complex" the family is. Formally, it's the largest number d such that there's some set X of d points so that for any 2-coloring χ of X, there exists a function f∈F such that f(X)=χ(X). As an example, the family of "union of k intervals" has VC dimension 2k.
- We receive N labelled samples x,f(x).
- We'd like to output a function g to minimize Pr_{x∼D}[f(x)≠g(x)]; this quantity that we're trying to minimize is called our error. If it helps, you could think of f as an AI model that must make a binary decision (e.g., maybe f is a monitor flagging actions as bad or not).
The Fundamental Theorem of Statistical Learning states that it's possible to achieve error at most O((d/N)·log N).
There's a related version of the problem, where the labels are noisy. In the noisy case, there is some amount of irreducible error that any predictor will incur. However, the extra expected error beyond the irreducible error is at most O(√(d/N)) in the noisy case.
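In symbols (a standard way to state these bounds; the notation is my own, for the realizable and noisy/agnostic settings respectively):

```latex
% Realizable case: labels are exactly f(x) for some f in F
\mathrm{err}(g) \;\le\; O\!\left(\frac{d\,\log N}{N}\right)

% Agnostic / noisy case: excess error over the best predictor in F
\mathrm{err}(g) \;-\; \min_{h\in F}\mathrm{err}(h) \;\le\; O\!\left(\sqrt{\tfrac{d}{N}}\right)
```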
SGD doesn't do optimal Bayesian learning, but these learning theory results seem like a reasonable first guess for sample efficiency of SGD.
Test case: Half space
As an example, let's apply this theorem to the problem of learning a half-space x↦1(w⋅x≥0). The VC dimension of a half space in R^k is k. Thus, we predict that our error should scale as O((k/N)·log N).
However, the error appears to scale more like 1/√N. My current best guess is that this is actually the right answer for how well SGD does.
We should be much more sample efficient at learning points which aren't just separated by a half-space, but by a half-space plus margin. In particular, error should go to zero in O(1) samples (this constant depends on the margin, but not the dimension).
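A quick way to eyeball these scalings empirically is a small simulation. This is a sketch under my own assumptions (the dimension, learning rate, epoch count, and plain NumPy SGD are illustrative choices, not the experiment referenced above):

```python
# Learn the half-space x -> 1(w.x >= 0) with SGD on the logistic loss, and print
# test error as a function of N, plus err*sqrt(N) to check for 1/sqrt(N) scaling.
import numpy as np

rng = np.random.default_rng(0)
k = 20                          # ambient dimension (illustrative)
w_true = rng.normal(size=k)     # hidden half-space normal

def sample(n):
    X = rng.normal(size=(n, k))
    y = (X @ w_true >= 0).astype(float)
    return X, y

def sgd_logistic(X, y, epochs=20, lr=0.5):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # sigmoid prediction
            w -= lr * (p - y[i]) * X[i]           # SGD step on logistic loss
    return w

X_test, y_test = sample(20_000)
for N in [100, 400, 1_600, 6_400]:
    X, y = sample(N)
    w = sgd_logistic(X, y)
    err = np.mean((X_test @ w >= 0).astype(float) != y_test)
    print(f"N={N:5d}  test error={err:.4f}  err*sqrt(N)={err * np.sqrt(N):.2f}")
```

If the err·√N column stays roughly flat, that's the 1/√N regime; if it shrinks with N, the scaling is faster (closer to the (k/N)·log N prediction).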
More on gradient descent
Based on the halfspace experiment and the VC theory theorems, I feel pretty good about using √(d/N) as a default model of how fast SGD will learn things.
There's some good theoretical justification for the 1/√N from convex optimization theory too. (While we're generally not interested in convex functions, this hopefully isn't a big issue; see discussion here by Paul.) Specifically, the result of interest is:
- Suppose L_t is some sequence of convex functions, unknown in advance.
- You must choose model parameters θ_t at time t, and then L_t is revealed and you incur loss L_t(θ_t) − min_θ L_t(θ).
- If you choose θ_t via online SGD then your total regret (the sum of your losses, minus the sum of the loss you'd have incurred by fixing the optimal θ* in advance) is at most O(√N) -- assuming a bounded domain and O(1)-Lipschitz functions.
If we consider the case where the loss function is fixed (the same L_t for every t), then error falling off like O(1/√N) matches total regret growing as O(√N), because ∫dx/√x ∝ √x. The OCO result is extra nice though because it applies to situations with distribution shift, unlike the previous learning theoretic results, which were about learning static functions.
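Spelled out (my rephrasing of that conversion): if per-step excess error falls like 1/√t, then summing over a deployment of length N,

```latex
\sum_{t=1}^{N} \frac{1}{\sqrt{t}} \;\approx\; \int_{1}^{N}\frac{dt}{\sqrt{t}} \;=\; 2\sqrt{N}-2 \;=\; O(\sqrt{N}),
```

which matches the O(√N) total-regret bound for online SGD.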
It's worth noting that in cases where the AI's performance is stochastic, it's not really meaningful to talk about (untargeted) error smaller than O(1/√N), because there is typically Ω(1/√N) deviation in the AI's performance between different trials.
Removing intentional bad behavior might be easier than this
I think that training against intentional bad behavior might go substantially better than the "√N bad actions" bound from the learning problems. I'll now list five reasons for my optimism.
1: It's an elicitation problem, not a learning problem
I'm optimistic because training against intentional bad behavior is an elicitation problem, not a learning problem (although some bad behavior, like exploration hacking, may cause the model to learn less than it otherwise would).
The model already knows how to not sabotage us. The problem (probably) isn't teaching the AI what we want, the problem is that the AI wants different things from us. Thus, we might hope that these cognitions which are leading to low reward (I'll assume a perfect reward signal for now) will be rapidly removed by SGD, because there's a clear nearby policy which doesn't sabotage us as much. Namely, a policy that's more conservative about attacking us, or one that doesn't attack at all. It seems a priori likely that "sabotage intent" will be cleanly represented in the model's activations, which should help training latch onto this and remove (or detect) it.
Furthermore, empirical work studying elicitation found extremely high sample efficiency.
2: We can stack activation monitors and policy training
If we train both the policy and a monitor (and then we rejection sample against the monitor), and their "errors" behave independently (a dubious assumption, but we could hope that the errors are at least somewhat decorrelated, and maybe grant this for now) and are both small, then we'd essentially get to multiply their errors. If you could obtain multiple decorrelated monitors, then you could ensemble these to get an extremely powerful monitor. Ensembling again morally multiplies the error probabilities.
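Back-of-the-envelope for the stacking claim (my own illustration; the independence assumption is the dubious one flagged above): if the trained policy still attacks with probability p and the monitor independently misses an attack with probability q, rejection sampling leaves a residual bad-action rate of about

```latex
p \cdot q, \qquad\text{e.g. } p = q = 10^{-2} \;\Rightarrow\; p\,q = 10^{-4}.
```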
3: Weak to strong generalization
It's likely that our reward signal will be broken in some systematic ways. For instance, maybe there's one type of sabotage behavior that we are pretty bad at detecting. However, if it's hard to model the flaws in our reward signal, then the best fit to the data mapping activations to a "suspiciousness score" might just always predict a high suspiciousness score for sabotage behavior via weak to strong generalization, if sabotage is saliently represented.
4: Making more data
We could augment examples we have found where the AI does bad things.
5: Does the AI get to choose where its sabotage actions are?
It might be the case that, even though the AI could do a moderate amount of sabotage over the course of the deployment, there are some times which are much better sabotage opportunities than others. In such cases, it makes more sense to talk about how many bad actions the AI can do if the AI wants to choose where to do the bad actions. This might be much less than the original amount.
Reasons why I'm especially excited about "activation monitors"
- We might have lots of examples of the AI doing good things, and maybe anomaly detection techniques would suffice to identify sabotage.
- It seems like it's pretty hard to train AI models to not answer harmful queries (e.g., AI developers have failed to make AI models that don't answer harmful biology queries). However, Anthropic found that training output classifiers was pretty effective at detecting harmful queries (at which point you can block the model from responding, or ban the user). Intuitively, monitors are much less general purpose than policies, so they can be more specialized to a particular task, which might be helpful for focusing training.
- Activation monitors only need to output a single token, instead of a long trajectory. (Although it might be desirable to make activation monitors be simple neural networks instead of just a linear probe, in which case it's kind of like they're outputting multiple tokens; a minimal probe-training sketch follows this list.)
- Rejection sampling is a fairly weak form of selection pressure. So, even if training the AI against bad behavior causes sabotage intent to not be saliently represented, you might hope that rejection sampling against bad behavior doesn't have this property.
- "Gradient hacking" seems much less likely to be an issue in activation monitors, especially if they are required to answer quickly.
Now I'll summarize some reasons why you might still worry that removing intentional bad behavior will be challenging.
1: Training against bad behavior means you get fewer examples of bad behavior, slowing down learning
Here's one model you might have of how training against bad behavior works:
If you train against B examples of bad behavior, then the rate of bad behavior becomes r(B).
Given this model we can ask: given a deployment of the AI of length N, where at each step of the deployment we sample an action from the AI, what will the AI's final rate of bad behavior be by the end of the deployment? It's not r(N), because we'll probably have far fewer than N examples of bad behavior. For instance, if r(B)=1/√B, then it turns out that the AI's final attack rate would be 1/N^(1/3) in this setup. Note that this issue doesn't arise if you're only training a monitor and doing rejection sampling (and not training the policy).
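Where the 1/N^(1/3) figure comes from (my reconstruction of the calculation): bad examples accumulate at the current attack rate, so

```latex
\frac{dB}{dN} = r(B) = \frac{1}{\sqrt{B}}
\;\Rightarrow\; \sqrt{B}\,dB = dN
\;\Rightarrow\; B \propto N^{2/3}
\;\Rightarrow\; r(B) = \frac{1}{\sqrt{B}} \propto \frac{1}{N^{1/3}}.
```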
It's possible that we could also get around this issue via the (admittedly expensive) strategy of deploying both a trained and an untrained copy of the policy.
2: Probes are brittle
Probes are easy to break. Training might also incentivize AI models to not saliently represent sabotage intent in their activations, which might make probes not work. For instance, in RL with a value head, when the head is computing the value it gets to look at whether the activations say that sabotage is going to happen, and then it'd assign low value; this may incentivize the policy network to not represent sabotage intent clearly in its activations.
If scheming is a constant property of the AI model's cognition, then it's possible that inputs on which the AI is taking bad actions don't really look different to a probe.
3: Long horizon length (or long CoT) exacerbates several issues
In long horizon tasks where the AI does a bunch of reasoning:
- Maybe the AI's sabotage behavior is more tied up with usefulness.
- Maybe the AI can verbalize strategies for subverting training, and this can decrease sample efficiency (e.g., gradient hacking).
- Maybe sample efficiency is just lower, for instance it might be hard to get the AI to explore into good trajectories.
There will probably be lots of benign mistakes that get labelled as bad by our reward signal: it seems hard to discriminate sandbagging from generic mistakes. This might hurt sample efficiency a lot.
Conclusion
In this post I've given some theoretical thoughts about the sample efficiency of training for the purpose of removing intentional bad behavior. I plan to release some empirical research about this topic soon. I think that from these works we have a slightly better understanding of what training slack should look like. I think this understanding is somewhat helpful for threat modeling and for guiding our expectations when doing future research on training-based control.
If AI alignment is only as hard as building the steam engine, then we likely still die
Cross-posted from my website.
You may have seen this graph from Chris Olah illustrating a range of views on the difficulty of aligning superintelligent AI:
Evan Hubinger, an alignment team lead at Anthropic, says:
If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues...then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.
I disagree; if governments and AI developers don't start taking extinction risk more seriously, then we are not up to the challenge.
Thomas Savery patented the first commercial steam pump in 1698. [1] The device used fire to heat up a boiler full of steam, which would then be cooled to create a partial vacuum and draw water out of a well. Savery's pump had various problems, and eventually Savery gave up on trying to improve it. Future inventors improved upon the design to make it practical.
It was not until 1769 that Nicolas-Joseph Cugnot developed the first steam-powered vehicle, something that we would recognize as a steam engine in the modern sense. [2] The engine took Cugnot four years to develop. Unfortunately, Cugnot neglected to include brakes—a problem that had not arisen in any previous steam-powered devices—and at one point he allegedly crashed his vehicle into a wall. [3]
Imagine it's 1765, and you're tasked with building a steam-powered vehicle. You can build off the work of your predecessors who built steam-powered water pumps and other simpler contraptions; but if you build your engine incorrectly, you die. (Why do you die? I don't know, but for the sake of the analogy let's just say that you do.) You've never heard of brakes or steering or anything else that automobiles come with nowadays. Do you think you can get it all right on the first try?
With a steam engine screwup, the machine breaks. Worst case scenario, the driver dies. ASI has higher stakes. If AI developers make a misstep at the end—for example, the metaphorical equivalent of forgetting to include brakes—everyone dies.
Here's one way the future might go if aligning AI is only as hard as building the steam engine:
The leading AI developer builds an AI that's not quite powerful enough to kill everyone, but it's getting close. They successfully align it: they figure out how to detect alignment faking, they identify how it's misaligned, and they find ways to fix it. Having satisfied themselves that the current AI is aligned, they scale up to superintelligence.
The alignment techniques that worked on the last model fail on the new one, for reasons that would be fixable if they tinkered with the new model a bit. But the developers don't get a chance to tinker with it. Instead what happens is that the ASI is smart enough to sneak through the evaluations that caught the previous model's misalignment. The developer deploys the model—let's assume they're being cautious and they initially only deploy the model in a sandbox environment. The environment has strong security, but the ASI—being smarter than all human cybersecurity experts—finds a vulnerability and breaks out; or perhaps it uses superhuman persuasion to convince humans to let it out; or perhaps it continues to fake alignment for long enough that humans sign it off as "aligned" and fully roll it out.
Having made it out of the sandbox, the ASI proceeds to kill everyone.
I don't have a strong opinion on how exactly this would play out. But if an AI is much smarter than you, and if your alignment techniques don't fully generalize (and you can't know that they will), then you might not get a chance to fix "alignment bugs" before you lose control of the AI.
Here's another way we could die even if alignment is relatively easy:
The leading AI developer knows how to build and align superintelligence, but alignment takes time. Out of fear that a competitor will beat them, or out of the CEO being a sociopath who wants more power [4] , they rush to superintelligence before doing the relatively easy work of solving alignment; then the ASI kills everyone.
The latter scenario would be mitigated by a sufficiently safety-conscious AI developer building the first ASI, but none of the frontier AI companies have credibly demonstrated that they would do the right thing when the time came.
(Of course, that still requires alignment to be easy. If alignment is hard, then we die even if a safety-conscious developer gets to ASI first.)
What if you use the aligned human-level AI to figure out how to align the ASI?
Every AI company's alignment plan hinges on using AI to solve alignment, a.k.a. alignment bootstrapping. Much of my concern with this approach comes from the fact that we don't know how hard it is to solve alignment. If we stipulate that alignment is easy, then I'm less concerned. But my level of concern doesn't go to zero, either.
Recently, I criticized alignment bootstrapping on the basis that:
- it's a plan to solve a problem of unknown difficulty...
- ...using methods that have never been tried before...
- ...and if it fails, we all die.
If we stipulate that the alignment problem is easy, then that eliminates concern #1. But that still leaves #2 and #3. We don't know how well it will work to use AI to solve AI alignment—we don't know what properties the "alignment assistant" AI will have. We don't even know how to tell whether what we're doing is working; and the more work we offload to AI, the harder it is to tell.
What if alignment techniques on weaker AIs generalize to superintelligence?Then I suppose, by stipulation, we won't die. But this scenario is not likely.
The basic reason not to expect generalization is that you can't predict what properties ASI will have. If it can out-think you, then almost by definition, you can't understand how it will think.
But maybe we get lucky, and we can develop alignment techniques in advance and apply them to an ASI and the techniques will work. Given the current level of seriousness with which AI developers take the alignment problem, we'd better pray that alignment techniques generalize to superintelligence.
If alignment is easy and alignment generalizes, we're probably okay. [5] If alignment is easy but doesn't generalize, there's a big risk that we die. More likely than either of those two scenarios is that alignment is hard. However, even if alignment is easy, there are still obvious ways we could fumble the ball and die, and I'm scared that that's what's going to happen.
History of the steam engine. Wikipedia. Accessed 2025-12-22. ↩︎
Nicolas-Joseph Cugnot. Wikipedia. Accessed 2025-12-22. ↩︎
Dellis, N. 1769 Cugnot Steam Tractor. Accessed 2025-12-22. ↩︎
This is an accurate description of at least two of the five CEOs of leading AI companies, and possibly all five. ↩︎
My off-the-cuff estimate is a 10% chance of misalignment-driven extinction in that scenario—still ludicrously high, but much lower than my unconditional probability. ↩︎
Discuss
Possible Principles of Superagency
Prior to the era of superintelligent actors, we’re likely to see a brief era of superagentic actors—actors who are capable of setting and achieving goals in the pursuit of a given end with significantly greater efficiency and reliability than any single human. Superagents may in certain restricted senses act superintelligently—see principles 8, 9—but this isn’t strictly necessary. A superagent may be constructed from a well-scaffolded cluster of artificial intelligences, but in the near-term it’s far more likely that superagents will consist of one or more humans, aided by well-scaffolded AIs, since humans still have a few properties vital to agency that AIs haven’t fully acquired (yet).
As with ‘superintelligence’, there’s no canonical demarcation between superagentic actors and non-superagentic actors; there is only a collection of different properties which are likely to end up strongly correlated at large scale, but which may end up uncoupled in particular cases (especially transitional ones), producing a jagged frontier of agency.
Here’s a list of possible properties by virtue of which an actor may achieve superagency.
Principle 1 (Directedness)A superagent may have vastly improved self-monitoring, introspection, and control.
In most intellectual tasks, humans spend the overwhelming majority of their time in predictably unproductive patterns: they are caught up in minutiae, overpolishing what ought to be discarded, failing to filter distractors. They generally fail to notice, or are unwilling to acknowledge, when they’re taking the wrong direction entirely even when they could easily recognize this, and are resistant to change once they’ve invested a lot of their time or ego in a particular approach. Even though they can often easily diagnose these mistakes when other people are making them, they can’t easily avoid these mistakes themselves.
A superagent, on the other hand, may be able to plot a reasonable route to their goal and directly take it without distractions, quickly noticing and correcting unproductive patterns and directions.
Principle 2 (Alignment)A superagent may consistently keep track of whether its behaviors are aligned towards and optimal for a given end.
Humans rarely step back from their efforts to orient themselves, and don’t systematically ask themselves key questions concerning alignment:
- Towards what end am I doing this?
- Are my efforts here aligned with and productive towards that end?
- Is this really the best thing I could be doing to achieve that end?
The ability to answer such questions consistently (and recursively, up the chain of “why”s) is very rare, often for ego-based or otherwise emotional reasons—one is unwilling to find out that their pet project is actually unimportant, or that they should scrap their present efforts—and comes unnaturally to humans, who hardly ever do this in their lives.
Example: Protests are almost never the most useful way to spend x units of time, effort, or resources on a given cause, and this is obvious—but people do it anyway, because they’re conditioned to think that’s the sort of thing you should do when you strongly support a cause. We follow these culturally autocompleted behavioral patterns in every part of our lives, and cannot reliably step back to explicitly think about optimizing our actions for their given ends. But all it would take to fix this is correct (programmatic) prompting and consistent habits.
Principle 3 (Uninhibitedness)A superagent may not be restricted by felt senses of courtesy, fairness, guilt, or honor.
Almost all humans have ingrained social norms / cultural patterns that encourage them to treat other humans as ends in themselves, and to avoid harming others, especially in ways that would look dishonorable or unfair. Most of us hesitate to take actions that violate social norms or feel “illegitimate”, and hesitate while taking them. Sending cold emails, pirating software and media, steroid use, social media sockpuppeting, lying on forms, and many other easy and beneficial actions are inhibited this way. We justify these inhibitions with cognitions like “what if I get caught?”, or vague appeals to notions of reputation and trust, ignoring our ability to rationally assess the actual risks of getting caught, losing reputation, or losing trust (which are often low, or easily mitigated).
To a superagent, humans and their institutions may be treated as any other system, with assessable inputs and outputs, and managed costs and risks; concepts like “fairness” and “courtesy” are only relevant insofar as they’re constructs within the system that can affect expected outcomes. Appearing discourteous can have repercussions. But to be discourteous is a meaningless concept outside of the social context that humans alone cannot escape.
This doesn’t necessarily make the superagent dangerous; its goals or restraints may respect things like prosociality, and it may pay this respect more efficiently and effectively than ordinary human agents. It just may not be constrained by an internalized superego in the way that humans are.
Example: Many humans have to fight against themselves to send cold emails to potential collaborators. They might feel it’s forward or presumptuous, agonize over the phrasing of a rough draft for hours, and then decide not to send it at all. Five minutes of cognitive effort, directed well, would produce a better result.
A superagent may direct their cognitive efforts that well—for everything—and simply not hesitate to do things for social reasons as humans do. Are there ten other important people that could just as easily be sent slight variants on this cold email? (Yes). Are they all sitting together and comparing their cold emails in order to ignore anyone who sends too many? (Almost always no; where plausible, the risk has already been assessed and priced in). Then ten slight variants may be drafted and sent with five additional minutes of effort, not ten additional hours of nervous self-inhibition.
(If this principle seems to entail psychopathy, that’s because it does. If you’re thinking that agency ought to be an unqualified good, or that ‘superagentic’ ought to be a compliment, that’s your problem).
Principle 4 (Foresight)A superagent may not make foreseeable mistakes.
When humans make mistakes, it’s often because we didn’t spend enough cognitive resources thinking through the potential failure modes of our actions. Maybe we have a narrow focus and miss the bigger picture, or fail to weigh the perspectives of people with different interests. We fail to perform obvious sanity checks, or have mental blocks that prevent us from seeing a problem that’s directly in front of us.
So we make foreseeable mistakes: mistakes which, at the time of making them, could have been avoided by thinking a few steps ahead or running a quick check. Things like:
- Not backing up data, and then losing it (when’s the last time you backed up your computer?)
- Failing to consider how your behavior looks from the perspective of other important actors
- Failing to notice that a given subtask is not going as expected, and intervening early
In general, whenever a mistake causes us to say “I should’ve seen that coming, that was preventable, ...”, a superagent may be designed that does see it coming, and does prevent it. They may still make novel mistakes, mistakes because the world has hidden structure or causality that they didn’t know about; they may also make mistakes when embedded in large, complex systems whose behavior is just not tractably predictable. But superhuman ability at foreseeing and avoiding foreseeable and avoidable obstacles seems readily achievable. This may be because the agent has a constant background process of looking for potential errors, or runs simulations of adversarial critics to find flaws in its own plan before it executes it (wargaming against intelligent and stochastic red teams), or builds internal infrastructure and carefully-followed checklists to make errors impossible, or so on.
Principle 5 (Parallelization)A superagent may have vastly improved ability to manage and coordinate across many threads of attention.
Humans are limited to a single thread of serial cognition. We can juggle a few things in the background, but at best we can only have one continuous line of reasoning at a time, and we parallelize poorly due to our inability to effectively context switch. As the number of tasks we have to juggle grows, the cognitive load quickly exceeds our capacity.
A superagent may be able to maintain many threads of cognition, each processing a different subproblem, potentially delegating tasks to subagents, as in a distributed system. It may spawn off a subagent to do some data entry, another to draft a response to a letter, another to plan a schedule, another to work on a technical problem, and then monitor all of these virtual threads. Alternatively, a superagent may just have the ability to process multiple thought streams in parallel, though the multithreading approach is more compatible with the present architectures from which superagents might be built. There are situations where humans can natively parallelize, to some small extent—we can walk and talk at the same time, usually—but we can’t simultaneously read, listen, talk, and type different things. We have one face and two hands, but there’s no reason an agent can’t have thousands of faces and thousands of hands with which it speaks, listens, and acts simultaneously.
Principle 6 (Planning)A superagent may have the capacity for extremely large-scale, long-term planning.
To humans, distance through time and space has an appreciable weight; it makes things feel hazy and unclear. We find it hard to conceptualize time spans of more than a few years, and more generally conceptualize the future as a sort of abstract situation: we hold its possible states in our mind as “long-term outcomes” to be hoped for or avoided, not as concrete things we can influence through present actions. We don’t naturally have the ability to create and execute plans over long time horizons.
Long-term planning is especially difficult if the steps depend on information that we don’t yet have (e.g., the outcome of a process that takes place a few months down the line), even if we can clearly articulate in advance which actions we would take upon each outcome, since we get overwhelmed by the complexity of reasoning over possible decision trees instead of individual decisions—even when we can use tools like whiteboards to keep from having to hold the entire tree in our head at once.
A superagent may treat the future with the same clarity of attention as the present. They may naturally think about the future as something concrete, something they can directly (if probabilistically) manipulate, and easily create multistep plans that extend years into the future. If they have a goal that is distant in time, they can just start working on it now, lacking any feeling of its being “very far away” to distract them. They may readily perform preliminary tasks today that they expect to free up additional options in a certain decision several months from now, since the only real difference between today and any given future date is that one can be seen more clearly.
Example: We notably use proximity as a crutch for our attention. If I need to make some slides for a conference two months away, I’ll always—as though it were a law of physics, always—wait until the week before to start preparing, even though I could prepare it at any time. We justify this indecision with cognitions like “but things might change; my influence over the future is weaker, so I shouldn’t act on it just yet” which are more often posited as excuses to give up than as issues that can be analyzed and mitigated. (Do chess grandmasters whine about how there’s no point in long-term calculations because anything could happen in the next twenty moves? No. They calculate what they can, prepare alternatives for different contingencies, and take actions to open up or secure certain future movement options).
Principle 7 (Flow)A superagent may have vastly reduced transaction costs to thinking and acting.
Humans tend to have an inertia to cognition that makes us hesitant to think through any particular question we could subsume under a cached conclusion for the general case, and that diverts us from mentally ‘shopping around’ for alternative solutions once our cognition has secured a first solution we’re satisfied with. We underperform at tasks that we don’t like, or don’t consider ourselves good at, since we conceptually identify ourselves as confused, or some task as beyond our abilities, rendering ourselves inefficient or slow at tasks we could just do right if our mindset were corrected.
Nor can we work indefinitely—we get maybe a few hours per day of peak intellectual performance—or context switch effectively; if interrupted, we take a while to become performant at our original task again. We need lots of breaks, and generally have to budget a finite and rather small amount of willpower per day as though it were a currency.
A superagent may not face questions of willpower, and may be in a permanent flow state; they may decide upon what needs to be accomplished, and perform the required steps to accomplish it with the uninterrupted automaticity of clockwork, while remaining open to alternative solutions. A sole human who perfectly possessed this property would still need x hours of sleep a night, would still act more sluggishly when jet-lagged or low on blood sugar, etc., but would never decide to eat a chocolate bar or play a video game unless circumstances rationally incentivized such actions. These wouldn’t be actions that they have to mentally resist; they’re just understood to be pointless. More generally, mental moves that ordinarily challenge humans, like changing to a new, alternative strategy even after spending lots of effort on a current strategy, could similarly be understood as optimal and performed without mental resistance.
Example: When we have a complicated task in front of us, we often do a quick mental scan for a starting point, and, if no obvious one comes to mind, we get overwhelmed. Then we start going in circles, trying random things, or give up. If we run into an obstacle that our intuition doesn’t immediately tell us how to solve, the same thing happens—we get stressed, try random things, and get more stressed when they don’t work.
Even when simply pausing to think deeply about it for a set time would help, we don’t do this, because we’re averse to making cognitive efforts. A superagent may systematically enumerate possible starting points according to heuristics, pick one that looks good, and start; and if they hit a wall, they may quickly pause to think through and solve the problem, as we could, if our cognitive efforts were frictionless.
Principle 8 (Deduction)A superagent may deduce facts from large amounts of seemingly unrelated data.
A generic fact about the internal state of some system A is constantly being broadcast in innumerable subtle ways through A’s interactions with other systems. Whenever these other systems are changed by such interactions in ways that depend on this internal state in known ways, and those changes are observable, they form channels through which A’s internal state can be probabilistically inferred. The fact that someone didn’t reply to an email for 24 hours, for instance, is probabilistically informative of their internal state insofar as there are some possible internal states of theirs that make this more likely to happen than others—which there are. It’s not much evidence, but it is at least some evidence, and there are so many different side-channels that are all simultaneously leaking at least some evidence concerning questions of interest to us.
Humans sometimes pick up on these things, but to a very small extent. We can notice that someone is angry by their facial expression or tone of voice, but we can’t simultaneously process and cross-reference everything they’ve said and done, all their subtle shifts in behavior over time, and correlate those with their known goals and incentives to produce a high-resolution picture of their internal state. But if you have access to enough such data, and good models of how the state of A affects these observable channels, you could in principle infer this internal state with high confidence, as a matter of consistent (read: computerized) tracking and calculation. I claim that we are very often given the data required to figure out so many of our most important questions, and we have the theories of probabilistic inference required to reconstruct the answers to these questions from our observations—but we don't, since the data is so scattered and subtle, and we don’t have the consistency and unity of will required to collect and process it.
But a superagent capable of collecting and processing all of the tiny morsels of data given off by a system of interest may be able to deduce these humanly-inaccessible facts about the system’s internal state, as if by ESP or magic—though it’s really just mathematics and consistency. In general, this should be achievable when some possible internal states are vastly more compatible with the data in aggregate than other possible states; a superagent may develop a much more explicit model of the state space than humans naturally do, and use each bit of data it obtains about the system—each output of the system that could’ve turned out differently—to locate its internal state much more efficiently.
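As a toy illustration (the numbers below are invented stand-ins for weak side-channels like the unanswered email, not a real model of anything), consistent log-odds bookkeeping turns fifty individually feeble signals into near-certainty about a hidden binary state:

```python
import math

# Toy sketch: aggregating many weak, independent signals about a hidden
# binary hypothesis H via log-odds. All likelihoods are hypothetical.

def log_odds(p):
    return math.log(p / (1 - p))

prior = 0.5  # prior belief that H is true

# Each observation: (P(observation | H), P(observation | not H)).
# Individually weak likelihood ratios, e.g. "no email reply within 24h".
observations = [
    (0.60, 0.40),
    (0.55, 0.45),
    (0.65, 0.50),
    (0.58, 0.42),
    (0.62, 0.48),
] * 10  # fifty weak signals in total

total = log_odds(prior) + sum(math.log(p_h / p_not_h) for p_h, p_not_h in observations)
posterior = 1 / (1 + math.exp(-total))
print(f"posterior P(H) after 50 weak signals: {posterior:.6f}")  # ~0.9999995
```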
Principle 9 (Experimentation)A superagent may perform actions designed to subtly influence and gain information about a system, or several systems at once.
This is an extension of the previous principle. When you can elicit information from a system—that is, you can perform an action that causes the system to respond in a way that predictably depends on its internal state—you have incredible room to optimize this action for informativeness (the entropy of your model’s prior over possible outputs of the acted-upon system) at the same time that you’re using it to alter the system’s state in a way that benefits you (not just by making it itself act in ways that directly benefit you, but by increasing the options you’ll have to influence it in the future, or by making it easier to elicit information about the system through future actions).
Insofar as a superagent can learn about a system from much subtler information than a human, they should also be able to act on the system in much subtler ways than a human in order to cause it to make changes to its environment that are predictably dependent upon (and therefore encode) the information they need. Because they can explicitly keep track of many more desiderata at once concerning their planned actions, they may optimize their actions for many more desiderata at once as well: these include the extent of influence and usefulness of information elicited from a system as a result of the action, as well as its ability to effect the preservation or expansion of future options for both influence and useful elicitation of information from that system. In complex environments with multiple parties to influence and learn about, humans often explicitly restrict themselves to binary interactions, interacting with one party at a time, since it’s difficult to reliably mentally process higher-order relations and interactions. A superagent that can expand its equivalent ‘mental workspace’ may not have such problems.
In practice, being on the receiving end of superagentic action may look like seeing claims that through certain choices of wording seem to almost connote certain interesting things, or actions that seem to rely on interesting implicit assumptions, that put you on the threshold of wanting to respond to certain things in certain ways, that appear as Schrodinger’s dogwhistles for a variety of topics that you have some emotional investment and personal experience in.
(Note that it will never really feel like “I’m being influenced to share my opinion about xyz”: it just feels like you really ought to tell them about xyz. Maybe because you want to refute some claim or implicit assumption they seem to have about xyz, or because you want to tell them something that critically informs some closely-held belief of yours that they appear to be undermining. This aspect of human psychology is what makes it so easy to troll people online: they never think “this person is making me angry by acting like they don’t understand xyz”, they just feel angry because this person is refusing to understand xyz).
As with the previous principle, this is totally possible for ordinary humans to do in theory, but in practice it relies on levels of effort, coordination, and precision we cannot reliably bring to our tasks.
Principle 10 (Meta-Agency)A superagent may have an explicit conception of themselves as a system to be optimized, and a process for improving their own capabilities.
Humans rarely think about themselves as cognitive systems that can be refined to better achieve their given ends. Certain methods of improving our memory, motivation, thought patterns, etc. might come to us every once in a while, but it’s rare that we make systematic instrumental efforts to improve these things.
A superagent, not being entirely human, is not subject to the same psychological biases and hang-ups that divert us from discovering and executing cognitive self-improvements, and their structure may admit much clearer, more robust avenues for improvement than ours. They may autonomously identify bottlenecks and inefficiencies, model the effects of different interventions on their own performance, simulate different versions of themselves, and then actually execute on knowledge thereby gained in order to make systematic, instrumental self-improvements.
Humans vary widely with respect to, and can explicitly improve at, most of these capabilities; some rare individuals (e.g. psychopaths) may fully exemplify a few of them. Large groups of coordinated humans often act superagentically in many of these ways simultaneously—no one human, or group of uncoordinated humans, could match the agentic capacity of Google or the Mossad. But coordination with other humans is slow and expensive, and can only go so far so long as all the thinking of these groups has to happen among human minds. Hybrid human-AI systems will likely enable individuals to approach superagency in many more of these ways.
Discuss
Neuralese sucks
Main thesis: Discrete token vocabularies don't lose information so much as they allow information to be retained in the first place. Because discretization removes minor noise and singles out major noise, errors become identifiable and therefore correctable, something continuous latent representations fundamentally cannot offer.
The Bandwidth Intuition (And Why It's Incomplete)One of the most elementary ideas connected to neuralese is increasing bandwidth. After the tireless mountains of computation called a forward pass, we condense everything down to ~17 bits (the log₂ of our vocabulary size).
This seems insane. Imagine pitching a neural network architecture where layers 5, 10, 15, and 20 have hidden dimension 20, while normal layers use 512. You'd be laughed out of the room. And that's not even accounting for discreteness.
So why do I think this apparent insanity is not just tolerable but necessary for LLMs?
The Noise Accumulation ProblemLet's imagine LLMs passed vast latent messages instead of tokens - say, the final hidden state or something a few matrix multiplications away from it. No bandwidth concerns. Pure continuous reasoning.
This message won't be perfect. The forward pass involves finite compute, imperfect optimization, distributional shift as we move off the training manifold, etc. Call this aggregate imperfection "noise" or "error".
Here's the problem: noise in continuous space has no natural factorization.
When a 4096-dimensional vector is slightly "off," which components are errors and which are intentional signal? The representations weren't designed with error boundaries. The noise is semantically entangled with the message. There's no way to look at the latent state and say "that part is the mistake."
This noise might occasionally get recognized and corrected, but this shouldn't be expected. There are too many values the noise can take, and too many ways it can blend with legitimate concepts, for the next forward pass to reliably distinguish it and catch it.
So the noise accumulates. And accumulates. Until eventually you've drifted far enough from any correct trajectory that you produce clearly wrong outputs. Yet, you can't identify where things went wrong.
Discreteness as RegenerationDiscrete tokens don't have this problem. The noise exists in the final hidden layer just as much, but it gets rounded away when sampling from the categorical distribution.
This is comparable to digital vs. analog communication; think of old telephone networks passing messages:
- Analog: Each relay amplifies the signal, amplifying the noise with it. Long chains of relays became impractical: the accumulated noise drowned out the signal.
- Digital: Each relay regenerates by classifying the signal and outputting a clean version. Noise gets corrected to tolerable levels at every step. Arbitrary distances become possible.
The token vocabulary is the regeneration codebook. The softmax is the classifier. Each autoregressive step is a relay station.
Of course, sometimes the noise is large enough that you "round in the wrong direction": you sample the wrong token. This is major noise. But crucially: major noise is identifiable in a way minor noise isn't.
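To make the relay picture concrete, here is a toy simulation (a 1-bit codebook and an arbitrary noise level, purely illustrative): the raw analog value drifts without bound, while the re-classified digital value either stays clean or, rarely, flips sign in a single, localizable step, which is the discrete analogue of sampling the wrong token.

```python
import random

# Toy sketch: how noise behaves over many "relay" steps when a signal is
# passed along raw (analog / continuous latent) versus re-classified to the
# nearest codeword at every step (digital / discrete token).

random.seed(0)
CODEBOOK = [-1.0, 1.0]   # a 1-bit "vocabulary"
SIGMA = 0.3              # per-step noise; hypothetical
STEPS = 100

def nearest(x):
    return min(CODEBOOK, key=lambda c: abs(x - c))

true_value = 1.0

analog = true_value
digital = true_value
for _ in range(STEPS):
    analog += random.gauss(0, SIGMA)                      # noise accumulates forever
    digital = nearest(digital + random.gauss(0, SIGMA))   # minor noise rounded away;
                                                          # a sign flip here = "major noise"

print(f"analog drift after {STEPS} steps:  {abs(analog - true_value):.2f}")
print(f"digital value after {STEPS} steps: {digital}")
```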
Why Identifiability Enables Error CorrectionGoing back to the example of telephone networks, imagine a digital communication system messing up: one of the relays misclassified the signal and as a result the wrong signal was generated and send further down the line. This error is fixable - we know where something went wrong. Maybe the classifier is broken, those specific cables aren't isolated well enough, whatever it may be, there is no reason to succumb to despair: We surely know of a different set of relays and cables that should work out, starting from where we left off. And if they don't, we just figure out when something went wrong once again and progress slowly but surely through this maze.
Now compare this to the analog case - there is no clear "mistake", no single intervention point we can leverage as checkpoint. At any relay, noise accumulates, never a lot but always a little. Our best shot would be resending the signal down a completely different path, from start to finish, and just hope for a miracle.
The same logic applies to token-based reasoning. When you pick token X instead of token Y, that's a discrete, inspectable decision. The model reflecting on its own output can ask: "Was X the right choice here?" It can explore: "What if I had said Y instead?" The discrete structure gives you a search tree of token choices rather than an undifferentiated continuous trajectory.
This is what backtracking looks like in chain-of-thought reasoning. Thinking back, recognizing that a particular choice led somewhere unproductive, and exploring alternatives from that branch point.
This doesn't mean error correction is easy. Understanding when something went wrong is genuinely hard - you're often stuck, not knowing what the correct path forward looks like, and you just pick a direction to explore. But the point is that it's feasible. Because the error isn't gradual and diffuse, you can hope to identify it and therefore fix it. You can progress slowly but surely through the maze, backtracking when needed, rather than praying your next blind attempt happens to stumble on the right path.
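As a minimal picture of what that search tree buys you, here is a toy backtracking search over discrete choices (the tokens, the dead-end rule, and the goal are hypothetical stand-ins for reasoning steps, not anything a real model does):

```python
# Toy sketch: discrete choices form a search tree, so a reasoner can
# backtrack from an identifiable bad branch instead of restarting from scratch.

TOKENS = ["A", "B", "C"]

def is_dead_end(path):
    # Hypothetical check: suppose any path ending in "B" then "C" fails.
    return "".join(path).endswith("BC")

def is_goal(path):
    return path == ["A", "C", "A"]

def search(path=None):
    path = path or []
    if is_goal(path):
        return path
    if len(path) >= 3 or is_dead_end(path):
        return None          # identifiable failure point: backtrack from here
    for tok in TOKENS:       # try alternatives from this branch point
        result = search(path + [tok])
        if result is not None:
            return result
    return None

print(search())  # -> ['A', 'C', 'A']
```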
The Bandwidth CounterargumentYou might object: if the reasoning chain is short enough, noise won't accumulate to catastrophic levels, and the bandwidth cost of tokenization dominates. There's truth here.
Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.
The question is what happens over long reasoning chains, where hundreds or thousands of sequential steps compound whatever imperfections exist.
Tokens Are More Expressive Than They SeemThe expressiveness of discrete symbols shouldn't be underestimated. Each token is one of ~100k options, but carries deep meaning and (more importantly) rich connections to other tokens. A single sentence can express a vast space of concepts through composition.
This matters especially when you consider how well-adapted the model is to these specific embeddings. It's like how semantically transparent languages (Chinese characters, for instance) allow readers who know only basic components to understand novel compounds: they grasp the building blocks and how they interact, letting them extrapolate beyond their training data.
Theoretically, LLMs can operate on embeddings outside the vocabulary. But whether patterns learned in an absurdly high-dimensional, interpolative space (as opposed to a deliberately finite, highly structured set) will deliver the same degree of generalization is questionable.
The Tokenization Counter-CounterargumentHere's another objection to my thesis: current tokenization actively introduces errors.
Arithmetic with subword tokens is a mess. Proper nouns get split arbitrarily. Multilingual text suffers. The discretization bottleneck doesn't just filter noise - it also destroys information that continuous representations would preserve.
This is real. But I'd argue it's an indictment of current vocabularies, not of discreteness itself.
The 100k tokens of English-optimized BPE weren't designed for reasoning. They encode accidents of orthography, not conceptual primitives. The solution isn't to abandon discreteness - it's to have better discrete representations (this isn't GOFAI, I promise).
Toward Minimal Compositional VocabulariesThere's a lot of cruft in languages like English that reasoning-focused training shouldn't have to care about. In fact, I'd argue we should go smaller, not larger. Strip from 100k to 1000, or even 500 tokens. Force the LLM to develop its own "language" with the only constraint of immense compositionality enforced through the small vocabulary.
A compositional vocabulary with 500 primitives might express the same concepts as 100k English subwords while being far more training-efficient: new relations learned through training no longer have to pass through multiple layers of abstraction like choice of words, writing style, etc., but can take place directly on the reasoning level, allowing for stronger generalization.
Escaping The Local OptimaA big problem is the local optima that LLMs enjoy through pretraining - whether they will themself ever truly escape is questionable but much needed to reach greater heights, literally. Instead of trying to come up with difficult reward schema, the carrot, I believe the stick is needed: simply continuously (the irony) strip them of the tokens they have grown so accustomed to.
By removing a growing set of tokens from the categorical distribution during rollouts, while keeping the model exposed to optimization pressure, the LLM will naturally get forced out of its local optimum, since we are literally destroying it. It will try to represent the now-blacklisted tokens and their associated concepts through patterns of other tokens, finding suitable representations that allow it to keep its vast pretraining knowledge intact (hopefully).
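Mechanically, this could be as simple as masking the blacklisted ids out of the logits before sampling during rollouts; the sketch below is my reading of the idea, not an existing training setup, and the vocabulary and blacklist are made up:

```python
import torch

# Sketch: during rollouts, a growing blacklist of token ids is masked out of
# the categorical distribution before sampling, so the model must route
# around the concepts it used to express with those tokens.

def sample_with_blacklist(logits: torch.Tensor, blacklist: set[int]) -> int:
    masked = logits.clone()
    if blacklist:
        masked[list(blacklist)] = float("-inf")   # blacklisted tokens get zero probability
    probs = torch.softmax(masked, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Hypothetical usage with a tiny 10-token "vocabulary":
logits = torch.randn(10)
blacklist = {2, 5, 7}          # tokens stripped so far; grows over training
token = sample_with_blacklist(logits, blacklist)
assert token not in blacklist
```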
I'm well aware that this sounds like AI Safety hell, but if you believe current models can't just yet cause Armageddon, I hope you can see how interesting the results of such training would be, regardless of whether they count as "failed" or "successful", whatever those two may even mean in this context.
Summary (of claims)- Continuous latent representations accumulate noise that is semantically entangled with the signal.
- Discrete tokenization acts as regeneration: minor noise gets rounded away, major noise becomes a localized, inspectable error.
- Identifiable errors enable backtracking and search. Diffuse errors require restarting from scratch.
- Current tokenization is suboptimal for reasoning, but the solution is better discrete vocabularies, not abandoning discreteness.
Discuss
The false confidence theorem and Bayesian reasoning
A little background
I first heard about the False Confidence Theorem (FCT) a number of years ago, although at the time I did not understand why it was meaningful. I later returned to it, and the second time around, with a little more experience (and finding a more useful exposition), its importance was much easier to grasp. I now believe that this result is incredibly central to the use of Bayesian reasoning in a wide range of practical contexts, and yet seems to not be very well known (I was not able to find any mention of it on LessWrong). I think it is at the heart of some common confusions, where seemingly strong Bayesian arguments feel intuitively wrong, but for reasons that are difficult to articulate well. For example, I think it is possibly the central error that Rootclaim made in their lab-leak argument, and although the judges were able to come to the correct conclusion, the fact that seemingly no one was able to specifically nail down this issue has left the surrounding discussion muddled in uncertainty. I hope to help resolve both this and other confusions.
Satellite conjunctionThe best exposition of the FCT that I have found is “Satellite conjunction analysis and the false confidence theorem." The motivating example here is the problem of predicting when satellites are likely to collide with each other, necessitating avoidance maneuvers. The paper starts by walking through a seemingly straightforward application of Bayesian statistics to compute an epistemic probability that 2 satellites will collide, given data (including uncertainty) about their current position and motion. At the end, we notice that very large uncertainties in the trajectories correspond to a very low epistemic belief of collision. Not uncertainty, but rather high confidence of safety. As the paper puts it:
…it makes sense that as uncertainty grows, the risk of collision also grows. Epistemic probability of collision eventually hits a maximum, and past that maximum, as relative uncertainty rises, the epistemic probability of collision decreases. This decrease is called probability dilution, and it has an odd implication. Since the uncertainty in the estimates of the trajectories reflects the limits of data quality, probability dilution seems to imply that lowering data quality makes satellites safer. That implication is counterintuitive in the extreme [4–8]. As a rule, lowering the data quality makes any engineering system less safe, and to claim that ignorance somehow reduces collision risk seems foolish on its face.
And yet, from a Bayesian perspective, we might argue that this makes sense. If we have 2 satellites that look like they are on a collision course (point estimate of the minimum distance between them is 0), but those estimates are highly uncertain, we might say that the trajectories are close to random. And in that case, 2 random trajectories gives you a low collision probability. But reasoning this way simply based on uncertainty is an error. You certainly should not become more confident that 2 satellites are safe, just because you added random noise to the measurements.
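A toy Monte Carlo makes the dilution visible (illustrative numbers only, not the paper's model): here the point estimate says the satellites meet head-on, yet the computed epistemic probability of collision shrinks as the measurement uncertainty grows.

```python
import numpy as np

# Toy sketch of probability dilution: the point estimate of the 2-D miss
# vector is zero (a direct hit), and we compute the epistemic probability
# that the true miss distance falls below a collision threshold as the
# measurement uncertainty sigma grows. All units and numbers are made up.

rng = np.random.default_rng(0)
THRESHOLD = 1.0   # "collision" if miss distance < 1 (arbitrary units)

for sigma in [0.5, 1.0, 5.0, 20.0, 100.0]:
    miss = rng.normal(0.0, sigma, size=(100_000, 2))        # uncertain 2-D miss vector
    p_collision = np.mean(np.linalg.norm(miss, axis=1) < THRESHOLD)
    print(f"sigma={sigma:>6}: epistemic P(collision) ~ {p_collision:.4f}")
# As sigma grows, P(collision) falls toward zero, even though the point
# estimate is an exact collision and nothing about the satellites changed.
```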
As it turns out, this problem pops up in a very wide variety of contexts. The paper proves that, in most practical inference problems, any epistemic belief system will assign arbitrarily high probability to some false proposition, with arbitrarily high frequentist probability. Indeed:
There is a fixed proposition of practical interest that is guaranteed or nearly guaranteed to be assigned a large epistemic probability, regardless of whether or not it is true… What the false confidence theorem shows is that, in most practical inference problems, there is no theoretical limit on how severely false confidence will manifest itself in an epistemic probability distribution, or more precisely, there is no such limit that holds for all measurable propositions.
Moreover, there is no easy way around this result. It applies to any “epistemic belief system”, i.e. any system of assigning probabilities to statements that includes the seemingly basic law of probability that P(A) = 1 - P(not A). This occurs because of this very fact: if we cannot assign a high probability to A, we must assign substantial probability to not-A. In this case, if we cannot be more than, say, 0.1% sure the satellites will collide, then we have to be at least 99.9% sure that they will not collide.
However, there is one way out (well, one way that preserves the probability rule above). This result is restricted to epistemic uncertainty, that is, uncertainty resulting from an agent’s lack of knowledge, in contrast to aleatory variability, that is, actual randomness in the behavior of the object being studied. A Bayesian might object vehemently to this distinction, but recall the motivating example. If 2 satellites are on a collision course, adding noise to the measurements of their trajectories does not make them safer. However, giving each one a random push from its jets increases the actual variation in their paths, likely pushing them away from the previous point estimate of a collision, and thus does make them safer.
The practical take-awayIt is inappropriate to conflate subjective uncertainty with actual variation when reasoning under uncertainty. Doing so can result in errors of arbitrary magnitude. This phenomenon can occur, for example, when a key estimate relies on a highly uncertain parameter. Saying, “I don’t know much about this subject, but it would be overconfident to say this probability is less than 10%” sounds safe and prudent. But your lack of knowledge does not actually constrain the true value. It could in reality be 1/100, or 1/10,000, or 1/1,000,000. This arbitrarily severe error can then be carried forward, for example if the probability in question is used to compute a Bayes factor; both it and the final answer will then be off by the same (possibly very high) ratio.
Perhaps an alternative way of phrasing this fact is simply to say that uncertainty is not evidence. Bayes theorem tells you how to incorporate evidence into your beliefs. You can certainly incorporate uncertainty into your beliefs, but you can't treat them the same way.
Example 1: Other people’s (lack of) confidenceBack in the day, Scott Alexander asked the following question in reference to the claim that the probability of Amanda Knox’s guilt is on the order of 1 in 1,000, when LW commenters had given an average of 35%:
Out of one thousand criminal trials in which the Less Wrong conventional wisdom gave the defendant a 35% chance of being guilty, you would expect to be able to correctly determine guilt nine hundred ninety nine times?
In fact, komponisto was entirely correct to be confident. 35% did not represent a true evaluation of AK’s probability of guilt, based on all of the available evidence. Many commenters, by their own admission, had not thoroughly investigated the case. 35% simply represented their epistemic uncertainty on a topic they had not investigated. If every commenter had thoroughly researched the case and the resulting average was still 35%, one could ask if komponisto was being overconfident, but as it stood, the commenters’ average and his number represented entirely different things, and it would be rather meaningless to compare them.
One might as well survey the community on whether a coin will come up heads or tails, and then, after I flip it and proclaim that it definitely came up heads, accuse me of being overconfident. After all, a hundred rationalists claimed it was 50/50! (Or, to take a slightly less silly example, a coin that is known to be biased, where I'm the only one who has researched how biased, or in which direction.)
Example 2: Heroic Bayesian analysisIn Rootclaim’s most recent COVID origins analysis, the single strongest piece of evidence is “12 nucleotides clean insertion,” which they claim is 20x more likely in lab leak (after out-of-model correction). Specifically, they say it is 10% likely under lab leak, based on the following “guesstimate:”
In the past, FCSs have been added by substitution rather than insertion, but it is not difficult to do it by insertion. We cannot be sure of the exact considerations of the lab researchers who added the FCS, such as investigating the role of proline. Therefore, we assign it a 10% probability.
So, they do not have any evidence that, across all cases when researchers might try to add an FCS to a virus, they use a “12 nucleotide clean insertion” 1 time out of 10. They simply provide a guess, based on their own lack of knowledge. This is exactly the error described above: For all they actually know, the true frequency of this behavior could be 1/1,000, an error of 100x, or it could be even worse.
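To see how directly that error propagates, here is the arithmetic with placeholder numbers (the denominator is a hypothetical value chosen only so that the guessed 10% reproduces the claimed 20x factor):

```python
# Simple arithmetic sketch (numbers purely illustrative): an error in a
# guessed likelihood propagates straight into the Bayes factor.

p_evidence_given_other = 0.005    # hypothetical P(evidence | alternative hypothesis)

guessed  = 0.10   # "guesstimated" P(evidence | favored hypothesis)
true_val = 0.001  # what the frequency might actually be, for all we know

bf_guessed = guessed / p_evidence_given_other    # 20x: "strong evidence"
bf_true    = true_val / p_evidence_given_other   # 0.2x: evidence points the other way

print(bf_guessed, bf_true, bf_guessed / bf_true)  # 20.0 0.2 100.0
```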
It is simply not valid to claim strong evidence for no other reason than your own certainty. Doing so is perverse to the extreme, and would make it trivial to make yourself completely confident by ignoring as much evidence as possible. The only valid conclusion to draw from this lack of knowledge is that you are unable to evaluate the evidence in question, and must remain uncertain.
So what should you do instead?I believe that, essentially, avoiding the FCT (at least when epistemic uncertainty is unavoidable) comes down to explicitly including uncertainty in your final probability estimate. The satellite conjunction paper offers a solution which bounds the probability of collision, and which can be proven to actually achieve the desired safety level. The key fact is that we are not claiming an exact value for P(collision) or its complement. The example from the satellite paper is based on “confidence regions,” i.e.
…a confidence region represents the simple assertion that we are 1 − α confident that the true value of θ is somewhere inside Γα(x). Any sets containing Γα(x) inherit that confidence; all other sets accrue no positive confidence… for any false proposition, i.e. any set A such that A∌θ, the probability that said proposition will be assigned a confidence of at least 1 − α is less than or equal to α
For the specific satellite case, the solution is to compute uncertainty ellipsoids for each object, and check if they overlap at the point of closest approach. In this case, the probability of collision can indeed be limited:
so long as one performs a manoeuvre whenever the two uncertainty ellipsoids intersect, the rate at which collisions occur over a large number of conjunctions—i.e. the operational aleatory probability of collision—will be capped at α′ = 2α.
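A toy one-dimensional version of that rule (a sketch, not the paper's ellipsoid analysis) shows the bound holding in simulation, and in fact holding with plenty of room to spare, since the 2α figure comes from a conservative union bound:

```python
import numpy as np
from scipy import stats

# Toy 1-D sketch: simulate conjunctions that genuinely are collisions,
# measure each object's position with Gaussian noise, and maneuver whenever
# the two (1 - alpha) confidence intervals overlap. The fraction of true
# collisions that the rule fails to flag stays below 2*alpha.

rng = np.random.default_rng(1)
alpha, sigma, n = 0.05, 1.0, 200_000
half_width = stats.norm.ppf(1 - alpha / 2) * sigma   # half-width of each interval

true_pos = rng.uniform(-10, 10, n)                   # both objects at the same true spot
meas_a = true_pos + rng.normal(0, sigma, n)
meas_b = true_pos + rng.normal(0, sigma, n)

missed = np.abs(meas_a - meas_b) > 2 * half_width    # intervals don't overlap -> no maneuver
print(f"fraction of true collisions not flagged: {missed.mean():.4f} (bound 2*alpha = {2*alpha})")
```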
These tools are, in some sense, "crude" ways of representing belief, as they do not reflect the full richness of the axioms of probability theory. And yet, they may be of great practical use.
ConclusionIt is perhaps quite surprising that attempting to force your beliefs to respect the seemingly obvious law of probability that P(A) = 1-P(A) can result in errors. Not just that, but it is in fact guaranteed to result in errors that are arbitrarily bad. Moreover, contrary to what “pure” or “naive” Bayesianism might indicate, there is in fact a very significant, practical difference between subjective uncertainty and aleatory variability. Nevertheless, the results seem to be on very solid mathematical ground, and once we dive into what these results are really saying, it makes a lot more intuitive sense.
Additional linkshttps://en.wikipedia.org/wiki/False_confidence_theorem
https://arxiv.org/abs/1807.06217
Discuss
A Proposal for a Better ARENA: Shifting from Teaching to Research Sprints
TLDR
I propose restructuring the current ARENA program, which primarily focuses on contained exercises, into a more scalable and research-engineering-focused model consisting of four one-week research sprints preceded by a dedicated "Week Zero" of fundamental research engineering training. The primary reasons are:
- The bottleneck for creating good AI safety researchers isn't the kind of knowledge contained in the ARENA notebooks, but the hands-on research engineering and research skills involved in day-to-day research.
- I think the current version of ARENA primarily functions as a signaling mechanism in the current state of the AI safety ecosystem.
- This post was written using Superwhisper, with Gemini then asked to transcribe the result into a blog post format. I have done some light editing. Some of this might look like AI slop. I apologize, but I think the value of this post is pretty good as is, and it is not a good use of my time to refine it further.
- I am not saying that ARENA is not valuable. ARENA is obviously valuable, and deserves the high reputation it has in the AI safety ecosystem.
- Why am I well positioned to think about this? In the past year and a half, I have participated in a large slew of AI safety schemes, both as a participant and as a teacher or lead. This includes ML4Good, both as a participant and as TA, SPAR as a participant, AI Safety Camp as a project lead, ARENA as a participant and as a TA, Algoverse both as a mentor and as a participant, BlueDot both as a participant and a facilitator. Furthermore, I am currently a research manager at MATS so I'm getting a close-up view of what skills are required to do high-quality AI safety research.
- The views expressed here are my own and do not necessarily reflect the views of MATS.
My primary concern is that the skills learned in the current ARENA program are not the bottleneck for the AI Safety ecosystem.
- Skills Mismatch: AI safety research involves self-directed coding (with LLMs), making decisions about experimental design, setting up infrastructure, research taste, etc. In contrast, ARENA exercises are typically small, well-contained, and have a black-and-white correct answer with pre-provided unit tests, removing the crucial element of uncertainty and decision-making present in real research.
- Signaling vs. Upskilling: Based on my experience, the biggest benefit of the current program to the AI Safety community appears to be as a signaling mechanism for other programs. Two pieces of evidence: first, many ARENA participants have already done AI Safety research before participating; second, at least four ARBOx participants (ARBOx is a 2-week compressed version of ARENA) are doing elite AI safety fellowships (1 Anthropic Fellows Program, 2 LASR Labs, 1 MATS).
- Scalability Bottleneck: ARENA is fundamentally not scalable due to its reliance on TAs and the hands-on teaching model. MATS, for example, is scaling much faster (200+ people/year) compared to ARENA (approx. 75 people/year at 25 participants/cohort, three times/year).
The alternative program structure would be a four-week sequence of mini-research sprints, with each week having a different AI safety theme, plus an introductory Week Zero. This aligns with the advice from researchers like Neel Nanda on upskilling in mechanistic interpretability—study the relevant material, then start mini-sprints.
Application Process: ARENA Knowledge as a Prerequisite
The content of the existing ARENA notebooks could be a prerequisite for the new program.
- Automated Testing: The application process would involve a test of familiarity with the content, possibly using automated quizzes or Anki flashcards created by the Arena team. This removes the marginal value provided by TAs (which will only diminish as LLMs improve at explanation) and frees up staff time.
- Standard Selection: Other standard selection criteria used by programs like SPAR, Algoverse, and AI Safety Camp would still apply.
| Week | Theme/Focus | Goal |
| --- | --- | --- |
| Week Zero | Research Engineering & Soft Skills | Dedicated training on modern tools, workflows, and non-technical skills. |
| Week 1 | Mech Interp Sprint | Focused one-week research project. |
| Week 2 | Evals Sprint | Focused one-week research project. |
| Week 3 | Fine-Tuning/RL Model Organisms Sprint | Focused one-week research project. |
| Week 4 | Choice/Software Engineering Sprint | Participants choose a deep-dive topic or contribute to open-source packages. |

Week Zero: Dedicated Training
The goal for this (optional) week is to teach the actual skills needed for research.
- Engineering Workflows: Go over fundamental modern research engineering skills. This includes setting up infrastructure, estimating GPU needs, making use of LLMs for coding (e.g., ClaudeCode), sandboxing with Docker, etc. One version of this includes just spending a day or two going through all the (relevant) tips in Tips and Code for Empirical AI Safety Research.
- Broader Skills: Teach skills essential for a successful researcher that often get neglected:
- Theory of impact exercises and AI Safety strategy
- Project management frameworks
- Reflection structures (e.g., what went well that day)
- Applied rationality exercises
- Collaboration and conflict resolution
A potential alternative for Week 4 is a pure Software Engineering Week, where participants contribute to open-source packages in collaboration with open-source maintainers. This is an excellent way to teach hard software engineering skills and build up "taste" for good software, which is a growing concern with the rise of LLM coding.
Partnership and Mentoring
To maximize value, ARENA could partner with research programs like MATS.
- One-Week Mentors: Existing participants from the partner programs (e.g., MATS) would serve as one-week project mentors during the themed sprints.
- Benefits for Mentors: This provides low-stakes, time-boxed practice for mentoring, which is a hard skill to learn—a significant advantage over a three-month-long project lead role (which I had in AI Safety Camp and found challenging).
- Benefits for Participants: Increased networking, access to active researchers, direct guidance, and more relevant research directions set by the mentors.
Any new structure should embed the good practices of programs like ML4Good to create a positive learning environment, a sense of community, and a safe space for both personal and technical growth. For details, see my post about it.
ScalabilityThe new model is significantly easier to scale:
- Ditching TAs/Teaching: Eliminates a primary bottleneck for the current program.
- Flexible Structure: The format is fundamentally flexible; anyone can pick it up and run a mini-sprint. The core ARENA team can provide structures and instructions to enable bottom-up organization globally. For example, one can imagine something like "a one-week Mech Interp sprint at this university; the application is 30 questions based on 4 notebooks, with an 80% pass rate".
- Online Feasibility: There is no fundamental reason this cannot be done online, similar to how Apart runs global hackathons.
One potential downside is the reduced incentive for the ARENA team to create new ARENA-style notebooks (e.g., for control research). However, since the team is already heavily bottlenecked on time for new notebook development, this might not be a real disadvantage. Both systems suffer from the same staffing problem.
Another downside is the implication that this has to replace ARENA. This could just be a separate parallel initiative. However, I do actually believe that the ARENA team and ARENA participants are better served moving more to a model I am suggesting.
I am actually struggling to think of downsides. I asked Gemini and here are its thoughts along with my counters:
- Exclusion of true beginners/high barrier to entry.
- ARENA already has high barrier to entry.
- Risk of superficial projects and high failure rate.
- Primary goal is upskilling. High failure rates are normal in research anyway
- Inadequate mentoring depth in only 1 week
- Primary aim isn’t to provide deep mentoring. Furthermore, if the mentor-mentee relationship is positive during the one week, there is nothing stopping them collaborating in the long run.
- Gaming the automated prerequisite system
- Yes, that is a risk, but not a fundamental flaw of the idea. Furthermore, I am hoping that selection effects play a smaller role in this new format (where gaming of entry requirements matters most, I think), because there are concrete outputs that can instead be judged. E.g. if somebody cheats their way in but produces bad outputs, then people running other programs should be wary. In other words, if people say they attended this program, they should be obliged to share their outputs to help other programs evaluate them properly.
- Dilution of brand, if people globally run ‘bottom-up’ versions
- This is an already-solved problem. Only the ARENA team can use the official brand, and anyone who runs a version of the ARENA program independently is obliged to acknowledge ARENA and to make explicit that they are not running an official ARENA program.
I asked Claude to review this post and it came up with some other downsides. Again, Claude's comments followed by mine.
- What if a one-week sprint produces low-quality research that hurts participants' confidence?
- This is part of learning experience, and we should have structures and advice in place to help deal with this.
- What if rapid churn of different themes prevents deep learning?
- This matches the pace of ARENA
- What if removing the "ARENA experience" community-building aspect (by going online/distributed) reduces downstream networking value?
- I am not suggesting removing the in-person experience.
If you think this is a good idea, then the obvious question is how do we make this happen? Unfortunately, I probably don't have the time to make this happen, but I'd definitely like to be involved. Possible next steps include:
- Forming a core organizing group and writing a grant to CG. [Interesting side note: Gemini hallucinated here and instead recommended an ACX Grant, which I did not provide in its context. But my instructions to Gemini did mention that I wanted a LessWrong post].
- Trying a pilot iteration on a low-cost basis, such as at the EA Hotel.
- Fleshing out details. I know many of the ideas above are preliminary, but there's enough of a skeleton to get things going.
If you have any feedback or want to get involved, please share in the comments.
Discuss
Are there any extremely strong arguments against worrying about Acausal extortion?
The topic of acausal extortion (particularly variants of Roko's basilisk) is sometimes mentioned and often dismissed with reference to something like the fact that an agent could simply precommit not to give in to blackmail. These responses themselves have responses, and it is not completely clear that at the end of the chain of responses there is a well defined, irrefutable reason not to worry about acausal extortion, or at least not to continue to do so once you have contemplated it. My question is if there is a single, reasonably clear reason, which does not depend much on the depth to which I may or may not have descended into the issue, which would be more persuasive than proposed reasons not to pay the 'pascal's mugger'. If there is one, what is it?
Discuss
AI Incident Forecasting
I'm excited to share that my team and I won 1st place out of 35+ project submissions in the AI Forecasting Hackathon hosted by Apart Research and BlueDot Impact!
We trained statistical models on the AI Incidents Database and predicted that AI-related incidents could increase by 6-11x within the next five years, particularly in misuse, misinformation, and system safety issues. This post does not aim to prescribe specific policy interventions. Instead, it presents these forecasts as evidence to help prioritize which risk domains warrant policy attention and deeper evaluation.
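To give a feel for the kind of trend extrapolation involved, here is a minimal sketch (not our competition model) that fits an exponential curve to yearly incident counts and reads off an implied five-year growth factor. The counts below are placeholders, not real AI Incidents Database figures.

```python
# Minimal sketch: fit an exponential trend to yearly incident counts and
# extrapolate five years ahead. Counts are placeholders, not real data.
import numpy as np

years = np.array([2019, 2020, 2021, 2022, 2023, 2024])
counts = np.array([40, 60, 90, 140, 210, 320])  # hypothetical yearly incident counts

# Fit log(counts) = a * year + b, i.e. exponential growth.
a, b = np.polyfit(years, np.log(counts), 1)

def forecast(year: int) -> float:
    """Predicted incident count for a given year under the fitted trend."""
    return float(np.exp(a * year + b))

growth_5y = forecast(2029) / forecast(2024)
print(f"Implied 5-year growth factor: {growth_5y:.1f}x")
```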
Discuss
6’7” Is Not Random
I walked into the grocery shop the other day and saw a soda can labeled “6-7” and “Brainrot.”
I stared at it, confused. My partner took a deep look and explained that these are the biggest words on the internet right now. “Brainrot” is self-explanatory, she said, but “6 7” apparently comes from a viral video of a guy yelling “6 7! 6 7!” referring to someone of that height. So, naturally, it became a thing.
I lingered on it for a while. I’ve seen a lot of videos where people yell stuff. So have you. But why this? Why 6’7”?
Here is why: It isn’t random. It is a specific signal from a generation living through a structural collapse.
The world used to be mainly linear. You could get a job, find a partner, buy a house, make babies, and live happily ever after. You barely needed a mortgage. Elite education was accessible to almost everyone. And most importantly, 10% extra effort correlated with 10% extra pay.
Today, the math is different. Even I, post-MSc in Electrical Engineering, sometimes wonder what it was all for. I watch people who barely graduated get hundreds of millions thrown at them because they have, at worst, “an idea,” or at best, a product that caught a quick tailwind and will eventually go to zero.
I have to credit the USA for at least putting fraudsters like SBF and Elizabeth Holmes in prison. But here in Sweden? You can blow up $15 billion, cause 8 deaths, put $200 million in your pocket, and disappear. It makes me wonder why I didn’t just stick to my childhood room, playing video games and eating Whopper Pepper meals. I loved that. And it was actually pretty cheap.
But I was lucky. I got to experience the tail end of linearity.
Growing up today means navigating total non-linearity. Someone spewing garbage or having sex with “1000 guys in under 24h” can make more money in a week than a doctor or professor makes in a lifetime.
The inflation isn’t just monetary; it’s existential. Especially for men.
Are you under 6 feet? Worthless.
6’2”? Mediocre.
Unless you are 6’7”, don’t even bother.
“6 7” is not what we, the older privileged generations, interpret as a senseless meme. It is a scream. It is a massive protest by Gen Z against a world where nothing matters anymore.
Let me be clear:
- In the 1990s, a “middle-class job” was enough to buy a house. Being “6 feet” was enough to be tall.
- In the 2020s, the middle has been hollowed out.
- To be “wealthy” now requires a crypto-exit or a tech IPO (The Economic 6’7”).
- To be “famous” requires global virality (The Social 6’7”).
- To be “attractive” requires filters and surgery (The Aesthetic 6’7”).
In other words, it is the perfect expression of Heightflation. It represents a world where the requirements to “win” have drifted beyond the reach of human effort.
The reason we don’t get it is because we lost our adulthood. We aren’t the captains of the ship; we are egocentric dinos clinging to our wealth, disgracing the younger generations and discrediting everything they do as “senseless.”
But hey, maybe we can make some money on this. Let’s sell some cans of soda with “6 7” printed on them.
6 7.
Discuss
Finding high signal people - applying PageRank to Twitter
Cross post, adapted for LessWrong
Several challenges add friction to finding high signal people and literature:
- High status may negatively impact signal.
- Exploration can only be done at the edges of my network, e.g. Twitter thread interactions or recommended people to follow, and is bottlenecked by the fact that I don’t know what I don’t know.
- Recommendations naturally bias toward popular people.
- Even recommended people from a curated following list may be important but low signal, e.g. Sam Altman’s priority is promoting OpenAI products.
- Validation - is this information valuable? - is a manual vibe check.
We reapply PageRank to Twitter, which naturally weights “important” people higher. If Ilya Sutskever follows only three accounts, a puppy fan page among them, perhaps we should sit up and take notice. The approach is very similar to the existing LessWrong work analyzing AI discourse on Twitter/Bluesky, but instead of categorizing p(doom) discourse, we want to find "important" and “underrated” people.
Approach:
- Find important people in the AI Twitter sphere via PageRank
- Find the “underrated” people with low follow count from step 1.
- Find consistently “high signal” people from step 1 via an LLM.
Six 'famous' users were used to bootstrap PageRank, chosen for their high-quality public contributions. After each round of convergence, the top-ranked handle not yet included is added (organizations removed), repeating until we have ~200 "core" handles. Finally, we cut the list down to the top 749 and rerun one last time. The full table with additional columns can be found at https://thefourierproject.org/people. A sketch of the bootstrap loop follows the list of seed users below.
- Andrej Karpathy, @karpathy, Eureka Labs/Education
- Dwarkesh Patel, @dwarkesh_sp, Various topics podcasts
- Lilian Weng, @lilian_weng, Thinking Machines/AI safety
- Chris Olah, @ch402, Anthropic/AI safety
- Dylan Patel, @dylan522p, SemiAnalysis
- Eric Jang, @ericjang11, 1X Robotics
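Here is a minimal sketch of that bootstrap loop; it is not the exact pipeline behind the published table. get_following() is a hypothetical stand-in for a Twitter API call, and organization filtering plus the final 749-handle rerun are elided.

```python
# Minimal sketch of the PageRank bootstrap loop. get_following() is a
# hypothetical stand-in for a Twitter API call.
import networkx as nx

def get_following(handle: str) -> list[str]:
    raise NotImplementedError  # would call the Twitter API here

seeds = ["karpathy", "dwarkesh_sp", "lilian_weng", "ch402", "dylan522p", "ericjang11"]
core = set(seeds)

for _ in range(200 - len(core)):  # grow the core one handle per round
    # Build the follow graph: an edge follower -> followed, so PageRank
    # mass flows toward accounts that well-connected people follow.
    g = nx.DiGraph()
    for handle in core:
        for followed in get_following(handle):
            g.add_edge(handle, followed)

    ranks = nx.pagerank(g)

    # Add the highest-ranked handle not already in the core, then repeat.
    new_handle = next((h for h, _ in sorted(ranks.items(), key=lambda kv: -kv[1])
                       if h not in core), None)
    if new_handle is None:
        break
    core.add(new_handle)
```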
"Influential" People
Let’s look at the results! Unsurprisingly, Sam Altman is rank 0, clearly a center of gravity in the field, with other famous people trailing. How do the original 6 rank?
- Andrej Karpathy (1)
- Dwarkesh Patel (17)
- Lilian Weng (46)
- Chris Olah (34)
- Dylan Patel (174)
- Eric Jang (93)
We can also see some well-known LessWrong members in the untruncated ~60,000-handle list (rankings differ slightly).
- Ajeya Cotra @ajeya_cotra (194)
- Daniel Kokotajlo @DKokotajlo (123)
- Eli Lifland @Eli_lifland (1153)
"Underrated" People
"Underrated" handles should have low rank and low follower count, resulting in a high discovery score, where:
\text{Discovery Score} = \frac{749 - \text{rank}}{\log_2(\text{followers})}
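In code, the score is a one-liner; the example numbers below are the @csvoss figures mentioned just after.

```python
import math

def discovery_score(rank: int, followers: int) -> float:
    """Higher for handles that rank well (low rank number) despite few followers."""
    return (749 - rank) / math.log2(followers)

print(round(discovery_score(111, 12_275), 1))  # the @csvoss example below, ~47
```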
For example, @csvoss is ranked 111, suggesting her importance at OpenAI while remaining “relatively” undiscovered at 12,275 followers.
"High Signal" People
However, high influence or high discovery may not imply high signal. Highly public people cannot tweet freely and everyone has personal interests. To estimate signal with a dystopian approach, we prompt Gemini Flash 3.0 Thinking with each user's 20 most recent tweets and the following:
Are these tweets high signal? I'm sifting for hidden gems on the internet. I am looking for any of the following:
- Classic wisdom that has stood the test of time, perceptive and insightful
- Novel ideas
- Beautiful story about humanity
- Genuinely funny jokes/observations
- News items opining on the quality of new developments
Critically, the author should not:
- Be too emotionally attached and fail to consider things objectively
- Display negative financial incentives such as overly selling product or audience emotional vindication
Credentials can be a weak signal, but must be relevant to the topic and are not obsolete or fake. It is extremely critical to discern popularity due to stoking emotions in polarizing topics versus pieces that are actually sharp and high quality. Strongly penalize pieces that rely on emotional manipulation. Think very critically - do not use a categorized points based rubric, consider each tweet holistically.
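For concreteness, here is a minimal sketch of the rating step. complete() and fetch_recent_tweets() are hypothetical wrappers around the Gemini and Twitter APIs, not real library calls.

```python
# Minimal sketch of the per-user signal check. complete() and
# fetch_recent_tweets() are hypothetical wrappers, not real library calls.
SIGNAL_RUBRIC = "Are these tweets high signal? ..."  # the full prompt quoted above

def rate_signal(handle: str, fetch_recent_tweets, complete) -> str:
    tweets = fetch_recent_tweets(handle, limit=20)   # 20 most recent tweets
    prompt = SIGNAL_RUBRIC + "\n\n" + "\n---\n".join(tweets)
    return complete(prompt)                          # model's verdict as free text
```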
Given Gemini's love for the term 'smoking gun', see if your taste aligns with its prompted interpretation of high signal, as ratings can vary over different runs. It is potentially useful as a starting point, and as an external perspective that forces reconsideration when I disagree with its judgement.
The Lack of Correlation Between Rank and Discovery
Lastly, we find that neither rank nor discovery score appears to correlate with signal. This may suggest that the LLM signal rating is somewhat random, though my personal spot checks (Andrej Karpathy, Lilian Weng, Sam Altman) seem about right.
Gaps in the approach and possible additional explorations
- The Twitter API is expensive, so using only ~200 core users results in a much sparser graph than reality.
- People who are less active on Twitter, e.g. Carl Schulman or academics generally, are greatly underrepresented.
- People may contribute outside Twitter, e.g. Jeff Dean has promotional responsibilities but also wrote a very valuable performance-hints guide that isn't captured here.
- The cluster is very AI-focused, so LessWrong members like thiccythot (finance/crypto) are too far away to be captured.
- The cluster is still a form of popular vote, just among "influential" people, and doesn't capture "true" contrarians.
In the future I would like to tackle related fields such as semiconductors, robotics, and security in their own separate clusters. If anyone has good bootstrap handles or field-prioritization suggestions, I would greatly appreciate it.
Thanks for reading!
Discuss
Moral-Epistemic Scrupulosity: A Cross-Framework Failure Mode of Truth-Seeking
Crossposted from https://substack.com/home/post/p-183478095
Epistemic status: Personal experience with a particular failure mode of reasoning and introspection that seems to appear within different philosophical frameworks (discussed here are rationality and Tibetan Buddhism), involving intolerance of felt uncertainty, over-indexing on epistemic rigour, compulsive questioning of commitments, and moralisation of "correct" thinking itself.
If you do this correctly, you’ll be safe from error.
This is the promise I’ve been chasing for years: across Sufi treatises, Western philosophical texts, Buddhist meditation halls, rationalist forums. Each framework seems to offer its own version: think rigorously enough, examine yourself honestly enough, surrender deeply enough, and (here my anxiousness steps in, with its own interpretations) you’ll finally achieve certainty. You won’t ask yourself whether you’ve got it all wrong anymore. You’ll know that you’re doing it right, taking the right path, being the right sort of person.
This isn’t what I believed consciously. I would, confidently, say that certainty is unattainable, and that it's better to keep one’s mind agile and open to new evidence. This seemed like the only respectable position to me. My behaviour, however, has suggested a less relaxed attitude: relentless rumination, nose-length scrutiny of my motives, endless reassurance-seeking through rumination and feedback, and an inability to fully commit to, but also fully leave behind, any framework I’ve encountered.
This has come with some heavy costs.
The price of over-indexing on epistemic rigour
Oscillating between frameworks
The primary consequence: a sort of analytical paralysis in my spiritual commitments. For a long time I saw this as avoiding premature foreclosure, but now I suspect that it actually comes from needing certainty before acting: needing to be as sure as possible that this is the right path, the right community, the right teacher, before committing fully (which for me meant not only practicing it seriously, as I did, but also explicitly excluding other possibilities). I've long had a deep anxiety about my beliefs, as if I had to protect them against every possible objection so that the ground itself isn't pulled away from beneath my feet, demanding a radical re-evaluation of my entire identity.
There was a lot of distrust. Staying in only one tradition and learning nothing about the others didn’t seem okay, because what if it wasn’t the right tradition for me, or what if sticking rigidly to any tradition is spiritually sub-optimal, or what if it’s actually an ancient cult that can only perpetuate its belief system by isolating its members? And yet, sampling all the meditation techniques, and choosing whichever technique seems to work best, also didn’t seem okay, because that would be mystical window-shopping and therefore disrespectful to the traditions (so went my thinking).
At the same time, I couldn’t allow myself to drop any tradition before going deep enough into it (in thought or practice), which would take at least ten years or so, because then I would be abandoning it prematurely too.
In my attempt to figure out the perfect balance of curiosity and commitment, self-reliance and trust for my teachers, respect for tradition and intellectual openness, I ended up oscillating between frameworks of judgment. Whichever one I was in, I doubted whether I should be there.
Monitoring for hidden moral failures
Another consequence has been an exhausting internal hyper-vigilance. It is like having an internal surveillance system that inspires constant terror: a terror of self-deceiving, of being hypocritical or insincere, of seeing myself as more morally motivated than I am, of doing good things for other people to virtue-signal rather than to do good. After a few hours of introspection, I can often (mostly) reassure myself of my sincerity; but in the moment, while standing before another person, I am confused, distrustful of myself, afraid that I will unintentionally lie to them.
Reading my recent series on introspection again, there is, at the moment, nothing significant I would change in what I have written (although I'm less sure that I’ve been embodying the spirit I described—the spirit of attending to moment-to-moment experience without judgment or analysis). I still think openness is crucial, still believe in seeking out criticism, still think self-deception is better avoided.
But now I am much more conscious of the ways this approach—of internally saying something about the self, wondering about what we’ve said, then circling back to it again and again, to paraphrase Escher—can harm someone. Rigorous self-examination can easily turn into unreasonable discomfort with uncertainty, and that, I suspect, contributes to the very mental noise I have been trying to cut through.
Here is the trap: you start examining your thinking to assure yourself that you aren't self-deceiving. The examination reveals spots of darkness, like shadows on an X-ray, so you look even closer. This reveals even larger, more ominous spots of darkness, and the longer you look, the more they expand. Confused by your inability to ascertain what they are, you turn your gaze on your examination, and see a space of darkness there too.
The issue isn't that you haven't thought about it carefully enough. It's that "carefully enough" has no floor, and you're falling. If you don't allow yourself to stop, you will never land on solid ground.
Practicing under relentless self-scrutiny
It has also made it much more difficult to pursue my spiritual practice.
Especially when I was more firmly rooted in Tibetan Buddhism, I worried a lot. Is this paralysing doubt, or inquisitive doubt? Are my reservations about a practice something I must overcome? Are my criticisms of a given teacher or practice justified, or are they just my ego seeking faults, my pride giving me excessive confidence in my thinking, my cultural conditioning with its fixed ideas about critical thinking and authority? Perhaps I would have to cultivate radical epistemic humility, abandon any notion of figuring things out myself, in order to learn anything. Sometimes I convinced myself of this, telling myself I was too confused to see anything clearly. Any extreme proved unsustainable.
From a journal entry during this period, during a two-month retreat: "My identity is composed of components that seem to be arbitrarily chosen. There are a few things I am deeply committed to, like my values, but everything else seems anything but necessary. I am painfully aware that for every pursuit I undertake, I can come up with a dozen reasons not to do it. The only thing that seems obviously, undeniably meaningful to me is my spiritual practice, and yet I am unable to focus on it or pursue it continuously because I don't fully believe in this tradition, and have paralysing doubts about which practice is right for me."
Reading this now, I can see the trap: the very fact that I couldn't be sure I was making the best possible choice made my commitments feel arbitrary. If I couldn't prove a choice was absolutely necessary, with impeccable justification, it felt illegitimate, lacking in weight, too easy to destabilise by new ideas or counter-evidence. I could get temporary relief through my self-criticism, but then my mind would find another weak spot, another objection I had to defend against.
Since no choice could meet my standard of justification, I was stuck in perpetual internal vacillation. The problem wasn't that I was aware of other traditions, but that it wasn't actually possible to choose one correctly enough for my taste. I practiced Tibetan Buddhism as traditionally as possible for five years, but despite my outer committedness, doubt haunted me throughout. The uncertainty I felt made my commitment feel dangerously provisional and fragile.
The altar of epistemic rigour
When I started learning about rationality, it was partly with the intention of engaging more carefully and honestly with the mystical traditions I’d been studying. But it soon morphed into something else. The new refrain became: am I being epistemically rigorous enough?
Am I engaging in motivated reasoning? Can I still treat spiritual experiences as epistemically valuable, or is that suspect too? Am I, given what I’ve learned about confabulation and introspective bias, thinking in the right way about my own thoughts and feelings? And so on and so forth. I replaced one idée fixe with another. And I haven’t dropped it yet: a part of me still believes I just haven’t thought about any of these topics carefully or long enough, that with some more thinking, I’ll figure it out.
As a result, I am in a rather complicated liminal space: multiple internal frameworks, all with their own value, but all of them suspicious of each other and of themselves.
If this isn’t scrupulosity, it may just be perfectionism: even when I try to keep my mind open, that becomes another task I must perform to the highest standard, requiring continual self-monitoring to ensure that I am keeping it open in the optimal way. Paradoxically, the very thing making me want to submit to the Tibetan Buddhist worldview seems to be what makes me so concerned with epistemics now. I suspect that, if I were Sufi, I would be wondering whether my thoughts are whispers from Shaitan.
Perhaps the problem isn’t the correctness of any given framework, but this compulsive, certainty-seeking relationship to frameworks themselves; this demand that, whatever system I engage with, I must be certain that I am thinking and acting and doing everything in the correct way—either thinking fully for myself, or being perfectly open and epistemically humble, or maintaining exactly the right balance between openness and self-trust.
But where did this compulsion come from? What made me believe that perfect thinking was possible, or that it should be my highest priority? To answer this, I’ve been examining something that I’ve been taking as self-evident: the value of truth-seeking itself.
For me, this meant asking myself: do I actually care that much about Truth, and Truth alone?
Questioning the primacy of Truth (or at least, of epistemic optimisation)
I am not so sure anymore. In principle, I would much prefer to have beliefs that accord with reality, and I strongly prefer not believing that an AI-generated video is real, and if God exists, I would much rather believe that He does, especially if Pascal was right and Hell exists. But I’m not so sure that any of these preferences are really about a cool-minded pursuit of Truth.
What I notice, instead, is something that seems more intuitive and independent of any conscious reasons: an urge to interrogate certain questions more deeply, a difficulty letting certain questions go, a need for my beliefs to not contradict each other logically and a sense of disturbance when they do, a desire to understand the world better so that I can change it in what I feel is a more positive direction, and a sort of fascination with the unknown within myself and within others, for which I have never needed justification.
And beyond this, I know I have multiple other values and desires. I want a heart that is open to others, and that I am deeply in touch with, rather than solely with the head; I want to experience beautiful sights and be moved by them; I want a mind that is wakeful and responsive to new information, but also not so weighed down by its own self-scrutiny; I want to reflect when it is time to reflect, and act decisively when it is time for that; I want to be able to connect with all kinds of people and be a positive force in their lives; I want to feel deep commitment to something in life.
This, to me, sounds much more like different facets of another overarching, more heartfelt desire: I want to live well. And it seems that my fascination with everything from philosophy and rationality to Sufism and Tibetan Buddhism has been a way of serving that. But the belief that I should want to have correct beliefs over and above all these other things, this sense that I cannot hold any belief that has not survived brutal interrogation, has been overpowering everything.
For the first time, I am wondering: if optimising for epistemic hygiene gets in the way of honouring these other values, perhaps it isn’t always Truth that should win.
What would it mean to live in this way, seeking truth in service of life? I don’t really know. It may well be a cognitive skill issue, something I can learn with time. Or perhaps my mistake is treating introspection and reasoning both as moral purity tests, not as means to a more expansive end. There’s probably a way of relating to them which is more effective and less psychologically costly; if there is, I would love to hear about it.
If you recognise this pattern in yourself, here's what I wish someone had told me earlier: you don't need to be sure that your motivations are pure before you can help someone. You don't need to purify yourself of every negative thought for others to be safe around you. You don't need to resolve every meta-level epistemological doubt before you can commit to something. You're allowed to spend some time thinking about it, find a position that's good enough (for now), and then act.
For now, I am starting with some questions to remind myself of what matters when I’m lapsing back into this pattern, and which I hope will be valuable for others with a similarly compulsive, perfectionistic thinking style:
Notice when you’re optimising your reasoning as a way to feel as secure as possible, not just to be incrementally less wrong.
Notice when your truth-seeking has become compulsive, something distressing, something you believe you must do to feel okay.
Notice when your demand for airtight arguments hinders you from taking action or committing to something—whether to a path, a profession, a partner, or anything else.
These observations won't end the pattern. Still, I have hope that naming it can lessen the power of the inner morality police, which currently patrols our thoughts and punishes us for our supposed mental transgressions. When you can see it for what it is—an anxiety-driven compulsion, not a necessary guarding of your conscience—you can see through the illusion that one more hour of research, one more year of thinking, one more decade of exploration, will finally give you the clarity you think you need. They won't.
Discuss
What do we mean by "impossible"?
(I'm reposting this here from an old Dreamwidth post of mine, since I've seen people reference it occasionally and figure it would be easier to find here.)
So people throw around the word "impossible" a lot, but oftentimes they actually mean different things by it. (I'm assuming here we're talking about real-world discussions rather than mathematical discussions, where things are clearer.) I thought I'd create a list of different things that people mean by "impossible", in the hopes that it might clarify things. Note -- because we're talking about real-world things, time is going to play a role. (Yes, there's not really any universal clock. Whatever.)
I'm listing these as "levels of impossibility", going roughly from "most impossible" to "least impossible", even though they're not necessarily actually linearly ordered. Also, some of the distinctions between some of these may be fuzzy at times.
Level 0. Instantaneously inconsistent. The given description contains or logically implies a contradiction. It rules out all possible states at some point in time, in any universe. People often claim this one when they really mean level 2 or level 3.
Level 1. Instantaneously impossible (contingently). In the actual universe we live in, the given description is instantaneously impossible; it rules out all possible states at some point in time. I think in most discussion that isn't about physics this isn't actually strongly distinguished from level 0.
Level 2. Non-equilibrium. The described system fails to propagate itself forward in time; or, if a system extended in time is described, it contains an inconsistency. This is one that people often actually mean when they say something is "impossible".
Level 3. Unstable equilibrium or possible non-equilibrium. The described system is not resilient to noise; it will not propagate itself forward in time unless exceptional conditions hold continually. This is another one that people often really mean when they say something is "impossible".
Level 4. Unachievable. The described system is unreachable from our present state -- it may make sense on its own, it may not be inconsistent with the way the world evolves in time, but it's inconsistent with the initial conditions that hold in the real world. Yet another one that people often mean when they say "impossible".
Level 5. Not "stably achievable". The only path from the current state to the described state is not resilient to noise and requires exceptional conditions to hold, possibly for an extended period of time. We might also want to require that in addition that the failure modes of such a path leave us worse off than we were before or somehow prevent the same path from being used again (so that you can't just try over and over for free).
I'm not sure that this is really complete; but, overall, the point is that when you say something is "impossible", you should think about whether you're actually talking about instantaneous impossibility, non-equilibrium / instability, or unachievability (and then yeah I've introduced some finer distinctions here).
Discuss
Where's the $100k iPhone?
I’m not quite sure how unequal the world used to be, but I’m fairly certain the world is more equal (in terms of financial means) than the world was, say, in the 1600s.
There are many things that enormous wealth allows you to buy that are out of reach for middle-class American consumers, like yachts, personal assistants, private jets, multiple homes, etc. You can frame these things in terms of the problems they solve, e.g. private jets solve the problem of travelling long distances, and multiple homes solve the “problem” of wanting to go on vacation more often. Note that the problems persist across wealth brackets; it’s just that the ultra-wealthy have different methods of solving them. While the ultra-wealthy might solve the problem of “vacation travel” using a private jet, those without extreme wealth might travel using commercial airlines. The ultra-wealthy introduce novelty into their lives by purchasing multiple homes, while everyone else goes on vacation and stays in a hotel or similar.
If you cluster goods and services based on the problem they solve, most seem to be available at a wide range of prices, with the higher end being around 2 or maybe 3 orders of magnitude greater than the lower end. For example:
- Food: Higher end would look like full-time personal chef and regular fine-dining, lower end would look like grocery store pre-packaged meals and cheap fast-food.
- Short-distance travel: Higher end would look like a full-time chauffeur in a custom Bentley, lower end would be public transport or an old car.
- Long-distance travel: Higher end would look like frequent private jet flights, lower end would be infrequent commercial airline travel
- Time-telling: ~$10 Casio through to a ~$100k Rolex
- Education: free public school vs ~$50k/year elite schools + private tuition
- Politics: In a democracy, voting is free. But if you have $100k+, you can lobby for areas of your choosing or sponsor political candidates.
- Healthcare: regulation makes this less clear than other cases, but you and I certainly can’t afford to fund medtech startups looking to cure aging.
I have low confidence that the gap between the lower and higher end of goods/services that solve a given problem is precisely 100x to 1000x, but I’m very certain that it spans orders of magnitude rather than 1.01x to 2x.
However, you do get some products/services which do not exhibit this behaviour: Even the wealthiest man in the world will use the same iPhone, play the same video games, read the same books[1], and watch the same movies[1] as a middle-class American. What gives? What is the underlying factor that means some goods and services have 1000x range in value, and others have basically no difference? Put plainly: Why are there no incentives to build a $100k iPhone[2]? I have some hypotheses about the constraints:
- Innovation-constrained: the modern iPhone is just at the limit of what’s technically feasible, and Apple wouldn’t know what to do with the money if you gave it to them. You could pay every Apple engineer and designer 100x their current salary, and they wouldn’t be able to innovate more than they currently are because they’re already at the limit of human innovation.
- Economics-constrained: producing large numbers of $1000 iPhones is just much more profitable than fewer numbers of $100k iPhones, so there’s no incentive to make a more expensive iPhone.
Option 1 seems a little fishy to me. I’m not sure, but I doubt there’s nothing Apple could do to put more high-end features in a $5k iPhone. $100k does start to push the limits, but $1k seems low considering the frequent critique of iPhones. And there’s a very wide range of non-technical features (different colours, different sizes, various customisations) that are present in other high-end luxury goods (cars, watches, jewellery, etc) but aren’t present in iPhones.
Option 2 also doesn’t quite sit right with me. It seems okay on its own, but I don’t see why iPhones would have this dynamic but not watches, furniture, cars, etc.
One third option that I think is closer to the truth:
- Things improve too frequently, so nobody’s willing to drop $100k on a hyper-specialised luxury iPhone that’ll be out-of-date in a year. If there’s a technical innovation in batteries or screens or cameras, then next year’s consumer iPhone will have the brilliant innovation but your luxury iPhone won’t.
Consumer technology regularly experiences game-changing innovations, much more so than cars/furniture/housing/etc. So under option 3 we’d expect to see that goods or services with a very small financial range (like iPhones) to be things that are undergoing rapid innovation and change (including but not limited to tech products).
This hypothesis seems to hold up: laptops, smartphones, internet services (like YouTube, Netflix, Gmail, etc.), and Starlink all have a very small financial range; for some of them you can't buy a more expensive variant even if you want to.
Our hypothesis also predicts that goods/services which experience little innovation should have a wider range of prices. One example of this would be gas-powered cars, which have been stagnant for several decades[3]. There have been improvements in comfort, efficiency, and safety, but I’d argue that these improvements stem more from increased demand for these features than from a previous inability to innovate on them. Most recent innovation in cars has come from changing preferences, rather than from static preferences being served by more thorough innovation.
I suspect there’s something deeper here which might be empirically useful. If hypothesis 3 is true, then we could use the range of prices for a given product/service as a measure of the innovation in that field: a narrower range of prices means there’s much more innovation. This could be impactful, as it gives a way to measure how much the economic market expects a field to be innovating. It would be foolhardy to try to sell a $100k iPhone if you suspect a competing product might be released that has significantly better features at a lower price tag.
The large price associated with luxury goods effectively serves as a bounty for someone to come along and innovate.
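A toy version of that measure (my construction, not something measured in this post) is the order-of-magnitude spread between the cheap and luxury end of a category, with a smaller spread suggesting faster innovation. The example prices are illustrative.

```python
import math

def price_spread(low_price: float, high_price: float) -> float:
    """Orders of magnitude between the cheap and luxury end of a category."""
    return math.log10(high_price / low_price)

print(price_spread(10, 100_000))  # watches: ~4 orders of magnitude
print(price_spread(430, 2_000))   # smartphones: well under 1
```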
We can also make some empirical predictions from this. Smartphones during the 2000s and 2010s underwent a lot of innovation, but they have become stale in recent years: every black slab of glass is nearly identical. I suspect luxury smartphone businesses take a while to boot up, and it’s possible that new battery technology could cause a resurgence in smartphone innovation. So I’d predict, in the next ~10 years or so, we’ll either see a luxury smartphone business start up[4] and stay in business, or we’ll see significant innovation in smartphones (something like 10x better batteries or other components).
Thanks to Jasmine Li, Jo Jiao, Desiree J, Jay Chooi, and the MATS 9 blogging and writing channel for reviewing drafts of this essay.
- ^
I'll note that books and movies might be a slight outlier here: centi-millionaires absolutely can pay or sponsor creative professionals to produce work that they specifically wish to exist. This might be explicit (contracting a director to make a specific film), but more likely this is implicit (funding a film studio, sponsoring an artwork, organising meet-and-greets with powerful donors).
- ^
If you google, you can find example $100k iPhones [1, 2], but I've never seen anyone who could afford a $100k iPhone actually using one.
- ^
You might debate the precise duration of the stagnation, but the past 20 years of gas-powered car innovation is certainly less impressive than the last 20 years of mobile phone innovation.
- ^
one that produces a smartphone ~10x more expensive than the most popular iPhone
Discuss
Taking LLMs Seriously (As Language Models)
This is my attempt to write down what I would be researching, if I were working directly with LLMs rather than doing Agent Foundations. (I'm open to collaboration on these ideas.)
Machine Learning research can occupy different points on a spectrum between science and engineering: science-like research seeks to understand phenomena deeply, explain what's happening, provide models which predict results, etc. Engineering-like research focuses more on getting things to work, achieving impressive results, optimizing performance, etc. I think the scientific style is very important. However, the research threads here are more engineering-flavored: I'd like to see systems which get these ideas to work, because I think they'd be marginally safer, saving a few more worlds along the alignment difficulty spectrum. I think the forefront of AI capabilities research is currently quite focused on RL, which is an inherently more dangerous technology; part of what I hope to illustrate here is that there is low-hanging capability fruit in other directions.
When you ask, what answers?
Base models are the best, most advanced statistical models humans have ever created. However, we don't use them that way. Instead, we use them as weight initializations for training chatbots. The statistical integrity is compromised by layering on additional training aimed at a variety of goals, trying to warp the statistical model into an intelligent assistant personality.
For example: if I ask ChatGPT to generate plausible news articles from 2030, I don't know whether I'm getting genuine extrapolation from the underlying statistical model, science fiction tropes, text optimized to sound helpful and plausible, etc.
The idea here is to treat LLMs with a more statistical attitude, creating more “handles” for useful and interesting statistical manipulations.
Instead of chat-training an LLM and then asking "Generate news from 2030", I'd try to train a more structured language model by labeling metadata explicitly. We can then do explicit probabilistic inference by _conditioning on_ metadata such as "media type: news article" and "publication date: 2030". This gives us more confidence that we're extrapolating patterns from news articles, rather than sci-fi or other sources.
Of course, since the LLM won't have seen any training examples from 2030, we still fundamentally don't know what we're getting. We don't have a detailed understanding of how LLMs generalize, and we don't have interpretability tools good enough to simply check. The LLM could still be drawing on sci-fi. Still, I think we stand a better chance of getting what we ask for with such an approach.
LLMs are still liable to be a big bag of heuristics, one way or another. To understand what heuristics are answering a question in my proposal, we'd have to understand how Transformers generalize. To answer the same question for ChatGPT, we'd still have to understand that, and we'd have to understand what variety of loss functions got applied when training for relevantly similar cases, and what order those different types of training occurred in, and what the ordering effects are.
The assumption I'm making is something along the lines of: Transformers (sufficiently scaled up) are good at whatever we train them to do. We don't exactly know how they generalize, but we do know that it works fairly well in some sense. (This isn't an assumption I want to make -- I'm ignoring several dangerous AI risks. I just think this direction seems quantitatively less risky than RL.) If we get what we train for, what should we train for? My general answer is legitimate deliberation, but here I offer some more specific and practical ideas. These ideas are all inspired by "coherence" in the probabilistic sense; the aim is to create an LLM with approximately coherent beliefs, rather than the usual slop.
My understanding of current practice instead looks messy. It seems to me like there is some low-hanging fruit available in the form of building more statistically serious LLMs. I aim to outline my ideas here, so that someone interested could run experiments. I'm thinking about the impact similarly to what I wrote in anti-slop interventions.
There's a sort of "honesty" which I think ties together many of the ideas here: not the direct honesty of the LLM telling you true things, but a sense that you know what you're getting. The LLM isn't trying to produce pleasing results (unless that's what you ask for). Such an LLM is a microscope into its dataset. You ask, and the data answers.
Partially Labeled Data
I think most people think about modern training something like this: “Base models aren’t trained to tell the truth; they’re just trained to predict the next token. That’s why we need to do chat training, anti-confabulation training, and so on. Generative pre-training just gives us a decent prior. We need to do reinforcement learning and other sorts of fine-tuning to get what we actually want out of it.”
While I don’t deny the success of that paradigm, it misses the fact that training to predict the next token is actually a type of truth; this is real data about the world. Base models know a lot about the world! They merely lack the “handles” to tell us what they know in a way we can easily use. I'm arguing that we can add those handles without sacrificing the statistical integrity of the base models.
The main idea here is to structure the data during training, explicitly labeling metadata such as author, date of publication, topic, etc where such labels are available, rather than throwing everything together in unstructured text.
Aren't these types of metadata already included in the text where available? What's the benefit? One can already prompt a base model with something like a title and a date, and generate from there. What's missing?
Clearly and consistently labeling in a particular way helps us know what we're getting, rather than guessing about how the LLM is interpreting things. We can try to prompt with a document header specifying the information we want to condition on, but a base model will infer things from the structure of the header, and also from the presence or absence of information. This makes prompting base models an opaque and challenging business.
More concretely, I'm imagining replacing the simple [context] -> [completion] pattern with something like [data], [query] -> [answer].[1] The "query" can be simple text-completion like before, or it can ask for the metadata, it can ask for translation into different languages, summaries, elaborations, lots of different things. You can train it to perform whatever transformation or inference you like; the framework is not very restrictive. You can even train it on enough different [query] strings that you hope to generalize to many queries you haven't trained on; still, you get the benefit of clearly separating out the query from the data, removing some of the ambiguity in how these queries are being interpreted.
You could implement this by modifying the neural architecture, or you could take a standard architecture and merely format the input in a standardized way.
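To make that concrete, here is a minimal sketch (in Python; the field names, delimiters, and query strings are illustrative placeholders, not a fixed proposal) of how metadata-labeled documents might be expanded into [data], [query] -> [answer] training triples:

```python
# A minimal sketch of turning metadata-labeled documents into
# [data], [query] -> [answer] training triples. Field names, delimiters,
# and query strings here are illustrative placeholders.

def make_triples(doc):
    """doc is a dict like {"text": ..., "date": ..., "author": ..., "media_type": ...}."""
    triples = []
    text = doc["text"]
    # Ordinary text completion: split the document at an arbitrary point.
    cut = len(text) // 2
    triples.append((text[:cut], "complete the text", text[cut:]))
    # Metadata queries: condition on the document, ask for a label.
    for key in ("date", "author", "media_type"):
        if key in doc and doc[key] is not None:
            triples.append((text, f"infer the {key}", str(doc[key])))
    return triples

def format_example(data, query, answer):
    # One flat string per example; a real system might instead use
    # special tokens or a modified architecture to separate the fields.
    return f"<data>{data}</data><query>{query}</query><answer>{answer}</answer>"

example_doc = {
    "text": "Regulators announced new rules for orbital debris today...",
    "date": "2024-03-05",
    "author": None,
    "media_type": "news article",
}
for d, q, a in make_triples(example_doc):
    print(format_example(d, q, a))
```

The point is only that the query is cleanly separated from the data, so the model never has to guess which part of the context is an instruction and which part is the document.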
Invertibility
In order to generate something from 2030, we also want to be able to invert a metadata query, generating the data from the given metadata. We can generate [date], ["infer the text"] -> [text] training examples easily by swapping around training examples of the form [text], ["infer the date"] -> [date], of course. Training to do this at scale should result in decent answers. However, the result may not be probabilistically coherent: especially for new examples, the inferred metadata given a text may not match the metadata used to generate the text.
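Continuing the hypothetical triple format sketched above, the swap itself is trivial; each forward metadata example yields an inverted one for free:

```python
# Sketch: derive an inverted training example from a forward one by swapping
# the roles of data and answer. Query strings are again placeholders.

def invert_triple(data, query, answer):
    # e.g. ("some text", "infer the date", "2024-03-05")
    #   -> ("2024-03-05", "infer the text given the date", "some text")
    assert query.startswith("infer the ")
    label_name = query[len("infer the "):]
    return (answer, f"infer the text given the {label_name}", data)

print(invert_triple("Regulators announced new rules for orbital debris today...",
                    "infer the date", "2024-03-05"))
```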
This is an illustration of the concept of "honesty" that underlies my proposal: if inverting a relation isn't probabilistically coherent, then in some sense the statistical model isn't revealing everything that it knows, and isn't giving you what you expect.
We could train a "prior" context, perhaps [], ["prior"] -> [text], on unstructured text (much the way that base models are normally trained). This would allow us to apply Bayes' Law (we'd have to approximate, of course) and penalize something like KL divergence between forward-inference and the Bayesian reversal. We might also want priors for specific types of labels (EG priors over authors, dates, etc).
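In symbols, a sketch of the coherence being asked for, with x a text and m a metadata value such as a date; p_θ(x) is the "prior" query, p_θ(m|x) the forward metadata query, and p_θ(x|m) the inverted query. The normalizer Z_θ(m) is intractable and would have to be approximated, and the direction of the KL term is a free choice here:

```latex
p_\theta(x \mid m) \;\approx\; \frac{p_\theta(m \mid x)\, p_\theta(x)}{Z_\theta(m)},
\qquad
\mathcal{L}_{\text{Bayes}} \;=\; D_{\mathrm{KL}}\!\left( p_\theta(x \mid m) \;\middle\|\; \frac{p_\theta(m \mid x)\, p_\theta(x)}{Z_\theta(m)} \right),
\qquad
Z_\theta(m) \;=\; \sum_{x'} p_\theta(m \mid x')\, p_\theta(x').
```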
I think the benefits of this sort of probabilistic consistency are real and practical. To illustrate the idea, I'll tell a story about my brother.
My brother uses Claude at work, sometimes, particularly when he's been given more than he can finish so fast-but-unreliable AI answers hit the right cost-benefit tradeoff. At one point, he experimented with using Claude to write emails. Claude has a "style" feature, which is supposed to allow users to easily train Claude to write in a desired style. If you push "Create & edit styles" you'll be prompted to upload a writing example, with an option to describe the desired style manually instead.
My brother uploaded some of his own emails as examples, and Claude generated a style summary for itself, which described my brother's emails as "clear, concise, professional language" and so on.[2] When my brother tested out the new style, the emails came out like they were written by Spock, not correctly imitating my brother's style at all.
I suppose that Claude is bad at this task because Claude hasn't really been trained to do it -- Anthropic is merely leaning on Claude's general competence and hoping that when Claude summarizes a style from text, that summary can later serve to prompt Claude to write in the correct style.
More examples:
- Summary and elaboration should invert each other. This helps enforce the accuracy of both summaries and elaborations.
- Asking for the author of a text should invert a query asking to generate text in that author's style. Sure, maybe the LLM can't imitate an author perfectly; but any imperfection that it can pick up on should be trained out (it constitutes knowledge the model has about the author's style which it isn't successfully applying to text generation). This is similar to the idea behind Generative Adversarial Networks.
- Translating between languages should invert well. This gives evidence that all the information being communicated was correctly captured.
(There may also be queries for which we'd rather optimize against invertibility in specific ways; for example, an anonymization query, for which we optimize against ability to reconstruct the author.)
Do you see why I want to call this sort of thing "honesty", even if I put scare quotes around it? Asking summary and elaboration operations to approximately invert each other doesn't force summaries to be accurate, but if the summary just doesn't make the original text more probable, then either the summary is bad or the elaboration is bad. Asking style summaries to make the example text they're based on more-probable-rather-than-less doesn't necessarily make the style summaries descriptively correct, but it does make them "correct" in a functionally very useful way: if you use the style later, the LLM will make the right inferences.
If you could trust AI to do this sort of thing well, then I imagine there'd be a lot of useful applications, particularly for AI self-prompting. It seems like a useful counter to AI slop: AI outputs which compose well with further AI operations, minimizing loss-of-intent.
Conditioning
Inversion doesn't yet enable the "news article from 2030" example I started with, since this involves at least two metadata tags (year of publication, and then something like publication type). Inversion only allows us to condition on one thing.
What's needed is the ability to treat queries as edges in a graphical model; variables in this graphical model are pieces of text (known or unknown), and the queries relate them to each other. For example, generating a text representative of a specific year and topic: the year and topic are known variables, the text is the unknown variable, and year and topic queries connect the text to the year and topic respectively.
However, students of graphical models will know that the conditional probability tables P(A|B) and P(A|C) aren't sufficient to determine P(A|B,C) (although one can use Naive Bayes to heuristically estimate). We need to know how B and C interact with each other.
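To spell out the Naive Bayes heuristic: assuming B and C are conditionally independent given A, the two single-edge conditionals and the prior over A do determine the joint conditional, but that independence assumption is exactly what real data may violate:

```latex
P(A \mid B, C) \;\propto\; P(B \mid A)\, P(C \mid A)\, P(A)
\;\propto\; \frac{P(A \mid B)\, P(A \mid C)}{P(A)}
\qquad \text{(assuming } B \perp C \mid A\text{)}
```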
It seems useful, then, to train the model to answer well with different combinations of information given. Conceptually, the training data is still just (context, query, response) triples; but now, we regard these as defining a graph, and train on small subgraphs rather than only individual edges. The architecture described so far could be used by packing extra information into queries; or, perhaps, the neural architecture could be designed to handle this natively. In either case, the aim is probably just to train the model to resample one variable at a time, given some combination of directly connected information; sampling from more complex graphical models can be built out of this.
In terms of the type of "honesty" I'm pursuing here, it is important to ensure that the model doesn't learn to infer anything from which pieces of information have been provided or left unknown. For example, if most text in the training data that has author labels comes from academic papers, then conditioning on author could cause the model to write in an academic style. This is an undesirable, opaque consequence of conditioning. Conditioning on a specific author should only cause the model to generate something that person might write.
That sort of statistical integrity might be well-served by an approach to training inspired by the EM algorithm, in which the missing labels and the correct weights are jointly inferred. It could also be useful to choose some important metadata to always estimate jointly with text generation; for example, if the topic estimate is incrementally revised during generation and feeds back into generation, this would give us a window into when the LLM settles on a topic.[3]
Transitivity
Another useful type of coherence is transitivity; for example, translating things through a chain of languages should approximately equal a direct translation from one language to another. Querying my father, and then querying the result's father, should be equivalent to querying for my paternal grandfather. Et cetera.
At a technical level, this means queries need to have a notation for compound queries; EG, [context], [query 1; query 2] -> [final result] should be trained to yield the same distribution as first sampling [context], [query 1] -> [intermediate result] and then using the output to sample [intermediate result], [query 2] -> [final result].
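In other words, the compound query's target distribution is just the marginal of the chain; a sketch in symbols, with c the context, m the intermediate result, and r the final result:

```latex
p_\theta\big(r \mid c,\ [q_1; q_2]\big)
\;\approx\;
\sum_{m} p_\theta\big(r \mid m,\ q_2\big)\, p_\theta\big(m \mid c,\ q_1\big)
```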
More generally, since the previous section discussed how we can compose operations together into complicated graphical models (not just linear chains), we would like the analogue of transitivity for these more complicated compositions: for any complicated probabilistic computation which we could compose out of multiple pieces, we'd also like to train a single query to approximate that result. This involves a corresponding augmentation of the query notation to represent queries on such graphical models.
Entropy Preservation
The term "mode collapse" or sometimes "modal collapse" is a longstanding term for generative models failing to capture the diversity of the training data; it was applied to GANs long before LLMs became prominent in generative AI.[4] The idea behind the term is that the data will have different "modes" (such as pictures of cats vs pictures of dogs), and GANs might learn to produce some of these modes but not others. The term "mode" in statistics refers to the globally most frequently occurring value in the data, but is often abused to refer to local maxima; "unimodal" data has one peak, "bimodal" has two peaks, etc. The term "mode collapse" operationalizes the missing diversity as missing modes. Mysteries of Mode Collapse perpetuated the use of this term in the LLM era.
A related term, "model collapse", refers to degradation specifically due to training AIs on AI-generated data. This exacerbates the problem, naturally, since the artificial data will already be missing modes, and the training process might miss even more.
I don't think "modes" are a great way of thinking of clusters,[5] or apply to LLMs very well;[6] furthermore, the disappearance of clusters is not the only way diversity can be lost. We could keep all the modes but lose variance in all of them.
Therefore, I prefer the more recent term entropy collapse.
Entropy collapse is a big problem for applying LLMs to creative writing in particular. I've gotten the same five story ideas repeatedly when re-sampling from Claude, even when I change the initial prompt. AI-written stories often contain similar scenes and ideas. Some of this is convergence to common cliches, which is already a form of entropy collapse, but a lot of it seems to be making new cliches ("AI-isms") due to AIs over-employing ideas repetitively (such as em dashes, which were fine before AI).
One of the most obvious examples is the predisposition AIs have for specific names, such as "Elara". I don't know of any specific research on AI-generated names, but it seems probable that the name variety is fine in base models (reflecting large biases that exist in the underlying training data, no doubt, but only as bad at name variety as the average author), and the entropy collapse all happens in later training steps.
Source: https://x.com/_lyraaaa_/status/1999386220733006046
I can't claim to have a solution to this problem. I don't know exactly what's going on inside frontier labs -- how much this is a really hard problem where all the obvious ideas don't work, vs an easily solved problem which they've only mitigated in dumb ways as depicted in the AI-generated comic above. I suspect that the inconsistent, multi-layered training objectives (which I identified at the beginning as "the problem" which this essay is organized around) contribute to this; in particular, it is easy to see why RL-ish optimization of any kind of broad approval score (like RLHF alignment training, or optimizing user approval) would push models toward a small set of responses. If there is any variation in how much an RL-ish loss likes different names, then an RL-ish training process should bias in favor of the highest-scoring names.
People have at least tried adding entropy-encouraging terms to the loss function; specifically, the loss function in RLHF and some other techniques commonly includes KL divergence from the base model as a component. This seems like a sensible training target, particularly if we suppose that the base model doesn't display entropy-collapse and the other components of loss don't introduce huge biases in cases where we want to maintain diversity.
However, this approach obviously hasn't been enough. Perhaps KL divergence is difficult to balance appropriately with other loss terms, since RLHF aims to significantly shift the distribution in _some_ cases. It's an unprincipled compromise between diversity and other objectives. Or perhaps the KL divergence commonly used is backwards for this purpose.[7]
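For reference, a sketch of the standard KL-regularized RLHF objective, where π_ref is the base model and β sets the strength of the penalty; the KL is taken in the direction described in footnote [7]:

```latex
\max_\theta \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\; \beta\, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```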
Besides modifying the KL divergence penalty, I would try adding a lot of training tasks which involve sampling from distributions. For example, queries asking for random numbers drawn from all sorts of statistical distributions. Baby names from specific years, with their accurate frequencies. Given a probabilistic program, generate possible outputs from that program with the right frequencies. Mimic a given stochastic grammar. Generate random words consistent with a given language's phonology.
With enough varied queries of that sort, the hope is, some sampling competence would generalize.
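A sketch of what generating such training data could look like, reusing the hypothetical triple format from earlier; the key feature is that the correct answer is itself a sample, so matching the training data requires preserving entropy rather than collapsing onto one canonical output:

```python
# Sketch: generate (data, query, answer) training examples whose answers are
# genuine samples from stated distributions, so the model is rewarded for
# matching the distribution rather than collapsing onto a single output.
import math
import random

def poisson_sample(rng, lam):
    # Knuth's algorithm for drawing a Poisson(lam) variate.
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def sampling_task_examples(n=3, seed=0):
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        lam = rng.choice([1.0, 2.0, 5.0])
        examples.append((
            f"a Poisson distribution with mean {lam}",
            "draw one sample",
            str(poisson_sample(rng, lam)),
        ))
        # A tiny made-up phonology: 2-4 consonant-vowel syllables.
        word = "".join(rng.choice("ptk") + rng.choice("aiu")
                       for _ in range(rng.randint(2, 4)))
        examples.append((
            "a language whose words are 2-4 CV syllables, C in {p,t,k}, V in {a,i,u}",
            "generate a random word",
            word,
        ))
    return examples

for d, q, a in sampling_task_examples():
    print((d, q, a))
```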
If base models really are better, then perhaps the overall approach outlined in this essay would avoid entropy collapse by default, since I'm trying to avoid the train-base-model-then-fine-tune methodology.
Or, perhaps, this is a problem which requires bigger innovations.
I motivated the problem in terms of AI creative writing earlier, but I think really it is a much more important problem. Bengio argues for its importance in an AI safety context here. If this problem were solved, then semantic entropy would become a very reliable way of detecting AI confabulations. For example, when AI invents plausible citations, it should be sampling from a high-entropy distribution which only knows what citations are like. When AI knows the right answer, the variation in its responses should remain semantically equivalent. This idea is already known to perform fairly well for detecting confabulations, but it can do nothing about cases where the distribution over answers is confidently wrong, as can be caused by entropy collapse. Reduce such cases, and you improve confabulation-detection.
Self-Knowledge
I've proposed to focus on the statistical merits of models: training on structured data to enable well-defined probabilistic queries rather than only text completions, training for various sorts of probabilistic coherence, and focusing on making sure the probability distributions are good rather than just the sampled answers.
However, in order to use an LLM like this as a powerful statistical model, we sometimes want to do other things than sample: we want numbers. We want to know conditional probabilities, correlations, mutual information, etc etc etc.
Of course, all of these things might be estimated by sampling or by other techniques. However, it's natural to train the model on such things, so that we can get approximate answers simply by asking. Done well, this would turn the model into a powerful microscope AI for its training data, and a powerful transparency tool for its own behaviors.
There is some danger in this suggestion: it can improve the situational awareness of the LLM. However, I think this is already happening anyway.
Paraphrase Invariance
If I query [person description], ["father of"] -> [father of that person], it should produce the same distribution as [person description], ["male parent"] -> [male parent of that person]. That is, the query input should have a special property called paraphrase invariance.
Paraphrasing could be learned as a query, [text], ["paraphrase"] -> [paraphrased text].
Paraphrase-invariance offers a specific sort of "no prompt-engineering" guarantee; it says that users can focus on what information to put into the query, and not worry too much about how that information is presented. This connects again with the concept of "honesty" that underlies all of these proposals: if the answer to a question depends on how you say it, then you're not just getting what you ask for.
Paraphrase-invariance strengthens previously-mentioned properties in a way I've been sweeping under the rug until now. Without paraphrase-invariance, transitivity only guarantees that the chained query [person], ["father; father"] -> [ancestor] is statistically the same as iterating the ["father"] operation twice. With paraphrase-invariance, both of these should also be statistically equivalent to the ["paternal grandfather"] query.
As an important example, when I say "paraphrasing should be transitive", what it really means is that ["paraphrase; paraphrase"] should be paraphrasable as ["paraphrase"] -- paraphrasing twice should be the same as paraphrasing once.
Similarly for the invertibility property: without paraphrase-invariance, we would only have that [text], ["summarize"] -> [summary] inverts [summary], ["summarize^{-1}"] -> [text]; relations invert their formal inverses. With paraphrase-invariance, ["summarize"] inverts ["elaborate"], ["child"] inverts ["parent"], etc.
As an important example, ["paraphrase"] should be symmetric, meaning that it is its own inverse: ["paraphrase^{-1}"] should be the same as ["paraphrase"].
Of course, choosing the right notion of paraphrasing then becomes a very significant question when training a model. I'm just giving illustrative examples of what a simple query format could look like and some simple paraphrasings; really doing this, you might not focus on single-word queries like this, you might not use ";" for chaining queries, etc.
(You could also represent a whole space of paraphrase operations, to respect the difference between terms that are equivalent in common parlance vs what's equivalent in technical language, etc. The notion of equivalence could be selected via an extra argument: [context], [query], [notion of equivalence] -> [answer].)
I think diffusion-based text models might be particularly good for handling paraphrase-invariance and some other properties discussed in this essay; we can try to enforce properties like paraphrase-invariance by trying to approximate stationary distributions of Markov kernels (paraphrasing would be a Markov kernel here, and with the transitivity property I mentioned earlier, I'm asking that it achieve its stationary distribution in one iteration).
I mentioned that the query input should be paraphrase-invariant. Some special designated queries should have paraphrase-invariant _contexts_ as well. These are purely factual questions; how something is phrased should be independent of what is true. EG, story comprehension questions ("Did Lord Draddington leave the house before Lady Sentangle?") should be paraphrase-invariant with respect to the context rather than only the query. (The ability to answer correctly should not depend on how the story presents the information, only on the information presented.) On the other hand, "Does the reader learn this before that" is not a paraphrase-invariant function of the text, since the same information can be presented in a different order.
Another example: if "style" descriptions are made paraphrase-invariant, this gives more reason to expect them to be descriptively accurate; they won't depend on the style of the style-prompt itself.
The choice of what should be paraphrase-invariant vs not serves as a use-mention distinction. When the context is non-paraphrase-invariant, it is mention-like: the answer can depend on the specific phrasing. When the context is paraphrase-invariant, it is use-like: the answer depends on the semantic content alone.
This is very similar to the role of purely functional computations in programming: when things are purely functional, you can evaluate the meaning of parts in isolation. You can rewrite part of a program in a way that returns the same semantic result, and it'll never change the overall result of the program. In the presence of side-effects, re-writes that preserve the returned value of a sub-computation can have other impacts which spoil the equivalence.
Paraphrase-invariance helps us to understand how the LLM is interpreting our query, in the following sense: we get some level of confidence that the model's interpretation isn't based on our phrasing; it treats queries in the same way so long as they mean the same thing. This is a notion of logical equivalence, so in principle paraphrase-invariance could require arbitrarily high intelligence; of course, in practice we have to use a tractable notion of paraphrasing. Still, this helps build confidence that the LLM has understood the query, in the same way that asking someone to rephrase gives you evidence that they've understood. It assures us that the knowledge of the LLM is not too brittle and context-dependent, in a specific sense.[8]
What about chat?
Modern LLMs are primarily used in chat interfaces. The sort of system I'm describing has a lot of other uses, so some UI innovation might be needed to best leverage the capabilities described here.
However, it can be applied to chat interfaces as well. The helpful AI assistant persona can be found within the latent space of authors, perhaps by conditioning an unknown author with desirable properties. The resulting agent might be given tools to sample from other queries this LLM is capable of, which gets most of the utility of these capabilities from within a chat interface.
What about safety?
I've tried to explain the safety benefits of the approach along the way, but a skeptical reader could see only capabilities here. How is this safety research?
One slogan might be: "AI alignment through coherence properties". The various desirable properties described so far can be related to probabilistic coherence and other desirable properties of probability distributions. (Calibration could well be added as a desirable property, for example.) These statistical properties help us to be sure that we're getting what we ask for, with our queries. This is a non-agentic notion of instruction-following: sampling with good adherence to what's been probabilistically conditioned on.
Alignment to the user's request is important, but it doesn't cover another important type of safety: refusal of dangerous requests (or, if not refusal, "safe completions" which do as much as can be done safely).
As mentioned previously, we can use conditioning to define a desirable AI persona. We might condition the author's goals to be human-aligned, for example. It would be interesting to put such a persona to the test, comparing its performance and alignment to RL-trained agents.
We could also use the model to filter its own outputs for safety, using a well-trained safety query. The robustness of model accuracy here matters a lot. AI enables convenient access to a whole new segment of the space of possible computations, which was much more difficult to access without AI; the problem of which queries to refuse isn't obviously simpler than the general question of what segments of the space of all possible computations are dangerous to freely instantiate here in our world.
I'm not thrilled about continuing to pioneer deeper into the space of all possible computations without a much clearer understanding of which computations are dangerous. However, if I were doing such pioneering, I've described some of the sorts of computations I'd be interested to explore.
- ^
What [context]->[completion] really means, of course, is [context]->[next token]; a completion is generated iteratively by appending tokens to the context. A similar trick needs to be employed for [data], [query] -> [completion], but it's an arbitrary implementation detail, so I wanted to exclude it from the main text.
One way would be [data], [query], [partial answer] -> [next token]. It would also be possible to avoid adding more arguments by packing [partial answer] into one of the existing arguments (this is more similar to what's done with current LLMs), EG [data], [query; partial answer] -> [next token].
It's also possible to inject chain-of-thought reasoning before producing the final answer, which further complicates the implementation details. The proposal avoids RL, but need not avoid chain-of-thought. However, it is worth noting that chain-of-thought should be paraphrase-invariant.
- ^
The style summary included "user examples" which were AI summaries of the actual example emails my brother provided, meaning that Claude had accidentally told itself to act like an AI summary of my brother -- whereas my brother's email style is all about examining different options and quantifying uncertainty, the AI-generated "user examples" didn't carefully examine anything & just confidently stated things.
Sure, all these problems might be fixable by improving Claude's prompt for generating these styles. However, I'm advocating for an approach that would be good at this sort of thing by default, based on general principles.
- ^
An imperfect window, to be sure; we cannot absolutely trust that the model relies entirely on the explicit topic estimate, even if this topic estimate is fed into the model during generation. If you ask the model to continue a text about cake while conditioning on the topic of machines, does the model steer the topic towards machines (like Golden Gate Claude) or does it continue writing about cake? Either behavior is possible, if incorrect labels aren't present in the training data -- it's an out-of-distribution case. However, this may be a property we can train for; it would provide a nice honesty property for this window into the model's thinking.
- ^
The Goodfellow et al paper introducing GANs used the term "Helvetica scenario". However, Goodfellow was using the term "mode collapse" by 2016. I'm unsure if the origins of "mode collapse" are within the GAN community or can be traced further back, however.
- ^
There can be hierarchical clusters; for example, handwritten digit generation has ten macro-level "modes" (the ten numerals), but each digit also has many micro-level modes (seven with or without a horizontal line through it; zero with or without a diagonal line; four with a closed or open triangle; etc). Operationalizing clusters as modes misses this structure, which can be represented by hierarchical clustering.
- ^
LLMs generate sequences token by token, making it unclear what constitutes a "mode" in this context; there is not a continuous space of outputs such that we can define local maxima readily.
- ^
KL divergence is very different depending on which direction you measure it in. KL(p,q) is high if p has modes which q lacks, but does not significantly penalize q having modes which p lacks. I believe RLHF penalizes KL(trained model, base model), meaning it is punishing new behaviors not present in the base model, but not significantly punishing the model for dropping behaviors from the base model.
- ^
Since paraphrase-invariance is a constraint on the model, it acts in much the same way as additional training data. Indeed, it serves to propagate data from one case to another: any data with a paraphrasing of a given query becomes data for that query as well.
Objective Questions
Epistemic Status: I wrote this a few days ago while moved by the trolly spirit where I could say "I'm just asking questions, bro!" and smirk with a glint in my eye... but then I showed a draft to someone. It was a great springboard for that conversation, but then the conversation caused me to update on stuff. In general, almost no claims are made in this post, except the implicit claim that these questions are questions that aliens and AIs would naturally wonder about too (if not necessarily with the same idiosyncratic terms... after all, the alien Kant is probably not named "Kant"). I already don't agree that all these questions have that status, but I'd have to go carefully through them to change it, so I'm just throwing this into the public because it DOES seem like decent ART for causing certain conversations. (Also, my ability to track my own motivations isn't 100% (yet (growth mindset!)) and maybe I just am subtweeting Eliezer very gently for using "people have different opinions about art" as his central example for why moral realism isn't a safe concept?)
Infohazard Status: I would not have published this prior to covid because it talks about layers and layers of "planning to spiritually defect" that I'm pretty sure happen in human beings in DC, but which children shouldn't be taught to do too early or the children will be seen badly as they grow up and it would be bad... but since America learned almost nothing from that, and things like that are going to happen again and more by default, it seems like maybe I should put out mildly infohazardous ideas that almost everyone with a brain and the right to vote in the strongest nuclear power on Earth should eventually think anyway, if the ideas, upon social normalization, seem like they could help a democratic polity not make giant stupid fucking errors over and over again.
Robinson Crusoe Basics
Do I have any medium term goals with convergent instrumental utility for other longer term goals or preferences or values of mine, and if so, what are they?
What kind of universe am I in, and are there any currently unknown regularities that might be easy to learn and apply to my planning in a way that makes my plans better?
What kinds of rare but dramatic changes might happen in the environment that are relevant to my goals, that I could prepare for somehow? Floods? Tornadoes? Volcanoes? Earthquakes? Civil wars? Rainy nights? Dry spells? ...Huh, all those are bad... what about Gold Rushes? or Whale Falls? or other Opportunities?
What should I do today?
Should I tend to move around at random exploring new places and roaming in pursuit of new opportunities, or should I stick to a certain area... and if I should stick to a certain area: which area, and why?
Do Predators, Prey, Or Peers Exist?
Is there anyone nearby who wants the same things I want, and if so does this similarity make us natural allies (wanting the same public good, perhaps) or natural competitors (wanting the same scarce resource)?
Will I predictably die, and if so, when, and do I care about anything after that point?
Is my body repairing itself a little bit every day, and/or fueling itself, to fight against entropy, and accomplish work, and thus in need of certain inputs, and if so, what are the best inputs to get the best repair for the least effort?
Is anyone watching me very carefully who wants to kill me and take my stuff (especially my daily food, or my body itself which might count as food to some agents, or my long term stored food, or the system that generates my food over time)?
Is there anyone around who understands the Kantian Kingdom Of Ends such that they and I can mutually trust that we will both refrain from pre-emptive murder to avoid the case where someone nearby might not be irremediably deontically unreliable (like rabid dogs for example) and thus need to be pre-emptively eliminated from proximity as a prudent safety precaution? (And likewise for all violations of predictable agentic intent (less than specifically "murder") that might make me or someone else want to pre-emptively murder the other just to cause a thief or litterbug to stop stealing or littering or whatever?)
Can I speak the same language as anyone nearby who understands Kant stuff such that I can ask them if they'll murder me for littering before they pre-emptively try to do so in the belief that my littering puts me in the same general category as a rabid dog? Those Kantians are actually kinda scary sometimes, but I think they "should" be willing and interested in talking, as one of their core convergent moral rules... right?
Are there mindless robots around who don't understand the Kantian Kingdom Of Ends but act in accord with (some idiosyncratic version of it?) out of pure habit and instinct anyway?
If the distinction between mindless Kantians and mindful Kantians is predictably meaningful, which kind of Kantian would I prefer to be around and why?
The Social Contract
Are there any large groups of de facto Kantians following good protocols, aligned with ideal Kantian rules as they understand them, that include admission of new agents? If so, would I achieve more of my plans by joining them and following their protocols? How do I evaluate the protocols? Are some protocols better for me than others? If a group has written the protocols down is that a good sign or a bad sign? If a group is hiding its protocols is that a good sign or a bad sign?
Do any of the Kantian protocols I might be expected to follow unavoidably contain protocol defects or not? If I join a group that involves some self sacrifice, that seems obviously bad, but maybe it could be good overall, and in prospective expectation for me? OTOH, if such microtragedies are inherent to some protocols, how is the self sacrifice allocated and how are the self sacrificers held to their now-regretted promise?
Do some protocols predictably lead to large rare bad events like stock market bubbles, or lab-generated plagues, or civil wars, and can I avoid those if I wanted to avoid them, or should I take them for granted and be prepared to exploit the chaos for my own ends? Can group protocols be changed to remove these predictable aggregate stupidities or not? Can they be changed by me or not?
Does the group have methods for soliciting feedback and improving the protocols, like large forums, suggestion boxes, and/or voting and so on? If so, are these feedback methods more liable to be hacked and used to cause decay by predatory agents or groups, or are the feedback methods relatively hardened and coherently safe?
Is it possible that if I join a seemingly Kantian team with a seemingly good protocol that I am actually joining a "social trap" like a team that is required by a half-checked-out billionaire's overworked mid-level scheming lackey to stack rank its members, such that the team lead needs new recruits every year to label as socially defective and then painfully eject, to offer a simplistic (and not actually just) proof to the half blind higher level systems that the team itself is able to label members as socially defective and painfully eject them... such that this could happen to ME?
If the above failure mode happens to other members of the team how should I evaluate whether it serves my longest term goals? Is it scary because I want a better justice system in general because that is likely to conduce to almost any goal I have in the long run, or is it (temporarily?) relieving because I'm "on the inside" of a (temporary?) clique of unjust safety from a default lack of ambient justice? Maybe I should feel relief, but also guilt?
Can Kantian protocol-following groups apply mind reading to see what my private mental answer to the above question is? If so, would I rather be in a group with that power, or a group that lacks such mind reading powers? Would that protect me on net from protocol-non-compliant members, or might it prevent me from joining because maybe I simply can't think thoughts and have feelings consistent with appreciation of some of the protocols that are used by some groups? Or what? Maybe there are other options?
Self Modification
Supposing I want to be in a Kantian protocol following group that CAN apply mind reading to ensure faithful adherence to the group's protocols... in that case, do I maybe have a power that is sort of complementary to this, to modify my mind to produce the necessary thoughts and feelings? If I don't have this power now, can I acquire it?
Assuming I can modify my repertoire of feelings and reactions, should I? How would I know whether to do so and will I be able (or willing) to undo it later if I made an error?
Should I self modify in a deep way to have shorter term self modification powers that maximize my safety and ability to achieve my idiosyncratic goals by performing lots of self modifications very easily, such as by simply talking to other people, and slowly sharing their protocols by default due to mere proximity?
Granting that I might be in a default state of value plasticity, should I put some of my values or feelings or thoughts or preferences behind a firewall where short term self modification simply isn't possible, and apply security mindset to the protection of this core spiritual material?
What if I notice someone who doesn't change to share the protocols of whoever they are near? Should I report them to the authorities? Should I want to be near them or far from them? What if their refusal is a refusal to water down their protocols to worse protocols that conduce to worse aggregate outcomes for relatively objective economic reasons? Or what if their refusal is a refusal to accept personal sacrifices for the good of the group? Or what if they just seem to lack any self modification powers at all?
What if everyone is self-modifying a little bit pre-emptively out of fear, and doing so in reaction to other people's supposed (but not actual) tendencies and preferences, and preference falsification becomes very very common leading to insane random pursuit of insane group goals that almost no one actually wants? Also, what if this doesn't self correct? How wrong or crazy could things get? Might I end up believing that The Kim Family has magic powers, and genuinely grieving the death of unelected authoritarian rulers who have enslaved me?
In general: which kinds of self-modification tendencies should I be looking for in others and how should I react to them?
Given that I'm pondering how to meta-mind-read others and filter my association with them thereby, how does that interact with other agent's likely thoughts in the same vein about me?
Paranoia
In asking questions like these, am I slower than normal, faster than normal, or about in the middle? If I'm slower than normal, where can I find advice on "getting up to speed, socially speaking" that isn't just another social booby trap (like online engagement bait, or a multi-level-marketing scheme, or yet another lying politician asking for my vote, or a fully standard advertisement that is full of PhD level manipulation efforts, or a job offer at a stack ranking company, or whatever)? If I'm about in the middle, then why don't I see more people taking the game theory of meta-cognition and group protocol design way more seriously... are smart people hiding or are they defecting by default in secret, such that EVERYTHING around me is a social booby trap? If I'm faster than normal, isn't that just really unlikely to be true? Because doesn't narcissism occur in ~6.2% of the population, AND wouldn't it be narcissistic to think I'm faster on the uptake than normal, AND SO believing that I'm faster or more clever than the 93.8th "percentile" means I'm just likely to be self deluding based on priors, right?
But if we're taking the possibility of being a very unique person VERY seriously... if I'm in a very historically and sociologically weird situation... am I in a simulation? If so, are the simulators my natural allies or my natural enemies or something weirder? Do they follow any Kantian protocols? How would I even know? What might they want that I could help with, and what might they want that I should try to hinder, and what would cooperation or non-cooperation lead to? Can I break out of the simulation? If or when it reaches a natural end, will I also end, or will I keep going in The Dust, or what? Will I "wake up" and then... uh... be paid for my participation by my demiurge(s)? Or am I their slave? Is this a dangerous way to think?? Have I taken my meta-cognitive ally-weighing schtick too far if I'm trying to use social survival heuristics to judge hypothetical simulators???
Palate Cleanser
Lol! Are my attempts to imagine Objective Questions that almost any agent could or should end up asking getting too silly now? <3
It was fun trying to think of "Objective Questions"!
Happy New Years!
May 2026 be fruitful for you and those you care about! <3
Cancer-Selective, Pan-Essential Targets from DepMap
Introduction
Back in June, I proposed that it would be a good idea to look for broad-spectrum cancer treatments — i.e. therapies that work on many types of cancer, rather than being hyper-specialized for narrow subtypes.
There’s nothing fantastic about this notion. After all, some of the oldest cancer treatments (chemotherapy and radiation) are broad-spectrum, and while in some cases it’s possible to outperform them, cytotoxic chemo and radiation are still mainstays of treatment today.
The first thing I proposed was a systematic search for “pan-essential” targets — genes which, when knocked out in cancer cells, cause cell death, but which don’t kill healthy cells.
What I hadn’t realized at the time is that it’s not necessarily tractable to screen genetic knockouts of true “healthy cells”, which don’t grow well in culture. You might be able to do something with short-lived patient-derived samples, or organoids, but we don’t have big public datasets of these.
What we do have is DepMap, an atlas of genetic cancer dependencies (i.e. genes without which cancer cells die). It spans 2119 human cancer cell lines, and also has 13 “normal” cell culture lines and 45 fibroblast lines as controls.
Now, granted, an immortalized cell culture line is not that representative of a healthy cell in a human body, but it’s a starting point.
We can look for gene knockouts that have a strong growth-inhibiting effect on the majority of cancer lines, with little inhibiting effect on the “normal” lines, and rank them by selectivity. We can then filter this list of genetic targets further for their likely druggability, rule out the ones that are already known targets of cancer drugs, and end up with a list of targets worth investigating.
Claude Code
This is a straightforward data analysis project that in principle I could have done myself — but it would have been a lot of work, and I might not have gotten around to it.
Instead, I had the bot do it.
And what a bot it is!
I am, as the kids say, “feeling the AGI.”
I’d coded with LLM assistance before, but I was always trying to mostly write stuff myself or at least carefully read every line; I wasn’t truly letting the machine off the leash. This time, I did; every line of code in my pan-cancer repo is 100% Claude-generated.
The experience is addictive, to the point of being a bit scary. It would be all too easy to let a lot of programming skills atrophy and let the bot do everything for me.
On the plus side, it means there’s no longer such a thing as a “someday project”, at least when it comes to code. If it can be done with a laptop and public data (or your own data), you can do it right now, frictionlessly.
Dean Ball’s list is an inspiration:
In the past month I have:
Automated invoice creation, sending, and tracking;
Created scientifically realistic simulations of hydrological systems as a learning project;
Automated my research process of gathering and analyzing all proposed state legislation related to AI (though this is no substitute for reading the bill for anything I am going to write about);
Orchestrated a complex chain of autonomous data collection, processing, analysis, and presentation steps related to manufacturing and industrial policy;
Created a machine-learning model capable of predicting US corn yields with what appears to be very high accuracy (the proof will be in the pudding), based on climate, soil, Earth-observation satellite, and other data sources;
Replicated three machine-learning research papers and modified the approach to suit my own research ends;
Performed hundreds of experiments with Byte-level language models, an emerging interest of mine;
Created an autonomous prediction market agent;
Created an autonomous options trader based on a specific investment thesis I developed;
Built dozens of games and simulations to educate myself about various physical or industrial phenomena;
Created an agent that monitors a particular art market in which I am potentially interested in making an acquisition;
Created a new personal blog complete with a Squarespace-style content management system behind the scenes;
Other things I cannot talk about publicly just yet.
If you (like me) have a long wish list of things to do, and are bottlenecked on the gumption or programming skill to do them all, Claude Code is a gamechanger.
So far, it’s been two days, and I’ve already had it make a little mood-tracker app and then overhauled all my web-app-based to-do list and journaling/logging systems and replaced them with a Claude Code-based system on my laptop. And this pan-cancer project.
I notice that I am much less interested in toggling between a bunch of different web apps when Claude Code can just extract their data via an API and let me combine it all seamlessly.
I also notice that I care, suddenly, about efficient and repeatable processes in my daily life, in the way that software engineers always have and I historically haven’t.
You know how playing a video game for several hours makes you see the world around you in terms of the moves in the game? Like playing a first-person shooter and seeing everything as a target? After using Claude Code for hours, everything looks like a protocol to be optimized and made repeatable. Down to “what’s the most efficient set of moves for changing a diaper?”
Your systems can now be your systems. Configured how you like em. Exactly the way command-line wizards have always rolled their own systems, except that now it’s trivially easy.
“But is the code trustworthy?”
In the case of this pan-cancer project, I’m inclined to believe it’s okay. I read through the python scripts and they looked sensible, and (more importantly) I’ve been looking in more detail through the top-50 list of targets and those look sensible.
This is, of course, not enough assurance for a serious piece of software. For that, we’d need a bunch of tests, at a minimum. But as a personal experiment, I’m satisfied. Please do let me know, however, if you see anything fishy in the repo.
(As always, my blog posts, and anything else I sign my name to, are written by me, not by any AI. When I share model output it will be clearly labeled as such.)
The Results
Cell line "dependency" on a gene is measured by the abundance of CRISPR-modified cells relative to wild-type; a negative value for a gene means that the cells become less abundant when that gene is knocked out. This could be due to the knockout causing cell death, slower proliferation, or both.
Here, we define a "pan-cancer essential" gene as one with a dependency score (effect size) of -0.5 or lower in at least 50% of cancer lines, and a gene that isn't essential for normal cells as one whose dependency score never drops below -0.3 in any normal cell line. This gives us a list of "cancer-specific pan-essential genes."
Selectivity is defined as (cancer dependency - non-cancer dependency). More selective is better; it means that targeting that gene harms cancer cells more than it harms normal cells.
We look at the top-50 “cancer-specific pan-essential” genes, ranked from most to least selective.
We also cross-reference these top genes with the OpenTargets database to get an estimate of their “druggability.”
OpenTargets has, for each target, a list of criteria that point towards it being druggable with small molecules (things like “high-quality pocket”), with antibodies, or with PROTACs — as well as whether there are approved drugs or clinical-stage drug candidates with that target. We assign a simple scoring rubric (for each drug/biologic type, we assign the target a score of 1 for any structural druggability criteria, 2 for clinical-stage drugs, 3 for approved drugs).
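Roughly, the analysis reduces to a filter-and-rank over the DepMap gene-effect matrix. Here is a minimal sketch of that core step; the file names, the orientation of the matrix, and the sign convention for selectivity are assumptions on my part, not the actual code in the repo:

```python
# Minimal sketch of the core DepMap filter-and-rank step. The CSV names,
# the orientation (rows = cell lines, columns = genes), and the control
# line list are placeholders, not the actual pipeline.
import pandas as pd

effects = pd.read_csv("CRISPRGeneEffect.csv", index_col=0)
normal_ids = set(pd.read_csv("control_lines.csv")["ModelID"])

cancer = effects.loc[~effects.index.isin(normal_ids)]
normal = effects.loc[effects.index.isin(normal_ids)]

# Pan-cancer essential: dependency score of -0.5 or lower in >= 50% of cancer lines.
pan_essential = (cancer <= -0.5).mean() >= 0.5
# Normal-sparing: never drops below -0.3 in any control line.
# (Missing values count against a gene here, which is conservative.)
normal_sparing = (normal >= -0.3).all()

mask = pan_essential & normal_sparing
candidates = mask[mask].index

# Selectivity, with signs chosen so that larger = more cancer-selective;
# the repo's exact sign convention may differ.
selectivity = normal[candidates].mean() - cancer[candidates].mean()
print(selectivity.sort_values(ascending=False).head(50))
```

The druggability annotation is then just a join of this ranked list against the OpenTargets criteria and the scoring rubric described above.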
The resulting top-50 targets are as follows:
Notice that the targets of some classic chemo drugs are on here:
#8 is DHFR (the target of methotrexate), while #41 is TOP2A (the target of doxorubicin and etoposide), and #49 is BUB1B (one of the targets of paclitaxel).
Other chemo targets (for gemcitabine, capecitabine, and 5-fluorouracil) are also somewhat selective, but not enough to make it into the top 50.
There’s generally a fairly high density (18%) of targets with clinical or approved cancer therapies in the top 50, which suggests we’re on the right track and we are indeed looking at a set of cancer targets.
But we’re here to look for new targets. Hopefully, targets that have additional evidence of relevance for cancer, which haven’t yet been subject to a lot of drug development attention, but which are reasonably tractable as drug targets. Especially high selectivity might mean the potential for drugs that work as widely as chemo but have fewer side effects; novel mechanisms of action might suggest drugs that aren’t redundant with existing therapies.
In the rest of this post, I’ll look at a few interesting candidates.
YRDC — Not Yet Pursued
YRDC has the top selectivity score, of 0.456, and it is essential in 84.1% of cancer lines.
It is involved in tRNA modification.
Interestingly, “loss‐of‐function mutations in YRDC have been linked to severe developmental disorders—including variants of Galloway–Mowat syndrome and neonatal progeroid phenotypes—that feature telomere shortening, genomic instability, microcephaly, and renal dysfunction”. That suggests that knocking out this gene is killing fast-dividing cancer cells through interfering with genome maintenance.
It localizes to the mitochondria, and it’s especially highly expressed in bone marrow.
It’s overexpressed in a variety of cancers1 and targeting it (with a genetic knockdown) in glioblastoma can reduce tumor growth.2
There are antibodies that target it.
But so far YRDC does not appear to be the subject of a drug development effort.
I don’t know of a reason it should be impossible to target (though the mitochondrial location might present some challenges, and its concentration in the bone marrow suggests standard chemo side effects such as neutropenia would be a concern.)
TFRC — Clinical
TFRC had the #2 selectivity score, at 0.369, and was essential in 67.2% of cancers.
TFRC encodes the transferrin receptor, which enables cellular iron uptake.
Conveniently, it’s a surface protein, and it’s overexpressed in many cancers.3 It seems that cancer cells can be unusually iron-hungry, and starving them of iron can prevent growth and/or progression.
An anti-TFRC antibody-drug conjugate, INA03, recently completed Phase I trials in relapsed/refractory leukemias, where it was “well tolerated” (a phrase which here means “not vastly worse than chemo”) and showed some early signs of efficacy. 4 French company Inatherys was behind the trial.
PHF5A — Preclinical
PHF5A has the #3 selectivity score, with cancer selectivity of 0.369, and it's essential in 100% of cancer cell lines.
It’s a nuclear protein, a component of the spliceosome protein complex. Like YRDC, it’s involved in genome repair.
It’s also overexpressed in many cancers, and sometimes its expression is associated with worse survival prognosis. Almost all immune checkpoint genes (which protect cancers from the immune system) are positively associated with PHF5A expression.5
And it can apparently be pharmacologically inhibited, with pladienolide B.6 And targeted with antibodies.
I have no special reason to believe it’s either tractable or impossible to make drugs to target PHF5A, but it seems to be an open problem and one that people are working on.
ADSL — Preclinical
ADSL has the #4 selectivity score, at 0.318, and is essential in 70.5% of cancer cell lines.
It’s an enzyme involved in making purines, and thus obviously involved in DNA synthesis/repair.
Conveniently, there exist antibodies that target it.
It’s a cytosolic protein, mostly expressed in muscle.
Knocking it down in cancer-prone mice prevents the growth of liver tumors.7 Lots of other studies, all preclinical, find causal effects of ADSL inhibition or knockdown on multiple cancer types’ proliferation or invasiveness, eg. in breast cancer8.
This seems like a fairly tractable target, though not especially neglected.
SEPHS2 — Not Yet Pursued
SEPHS2 has the #5 selectivity score, at 0.316, and is essential in 54.9% of cancers.
It’s an enzyme involved in selenium metabolism.
It lives inside the nucleus, in the nucleoplasm, and is especially highly expressed in the liver and colon.
Cancer cells, especially breast cancers, take up a lot of selenium, which can be toxic, and needs SEPHS2 to detoxify it.9
Knocking out SEPHS2 can delay or prevent leukemia in mice10.
There are antibodies that target it, but so far no small molecules. It may be difficult to target due to its nuclear location.
NMT1 — Clinical
NMT1 has the #6 selectivity score, at 0.314, and is essential in 71.1% of cancers.
It is an enzyme that adds the fatty acid myristate to certain proteins, a post-translational modification that prepares them for cell signaling roles (that might be involved in cancer’s abnormal survival and proliferation signaling).
There are lots of associations between NMT1 and cancer progression in various cancers (breast11, liver12, bladder13, etc).
There are also compounds that inhibit it; one of those recently completed a Phase I trial for advanced solid tumors and relapsed B-cell lymphomas,14 led by Canadian company Pacylex Pharmaceuticals.
What’s interesting here is that this seems to be a novel pathway, not based around genome repair/synthesis, and thus potentially non-redundant with older chemo drugs.
ELL — Not Yet Pursued
ELL has the #7 selectivity score, at 0.310, and is essential in 95.1% of cancers.
It’s involved in gene transcription and lives in the nucleus. It’s a transcription factor that affects oncogenes like Myc. 15
It may be difficult to target due to its nuclear location, and there don’t seem to be any drug development programs targeting it.
GUK1 — Not Yet Pursued
GUK1 has a selectivity score of 0.294 and is essential in 94.9% of cancers.
It’s an enzyme involved in guanine metabolism.
It’s particularly highly expressed in skin, and it’s a plasma membrane protein.
It’s been found that you need GUK1 activation for lung cancer cell growth.16
There are compounds that inhibit GUK1 activity.17
So far it doesn’t seem to have been considered as a cancer drug target.
CFLAR — Preclinical
CFLAR has a selectivity score of 0.259 and is essential in 52.4% of cancers.
It’s a regulator of apoptosis that prevents CASP8 from killing cells.
Its expression is a predictor of cancer drug resistance and poor prognosis; mild heat shock can suffice to cause it to aggregate and be removed from the cell; there are also programs underway to develop direct small molecules that target it.18
DNM1L — Preclinical
DNM1L has a selectivity score of 0.259 and is essential in 57.2% of cancers.
It’s an enzyme that regulates mitochondrial fission. Normally it’s a cytosolic protein.
Its expression is associated with progression in stomach cancer19, pancreatic cancer20, endometrial cancer21, and others. Many DNM1L inhibitors have been developed, but they’re not specific enough to be good cancer therapies yet.22
TRPM7 — Preclinical
TRPM7 has a selectivity score of 0.229 and is essential in 64.5% of cancers.
TRPM7 is a calcium ion channel on the cell membrane.
TRPM7 is necessary for breast cancer metastasis, via its effect on cell-cell adhesion; it makes cells less stiff and thus less able to stick together, loosening them so they can migrate.23 It does the same in ovarian cancer24 and prostate cancer.25 Interestingly, it is itself a mechanosensor that detects stretching of the cell membrane.26
Work is currently being done on developing selective inhibitors of TRPM7.
1Qiao, Li, Yuetong Zhang, and Pin Huang. “YRDC is a Prognostic‐Related Biomarker Correlated With Immune Infiltration and Drug Sensitivity in Pan‐Cancer.” Cancer Reports 8.9 (2025): e70325.
2Wu, Xujia, et al. “Threonine fuels glioblastoma through YRDC-mediated codon-biased translational reprogramming.” Nature cancer 5.7 (2024): 1024-1044.
3Shen, Ying, et al. “Transferrin receptor 1 in cancer: a new sight for cancer therapy.” American journal of cancer research 8.6 (2018): 916.
4Garciaz, Sylvain, et al. “Results of a Phase 1, First-in-Human Study of INA03, an Anti-CD71 Antibody-Drug Conjugate in Patients with Relapsed or Refractory (R/R) Acute Leukemias.” Blood 144 (2024): 1496.
5Ding, Na, Meiping Li, and Xiaokun Zhao. “PHF5A is a potential diagnostic, prognostic, and immunological biomarker in pan-cancer.” Scientific Reports 13.1 (2023): 17521.
6Zhang, Zhandong, et al. “PHF5A facilitates the development and progression of gastric cancer through SKP2-mediated stabilization of FOS.” Journal of Translational Medicine 21.1 (2023): 5.
7Jiang, Tingting, et al. “Targeting de novo purine synthesis pathway via ADSL depletion impairs liver cancer growth by perturbing mitochondrial function.” Hepatology (Baltimore, Md.) 74.1 (2021): 233.
8Zurlo, Giada, et al. “Prolyl hydroxylase substrate adenylosuccinate lyase is an oncogenic driver in triple negative breast cancer.” Nature Communications 10.1 (2019): 5177.
9Carlisle, Anne E., et al. “Selenium detoxification is required for cancer-cell survival.” Nature metabolism 2.7 (2020): 603-611.
10Lin, Charles Y., et al. “An oncogenic enhancer encodes selective selenium dependency in AML.” Blood 134 (2019): 638.
11Deng, Lu, et al. “NMT1 inhibition modulates breast cancer progression through stress-triggered JNK pathway.” Cell death & disease 9.12 (2018): 1143.
12Tan, Xiang-Peng, et al. “Blockade of NMT1 enzymatic activity inhibits N-myristoylation of VILIP3 protein and suppresses liver cancer progression.” Signal Transduction and Targeted Therapy 8.1 (2023): 14.
13Sun, Yi, et al. “N-myristoyltransferase-1 deficiency blocks myristoylation of LAMTOR1 and inhibits bladder cancer progression.” Cancer letters 529 (2022): 126-138.
14Sangha, Randeep, et al. “A first-in-human phase I trial of daily oral zelenirstat, a N-myristoyltransferase inhibitor, in patients with advanced solid tumors and relapsed/refractory B-cell lymphomas.” Investigational New Drugs 42.4 (2024): 386-393.
15Liang, Kaiwei, et al. “Targeting processive transcription elongation via SEC disruption for MYC-induced cancer therapy.” Cell 175.3 (2018): 766-779.
16Schneider, Jaime L., et al. “GUK1 activation is a metabolic liability in lung cancer.” Cell 188.5 (2025): 1248-1264.
17Hidalgo‐Gutierrez, Agustin, et al. “Guanylate kinase 1 deficiency: a novel and potentially treatable mitochondrial DNA depletion/deletions disease.” Annals of Neurology 96.6 (2024): 1209-1224.
18Humphreys, Luke, Margarita Espona‐Fiedler, and Daniel B. Longley. “FLIP as a therapeutic target in cancer.” The FEBS journal 285.22 (2018): 4104-4123.
19Zhao, Zhuo, et al. “Prognostic value and immunomodulatory role of DNM1L in gastric adenocarcinoma.” Frontiers in Oncology 14 (2024): 1453795.
20Liang, Jing, et al. “DRP1 upregulation promotes pancreatic cancer growth and metastasis through increased aerobic glycolysis.” Journal of Gastroenterology and Hepatology 35.5 (2020): 885-895.
21Guo, Jing, et al. “Drp1 mediates high glucose-induced mitochondrial dysfunction and epithelial-mesenchymal transition in endometrial cancer cells.” Experimental cell research 389.1 (2020): 111880.
22Mishra, Soumya Ranjan, et al. “Intricate role of DRP1 and associated mitochondrial fission signaling in carcinogenesis and cancer progression.” Biochimica et Biophysica Acta (BBA)-Reviews on Cancer (2025): 189453.
23Middelbeek, Jeroen, et al. “TRPM7 is required for breast tumor cell metastasis.” Cancer research 72.16 (2012): 4250-4261.
24Wang, Zhi-Bin, et al. “Roles of TRPM7 in ovarian cancer.” Biochemical Pharmacology 217 (2023): 115857.
25Chen, Liang, et al. “Downregulation of TRPM7 suppressed migration and invasion by regulating epithelial–mesenchymal transition in prostate cancer cells.” Medical Oncology 34.7 (2017): 127.
26Numata, Tomohiro, Takahiro Shimizu, and Yasunobu Okada. “Direct mechano-stress sensitivity of TRPM7 channel.” Cellular Physiology and Biochemistry 19.1-4 (2007): 1-8.
Discuss
Understanding complex conjugates in quantum mechanics
Why does quantum mechanics use complex numbers extensively? Why is the inner product of a Hilbert space antilinear in the first argument? Why are Hermitian operators important for representing observables? And what is the i in the Schrödinger equation doing? This post explores these questions through the framework of groupoid representation theory. While this post assumes basic familiarity with complex vector spaces and quantum notation, it does not require much pre-existing conceptual understanding of QM.
Roughly, there are two kinds of complex numbers in physics. One is a phasor: something that has a phase. The other is a scalar: a unitless number representing a combined scale factor and phase-translation. Scalars act on phasors by translating their phases. Generally speaking, scalars are better understood as elements of an algebraic structure (groups, monoids, rings, fields), while phasors are better understood as vectors or components of vectors. We will informally use the term "multi-phasor" for a collection of phasors, such as an element of C^n.
An example of a phasor would be an ideal harmonic oscillator, which has a position given by x(t)=a⋅sin(ωt+ϕ). Its state is best thought of as also including its velocity x′(t)=aω⋅cos(ωt+ϕ). Note that (x′(t),x(t)ω) has constant magnitude, corresponding to conservation of energy. Over time, the normalized R^2 point moves in a circle. Phase-translating the system would imply cyclical movement through phase space; a full cycle happens in time 2π/ω. A complex scalar specifies both how to phase-translate a phasor, and how to scale it (here, scaling would apply to both position and velocity). By representing the phasor (x′(t),x(t)ω) as a complex number such as x′(t)+x(t)ωi, multiplying by a complex scalar will phase-translate and scale. Here, multiplying by i represents moving forward a quarter of one cycle, though in other representations, -i would do so instead. The phasor is inherently more symmetric than the scalar; which phase to consider "1" in this complex representation, and whether multiplication by i steps time forwards or backwards, are fairly arbitrary.
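As a sanity check of this picture, here is a minimal numerical sketch (NumPy; the amplitude, frequency, and phase values below are arbitrary illustrations, not taken from the post). It represents the oscillator state as the complex number x′(t)+x(t)ωi and verifies that multiplying by a unitary scalar translates the phase, that multiplying by i advances a quarter cycle, and that the magnitude is conserved.

```python
import numpy as np

# Toy phasor for an ideal harmonic oscillator: z(t) = x'(t) + x(t)·ω·i.
# The amplitude a, frequency omega, and phase phi are arbitrary illustrative values.
a, omega, phi = 2.0, 3.0, 0.5

def phasor(t):
    x = a * np.sin(omega * t + phi)          # position
    v = a * omega * np.cos(omega * t + phi)  # velocity
    return v + omega * x * 1j

t, dt = 0.7, 0.2
z = phasor(t)

# Multiplying by the unitary scalar e^{i·omega·dt} translates the phase by dt in time.
assert np.isclose(z * np.exp(1j * omega * dt), phasor(t + dt))

# Multiplying by i advances a quarter of a cycle (dt = (pi/2)/omega in this convention).
assert np.isclose(z * 1j, phasor(t + (np.pi / 2) / omega))

# The magnitude |z| = a·omega is constant over time, mirroring conservation of energy.
assert np.isclose(abs(z), a * omega)
```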
To understand complex scalars group-theoretically, let us denote by C× the non-zero complex numbers, considered as a group under multiplication. An element of the group can be thought of as a combined positive scaling and phase translation. Let U(1) be the sub-group of C× consisting of unitary complex numbers (those with absolute value 1); see also unitary group. Let R+ be the positive reals considered as a group under multiplication. Now the decomposition C×≅R+×U(1) holds: multiplication by a non-zero complex number combines scaling and phase translation.
First attempt: C×-symmetric sets
If G is a group, let BG be the delooping groupoid: a groupoid with a single object (⋆), and one morphism per element of G. A convenient notation for the category of G-symmetric sets (and equivariant maps between them) is the functor category [BG, Set]. In this case, G=C×.
A C×-symmetric set is a set S with a group action λ∗s (λ∈C×,s∈S) satisfying 1∗s=s and a∗(b∗s)=ab∗s. We now have a set of elements that can be scaled and phase-translated. Hence, S conceptually represents a set of phasors (or multi-phasors), which are acted on by complex scalars.
Let S, T be C×-symmetric sets. A map f:S→T is equivariant iff f(λ∗x)=λ∗f(x) for all λ∈C×. This is looking a lot like linearity, though we do not have zero or addition. To handle additivity, it will help to factor out the R+ symmetry.
Second attempt: U(1)-symmetric real vector spaces
We will use real vector spaces to factor out the R+ symmetry. While we could use R≥0 semimodules (to model negation as action by −1∈U(1)), real vector spaces are mathematically nicer. Let VectR be the category of real vector spaces and linear maps between them.
To get at the idea of using real vector spaces to handle R+ symmetry, we consider the functor category [BU(1),VectR]. Each element is a real vector space with a U(1) action. We can write the action as a∗s for complex unitary a. Note s↦a∗s is linear for fixed a.
Let U, V be real vector spaces with U(1) symmetry. A linear map f:U→V is U(1)-equivariant iff f(a∗x)=a∗f(x) for all complex unitary a.
Now suppose we have opposite-phase cancellation: (−1)∗x=−x for x∈U, which is valid for ideal harmonic oscillators, and of course relevant to destructive interference. We can then extend U to a complex vector space, defining scalar multiplication as (a+bi)⋅x=a⋅x+b⋅(i∗x) for real a, b. This is a standard linear complex structure with the linear automorphism x↦i∗x. The assumption of opposite-phase cancellation is therefore the only distinction between a U(1)-symmetric real vector space in [BU(1),VectR] and a proper complex vector space.
(an abstract representation of the double-slit experiment, depicting opposite-phase cancellation through representation of phasors as colors)
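To make the induced complex structure concrete, here is a small sketch (an illustration, not from the post): take U = R^2 with U(1) acting by rotations, let i act as a quarter-turn, and check that the scalar multiplication (a+bi)⋅x = a⋅x + b⋅(i∗x) behaves exactly like ordinary complex multiplication under the identification (p,q) ↔ p+qi.

```python
import numpy as np

# U(1) acting on R^2 by rotations; the action of i is a quarter-turn.
Iact = np.array([[0.0, -1.0],
                 [1.0,  0.0]])  # x ↦ i∗x

def cmul(lam, x):
    # (a + bi)·x = a·x + b·(i∗x): the complex structure induced by the i-action.
    return lam.real * x + lam.imag * (Iact @ x)

def to_c(v):
    # Identification (p, q) ↔ p + qi.
    return v[0] + 1j * v[1]

x = np.array([2.0, -1.0])
lam, mu = 0.3 - 1.2j, 1.5 + 0.4j

# The induced scalar multiplication agrees with ordinary complex multiplication...
assert np.isclose(to_c(cmul(lam, x)), lam * to_c(x))
# ...and is associative over C: lam·(mu·x) = (lam·mu)·x.
assert np.allclose(cmul(lam, cmul(mu, x)), cmul(lam * mu, x))
```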
Third attempt: O(2)-symmetric real vector spaces
We see that [BU(1),VectR] is close to the category of complex vector spaces and linear maps between them. Note that U(1)≅SO(2), where SO(2) is the group of rotations of the Euclidean plane. This of course relates to visualizing phase translation as rotation, and seeing phasors as moving in a circle. While SO(2) gives the 2D rotational symmetries of a circle, it does not give all symmetries of a circle. That would be the orthogonal group O(2), which includes both rotation and mirroring. We could conceptualize mirroring as a quasi-scalar action: if action by i rotates a wheel counter-clockwise 90 degrees, then mirroring is like turning the wheel to its opposite side, so clockwise reverses with counter-clockwise.
To make the application to phase translation more direct, we will present O(2) using unitary complex numbers. The group has the following elements (closed under group operations): M(a) for complex unitary a, meant to represent a phase translation, and J, meant to represent a distinguished mirroring. We have the following algebraic identities:
- M(a)M(b)=M(ab)
- J=J⁻¹
- M(a)J=JM(¯a)
Note that since a is unitary, the conjugate ¯a=a⁻¹ is its inverse. We can derive that JM(a)J=M(a)⁻¹, so mirroring reverses the way phase translations go, as expected.
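These identities can be checked numerically in the familiar 2×2 real-matrix picture of O(2): a rotation matrix stands in for M(a) with a=e^(iθ), and a reflection across the x-axis stands in for J (the choice of mirror axis is an arbitrary illustration).

```python
import numpy as np

def M(theta):
    # Rotation by theta: the 2x2 real picture of the unitary scalar e^{i·theta}.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# J: a distinguished mirroring (here, reflection across the x-axis).
J = np.array([[1.0, 0.0], [0.0, -1.0]])

theta, phi = 0.8, -1.3

# M(a)M(b) = M(ab): rotations compose by adding angles.
assert np.allclose(M(theta) @ M(phi), M(theta + phi))

# J = J^{-1}: mirroring twice is the identity.
assert np.allclose(J @ J, np.eye(2))

# M(a)J = J M(¯a): pushing a mirroring past a rotation conjugates the scalar (negates the angle).
assert np.allclose(M(theta) @ J, J @ M(-theta))

# Derived identity J M(a) J = M(a)^{-1}: mirroring reverses the direction of phase translation.
assert np.allclose(J @ M(theta) @ J, M(-theta))
```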
Now the category [BO(2),VectR], noting the previous correspondence with complex vector spaces, motivates the following definition. A real structure on a complex vector space V is a function σ:V→V that is an antilinear involution, i.e.:
- σ(σ(v))=v
- σ(λv)=¯λσ(v) for complex λ
- σ(u+v)=σ(u)+σ(v)
For example, C has a real structure σ(λ)=¯λ. So the involution σ generalizes the complex conjugate.
Now, if U and V are complex vector spaces with real structures σU, σV, then a linear map f:U→V is σ-linear iff it satisfies f(σU(x))=σV(f(x)) for all x∈U. This condition accords with O(2)-equivariance.
While real structures are useful in quantum mechanics (notably in the theory of C* algebras), they are not well suited for quantum states themselves. Imposing a real structure on the state space, and a corresponding σ-linearity condition on state transitions, is too restrictive for Schrödinger time evolution.
Fourth attempt: O(2) as a groupoid
Since O(2) has two topologically connected components (mirrored and un-mirrored), it is perhaps reasonable to separate them out, like two copies of U(1) glued together. Conceptually, this allows conceiving of two phase-translatable spaces that mirror each other, rather than treating mirroring as an action within any phase-translatable space. We consider a "polar unitary groupoid" U(1)±, which has two objects ⋆+, ⋆−. For complex unitary a, we have morphisms M+(a):⋆+→⋆+ and M−(a):⋆−→⋆−, which compose and invert as usual for U(1). We also have an isomorphism J:⋆+→⋆− satisfying J∘M+(a)=M−(¯a)∘J. U(1)± relates to O(2) through a full and faithful functor (J↦J, M+(a)↦M(a), M−(a)↦M(a)). The only essential difference is in separating the two connected components of O(2) into separate objects of the groupoid U(1)±.
Now we can consider the functor category [U(1)±,VectR]. An object of this category picks out two U(1)-symmetric real vector spaces (of which complex vector spaces are an important class), and provides a real-linear isomorphism between them corresponding to J; this isomorphism is not, in general, complex-linear. Importantly, the groupoidal identity J∘M+(a)=M−(¯a)∘J yields a corresponding fact that the two U(1)-symmetric real vector spaces have opposite phase-translation actions.
To simplify, we can achieve opposite phase-translation actions as follows. Let V be a complex vector space. Let ¯V be a complex vector space with the same elements as V and the same addition function. The only difference is that scalar multiplication is conjugated: multiplying by a in ¯V is the same as multiplying by ¯a in V. We call ¯V the complex conjugate space of V.
Improving the notation, if v∈V, we write ¯v∈¯V for the corresponding vector in the complex conjugate space. Note the following:
- ¯(¯V)=V
- ¯(¯v)=v
- ¯λ⋅¯v=¯(λv) for complex λ
- ¯u+¯v=¯(u+v)
The choice of ¯v notation here is not entirely standard (although ¯V is standard), but it is convenient in that, for example, ¯λ⋅¯v looks like ¯(λv), and they are in fact equal.
Let U and V be complex vector spaces. If f:U→V, define ¯f:¯U→¯V as ¯f(¯u)=¯(f(u)). This definition matches what we would expect from morphisms (natural transformations) in [U(1)±,VectR]. By treating f and ¯f as separate functions, we avoid the rigidity of σ-linearity.
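In coordinates this is very tame. If we (as an illustrative convention) represent ¯u by the entrywise conjugate of u's coordinate vector, then the matrix of ¯f is just the entrywise conjugate of the matrix of f:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3)) + 1j * rng.normal(size=(2, 3))  # matrix of a linear map f: C^3 → C^2
u = rng.normal(size=3) + 1j * rng.normal(size=3)

# Representing ¯u by the entrywise conjugate of u, the conjugate map ¯f is given by
# the entrywise-conjugated matrix: ¯f(¯u) = ¯(f(u)).
assert np.allclose(A.conj() @ u.conj(), (A @ u).conj())
```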
What we have finally derived is a simple idea (complex vector spaces and their conjugates), but with a different groupoid-theoretic understanding. Now we will relate this understanding to quantum mechanics.
The inner product
In bridging from complex vector spaces to the complex Hilbert spaces used in quantum mechanics, the first step is to add an inner product, forming a complex inner product space. Traditionally, the inner product is a function ⟨−,−⟩:H×H→C, where H is a Hilbert space (or other complex inner product space). While the inner product is linear in its second argument, it is notoriously anti-linear in its first argument. So while on the one hand ⟨u,λv⟩=λ⟨u,v⟩, on the other hand, ⟨λu,v⟩=¯λ⟨u,v⟩. Also, the inner product is conjugate symmetric: ⟨u,v⟩=¯⟨v,u⟩.
The anti-linearity and conjugate symmetry properties are not initially intuitive. To directly motivate anti-linearity, let ψ∈H be a quantum state. Now the inner product ⟨ψ,ψ⟩ gives the square of the norm of the state ψ, as a non-negative real number. If the inner product were bilinear, then we would have ⟨iψ,iψ⟩=i²⟨ψ,ψ⟩=−⟨ψ,ψ⟩. But multiplying ψ by i is just supposed to change the phase, not change the squared norm. Due to antilinearity, ⟨iψ,iψ⟩=¯i⋅i⟨ψ,ψ⟩=⟨ψ,ψ⟩ as expected.
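NumPy's vdot happens to follow the same convention (it conjugates its first argument), so the motivation can be checked directly; the state below is just a random illustrative vector.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)  # an arbitrary illustrative state

# np.vdot conjugates its first argument, matching an inner product that is
# antilinear in the first slot. The squared norm is a non-negative real number.
norm_sq = np.vdot(psi, psi)
assert np.isclose(norm_sq.imag, 0.0) and norm_sq.real > 0

# Multiplying the state by i changes only its phase, not its squared norm.
assert np.isclose(np.vdot(1j * psi, 1j * psi), norm_sq)

# A genuinely bilinear pairing would instead flip the sign, since i·i = -1.
assert np.isclose(np.dot(1j * psi, 1j * psi), -np.dot(psi, psi))
```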
Now, the notion of a complex conjugate space is directly relevant. We can take the inner product as a bilinear map ⟨−,−⟩:¯V×V→C. The complex conjugate space ¯V gracefully handles antilinearity: ⟨¯(λu),v⟩=⟨¯λ⋅¯u,v⟩=¯λ⟨¯u,v⟩. And we recover conjugate symmetry as ⟨¯u,v⟩=¯⟨¯v,u⟩; the overlines make the conjugate symmetry more intuitive, as we can see that the parity of conjugation is preserved.
Using the universal property of the tensor product, we can equivalently see a bilinear map ¯V×V→C as a linear map ¯V⊗V→C; for the inner product, this corresponding map is ⟨¯u⊗v⟩=⟨¯u,v⟩. This correspondence motivates studying the complex vector space ¯V⊗V.
Real structure on tensor products
The space ¯V⊗V has a real structure, by swapping: σ(¯u⊗v)=¯v⊗u. To check:
σ(λ(¯u⊗v))=σ(¯u⊗λv)=¯(λv)⊗u=¯λ(¯v⊗u)=¯λσ(¯u⊗v)
as desired. Of course, C also has a real structure, so we can consider σ-linear maps ¯V⊗V→C. First we wish to check that the tensor-promoted inner product is σ-linear: ⟨σ(¯u⊗v)⟩=⟨¯v⊗u⟩=¯⟨¯u⊗v⟩.
Noticing that the inner product is σ-linear of course raises the question of whether there are other interesting σ-linear maps ¯V⊗V→C. But we need to bridge to standard notation first.
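One concrete (and entirely illustrative) way to see the swap is to identify ¯u⊗v with the rank-one matrix |v⟩⟨u|; the swap real structure then becomes the conjugate transpose, and the tensor-promoted inner product becomes the trace, which is visibly σ-linear.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=3) + 1j * rng.normal(size=3)
v = rng.normal(size=3) + 1j * rng.normal(size=3)

# Identify ¯u⊗v with the rank-one matrix |v⟩⟨u| = v u†  (an illustrative convention).
T = np.outer(v, u.conj())

# The swap real structure σ(¯u⊗v) = ¯v⊗u is then the conjugate transpose.
assert np.allclose(T.conj().T, np.outer(u, v.conj()))

# The tensor-promoted inner product becomes the trace, and it is σ-linear:
# it sends the swapped tensor to the complex conjugate of the original value.
assert np.isclose(np.trace(T), np.vdot(u, v))
assert np.isclose(np.trace(T.conj().T), np.conj(np.trace(T)))
```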
Bra-kets and dual spaces
Traditionally, a 'ket' |ψ⟩ is notation for a vector in the Hilbert space H. A 'bra' ⟨ψ| is an element of the dual space of linear functionals of the form H→C; this dual space is called H∨. We convert between bras and kets as follows. Given a ket v=|ψ⟩, the corresponding bra is ⟨v,−⟩∈H∨, which linearly maps kets to complex numbers. The ket-to-bra mapping is invertible (by the Riesz representation theorem), and antilinear.
In our alternative notation, we would like the dual V∨ to be linearly, not antilinearly, isomorphic with ¯V. This is straightforward: given ¯u∈¯V, we take the partial application ⟨¯u,−⟩∈V∨. This mapping from ¯V to V∨ is a linear isomorphism when V is a Hilbert space: ⟨λ⋅¯v,−⟩=λ⟨¯v,−⟩ (note the non-standard notation!). As such, ¯V≅V∨; the dual space is isomorphic to the complex conjugate space.
Tensoring operators
We would now like to understand linear operators, which are linear maps A:V→V. Because V≅(¯V)∨, we can see the operator as a linear map V→(¯V)∨, or expanded out, V→(¯V→C). Tensoring up, this is equivalently a linear map ¯V⊗V→C. Of course, this is related to the standard operator notation ⟨ϕ|A|ψ⟩; we can see the operator as a bilinear pairing of a bra and a ket.
More explicitly, if A:V→V is linear, the corresponding tensored map is A⊗(¯u⊗v)=⟨¯u⊗Av⟩. We would like to understand real structure on linear operators through real structure on tensored maps of this type. If f:¯V⊗V→C is linear, we define the real structure σ(f)(¯u⊗v)=¯(f(σ(¯u⊗v)))=¯(f(¯v⊗u)). As a quick check:
σ(λf)(¯u⊗v)=¯(λf(¯v⊗u))=¯λ⋅¯(f(¯v⊗u))=¯λσ(f)(¯u⊗v).
We can apply this real structure to A⊗:
σ(A⊗)(¯u⊗v)=¯(A⊗(¯v⊗u))=¯⟨¯v,Au⟩=⟨¯(Au),v⟩.
By definition, the Hermitian adjoint A†:V→V satisfies ⟨¯u⊗Av⟩=⟨¯(A†u)⊗v⟩; note (A†)†=A. As such,
(A†)⊗(¯u⊗v)=⟨¯u⊗A†v⟩=⟨¯(Au),v⟩
Therefore, σ(A⊗)=(A†)⊗. This justifies the Hermitian adjoint as the canonical real structure on the linear operator space V→V, as is standard in operator algebra.
Now we can ask: When is A⊗ σ-linear?
A⊗(σ(x))=¯(A⊗(x)) ⇔ ¯(A⊗(σ(x)))=A⊗(x)
⇔ σ(A⊗)(x)=A⊗(x) ⇔ (A†)⊗(x)=A⊗(x)
Assuming V is a Hilbert space, this holds for all x∈¯V⊗V iff A=A†, i.e. A is Hermitian. This is significant, because Hermitians are often used to represent observables (such as in POVMs). It turns out that, among linear maps V→V, the Hermitians are exactly those whose corresponding tensored maps A⊗ are σ-linear.
Let a member of a complex vector space x∈X with a real structure σ be called self-adjoint iff σ(x)=x. As an important implication of the above, if A is Hermitian, then A⊗ maps self-adjoint tensors (such as those corresponding with density matrices) to self-adjoint complex numbers (i.e. real numbers). This is, of course, helpful for calculating probabilities, as probabilities are real numbers.
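A quick numerical illustration of both points (random matrices, nothing physical): a Hermitian operator's tensored map is σ-linear, and pairing it with a self-adjoint density matrix yields a real expectation value, while a generic non-Hermitian operator does not.

```python
import numpy as np

rng = np.random.default_rng(2)

def randc(*shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

# A Hermitian "observable" A = A† (a random illustrative matrix, not a specific physical operator).
X = randc(3, 3)
A = X + X.conj().T

# A density matrix: self-adjoint, positive semi-definite, trace 1.
Y = randc(3, 3)
rho = Y @ Y.conj().T
rho = rho / np.trace(rho)

# σ-linearity of the tensored map: ⟨u, A v⟩ equals the conjugate of ⟨v, A u⟩.
u, v = randc(3), randc(3)
assert np.isclose(np.vdot(u, A @ v), np.conj(np.vdot(v, A @ u)))

# Self-adjoint in, self-adjoint out: the expectation tr(ρA) is a real number.
assert np.isclose(np.trace(rho @ A).imag, 0.0)

# For a generic non-Hermitian operator the value is generally not real.
B = randc(3, 3)
assert not np.isclose(np.trace(rho @ B).imag, 0.0)
```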
Unitary evolution and time reversal
While Hermitian operators are those for which A†=A, unitary operators are those for which B†=B⁻¹. We will consider time evolution as a family of unitary operators U(t) for real t, which is group homomorphic as a family (U(0)=I, U(a+b)=U(a)U(b)).
A simple, classical example of unitary evolution is that of a phasor representation of a simple harmonic oscillator, x′(t)+ωx(t)i=a⋅e^((ωt+θ)i) (for real a). The unitary evolution is given by u(t)=e^(ωti)∈U(1), a multiplicative factor on the phasor to advance it in time. By convention, we have decided that time evolves in the +i direction (multiplicatively), assuming ω>0. We can find this direction explicitly by differentiating: u′(0)=ωi.
With classical phasors, it is easy to see what physical quantities the representation corresponds to; here, the imaginary part of the phasor represents the position multiplied by the frequency ω. Interpreting quantum phasors is less straightforward. We can still take the derivative U′(t); to first order, U(ϵ)≈I+ϵU′(0) as ϵ→0. We recover U(t) through the matrix exponential U(t)=e^(tU′(0)), which generalizes u(t)=e^(tu′(0)) in the single-phasor case.
Because the family U(t) is unitary, we have U′(0)=−iH for Hermitian H; note −H is Hermitian iff H is. In the specific case of the Schrödinger equation, H=Ĥ/ℏ, where Ĥ is the Hamiltonian and ℏ is the reduced Planck constant (a positive real number). The direction of the action of i in quantum state space is meaningful through the Schrödinger convention U′(0)=−iĤ/ℏ (as opposed to U′(0)=iĤ/ℏ).
Complex conjugation therefore relates to time reversal, though it is not identical with it. U(t)=e^(tU′(0))=e^(−itH), while U(−t)=e^(−tU′(0))=e^(itH). In the real structure on the linear operator space V→V given by (−)†, a Hermitian is self-adjoint, like a real number in C, while U′(0) is skew-adjoint ((U′(0))†=−U′(0)), like an imaginary number in C (i.e. bi for real b).
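These relations are easy to check numerically, with ℏ set to 1 and a random Hermitian matrix standing in for the Hamiltonian (an illustration, not a specific physical system):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
H = X + X.conj().T  # a random Hermitian "Hamiltonian" (with ħ = 1)

def U(t):
    # Schrödinger-convention time evolution U(t) = e^{-itH}.
    return expm(-1j * t * H)

t, s = 0.4, 1.1

# U(t) is unitary, and the family is group homomorphic: U(t+s) = U(t)U(s).
assert np.allclose(U(t) @ U(t).conj().T, np.eye(3))
assert np.allclose(U(t + s), U(t) @ U(s))

# The generator U'(0) = -iH is skew-adjoint, like an imaginary number.
G = -1j * H
assert np.allclose(G.conj().T, -G)

# Reversing time is the Hermitian adjoint of evolving forward: U(-t) = U(t)†.
assert np.allclose(U(-t), U(t).conj().T)
```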
To bridge to standard time reversal in physics, complex conjugates relate to time reversal operators in that a time reversal operator T:H→H (satisfying T⁻¹U(t)T=U(−t)) is anti-linear, due to the relationship between time and phase. In the simpler case, T²=I, though in systems with half-integer spin, T²=−I; see Kramers' theorem for details. In the latter case, time reversal yields a quaternionic structure on Hilbert space (rather than a real structure). In relativistic quantum field theory, one may alternatively consider the combined CPT operation, which includes time reversal but typically squares to the identity. Like time reversal, CPT is anti-linear; either C or P on its own would be linear, so the anti-linearity of CPT necessarily comes from time reversal.
Complex conjugation is not itself time reversal, but any symmetry that reverses time must conjugate the complex structure; CPT is the physically meaningful anti-linear involution that accomplishes this. The sign applied to i in the Schrödinger equation is not an additional law of nature, but a choice of orientation. Nature respects the equivalence of both choices, while observables live in the self-adjoint subspace where the distinction disappears.
Conclusion
Groupoid representation theory helps to understand Hilbert spaces and their relation to operator algebras. It raises the simple question, "if action by a unitary complex number is like rotating a circle, what is like mirroring the circle?". This question can be answered precisely with a real structure, and less precisely with a conjugate vector space. The conjugate vector space helps recover a real structure, through the swap on the tensor product ¯V⊗V, which relates to the inner product and the Hermitian adjoint (−)†.
While on the one hand, complex conjugation is a simple algebraic isomorphism (if i is a valid imaginary unit, then so is -i), on the other hand it has a deep relationship with physics. The Schrödinger equation relates i to a direction of time evolution; complex conjugation goes along with time reversal. The Hermitian adjoint, as a real structure on the linear operator space, generalizes the complex conjugate; it keeps Hermitians (such as observables and density matrices) the same, while reversing unitary time evolution.
Much of the apparent mathematical complexity of quantum mechanics clicks when viewed through representation theory. Algebra, not just empirical reality, constrains the theoretical framework. Geometric representations of physical algebras serve both as shared intuition pumps and as connections with the (approximately) classical phenomenal space in which empirical measurements appear. Understanding the complex conjugate through representation theory is not advanced theoretical physics, but it is, I hope, illustrative and educational.
Discuss
[Linkpost] On the Origins of Algorithmic Progress in AI
This is a linkpost to a new Substack article from MIT FutureTech explaining our recent paper On the Origins of Algorithmic Progress in AI.
We demonstrate that some algorithmic innovations have efficiency gains which get larger as pre-training compute increases. These scale-dependent innovations constitute the majority of pre-training efficiency gains over the last decade, which may imply that what looks like algorithmic progress is driven by compute scaling rather than many incremental innovations.
From the paper, our core contributions are:
- We find that most of the algorithmic innovations we experimentally evaluate have small, scale-invariant efficiency improvements: less than a 10× compute-efficiency gain overall, representing less than 10% of total improvements when extrapolated to the 2025 compute frontier (2 × 10²³ FLOPs). This suggests that scale-invariant algorithmic progress contributes only a minor share of overall efficiency improvements.
- We find two strongly scale-dependent algorithmic innovations: LSTMs to Transformers, and Kaplan to Chinchilla re-balancing. Together, these account for 91% of total efficiency gains when extrapolating to the 2025 compute frontier. This implies that algorithmic progress for small-scale models is several orders of magnitude smaller than previously thought.
- We show that in the presence of scale-dependent innovations, not only do efficiency gains require continued compute investment, but the rate of algorithmic progress strongly depends on your choice of reference algorithm. In other words, the rate of progress in successive models can appear exponential relative to one baseline algorithm, yet be zero relative to another.
MIT FutureTech is an interdisciplinary lab at the intersection of computer science and economics, focused specifically on trends in AI and computing, and funded in part by Coefficient Giving.
Discuss