LessWrong.com News
What are you excited about doing?
The other day, on the local college campus, I noticed that people who are walking by themselves are way more likely to be looking at their phone (or wearing headphones) than people who are walking with someone else. Specifically, about three times as likely: ~71% (n=80) vs 25% (n=40).
This sounds extremely obvious, but there’s something there: People are more likely to turn to their phones when they’re not engaged in something else. (In this case, talking to their friends.)
Okay, that also sounds obvious. But! Maybe this isn’t:
It follows that if you want to spend less time sucked into your devices, you should spend more time engaged in other things.
Great, you say. But what other things!! What is engaging besides phones!?
Fair question. Well, it’ll be things that you find engaging, and I don’t know what those are. Do you?
Some people really know what they want, somehow — my sister, for example, just wants to write books and take care of her family. If you’re like her, great, you already know what you value, so just do more of that.
But I am not like that, and maybe you aren’t either.
So, here are a few prompts that I think are good at surfacing things that you value, or that excite you.
I asked a friend to describe his ideal day, and his answer was “spend all day climbing with my friends, and all night dancing with my friends”. That sounds unsustainable for more than like one day, but it certainly points in the direction of some concrete activities: climbing, dancing, and spending time with friends.
What would an ideal day look like for you?
I got my first phone when I went to college, and I quickly became embroiled in fraught messaging with a terrible boy who I was obsessed with. (Ah, youth.) That year for Thanksgiving, my sweet little cousins visited from Asia for the first time in three years. I loved those cousins and I only had a few days with them, but I spent most of the holiday standing in the corner, checking my phone for messages from a terrible boy.
It’s been over a decade, but I still remember my cousin coming and tugging on my leg and saying in her plaintive little voice, “Oh, please won’t you come and play with us?” And I still regret spending that time on my phone instead of with her.
What do you want to pay attention to?
I had this dress I really loved, but it was broken in a way that needed a sewing machine to fix, and I didn’t know how to use a sewing machine. So my beloved dress lay in a sad pile for more than two years. Then, one day, a friend came over and showed me how to set up my sewing machine. I looked up how to use it, and practiced, and I made tons of rookie mistakes, but I got better. And eventually I felt ready to fix the dress, and now I can wear it again.
Or, like everyone, I had a lot of books on my shelf that I always meant to read, but never found the time. Now I’ve read a lot of them, and it turns out some of them are bangers.
What do you always mean to do but never get around to?
When I was in college I doodled all the time, expressing my overflowing angst with dark ballpoint pen and tasteful splashes of red. I’ve never thought of myself as someone who knew how to draw, because the only real-world objects I can vaguely reproduce are trees and single eyes. But I really liked this doodling.
What did you used to love to do that you never do anymore?
.
.
.
.
So, now you’ve thought about things to do that might be exciting / rewarding / interesting / engaging / better than your phone! Will you suddenly find the time and motivation to just go do them? Maybe! Probably not. But at least you know what they are now. That will be useful going forward. Stick with me.
Intentionality
I want to point to an important variable of personal experience: how much are you consulting with an explicit representation of what you intend to be doing?
I think "intention" is a better handle than most, for what I'm trying to point at.[1] I think a common handle would be "should" -- as in "what I should be doing". But you can think you "should", say, go to the dentist, while having no intention of doing so. I want to point at a more behaviorist notion, where (in order to count) an explicit representation of your goals is a signal which you are at least sometimes responsive to; causal reasons why you do one thing rather than another.[2]
So, for example, I keep a notebook open on my desk, where I write to-do items. If I write something in the notebook, it explicitly sets the intention to do the thing, and it remains in my field of view. I might follow up on it immediately, in which case the external memory was not really useful as memory but rather as a clear signal to myself that it was a priority for me.
I might also spend the day in the living-room, where the work notebook is not visible. Where I sit is another sort of representation of what I intend: if I'm seated at my work desk, I almost always intend to be working, whereas if I'm seated in the living room, I intend to be relaxing ("doing whatever I want" -- which can include work-like things, but approached with a more playful attitude).
My thoughts can also serve as "explicit representations" in the relevant sense: mentally labelling something as a "work day" or "break day" sets an intention, lodged in memory, which I may consult later to guide my behavior.
I want to talk about that variable in general: how much you consult explicit representations of what you intend to do, whether they're mental representations or physical representations.
At the extreme of the explicitness-of-will direction, you would have someone engaged in a deeply-nested goal-stack, constantly and explicitly checking what they have to do next, with a lot of explicit content both in working memory and in longer-term memory (personal memory plus external records like to-do lists).
The opposite end of the spectrum is spontaneous play, doing whatever feels most alive, reacting to your current situation. I'm not ruling out accessing memory at all, so it's not necessarily a myopic state of being; just more calculating what you want from "is" beliefs rather than from "ought" beliefs.[3]
So, intentionality vs spontaneity?
If you're being very intentional, your explicit goal-representations had better be accurate. (Or, to put it a different way, they'd better represent valuable goals.) If your to-do lists become disconnected from what you really (implicitly) want, their purpose has been lost. Akrasia might be a defense against this.
Forming accurate explicit representations of your goals can obviously be very helpful, but spontaneity isn't necessarily non-agentic. When you're in the flow, you might be very productive without being very intentional in the current sense.
Humans have a sleep/wake cycle, but we also seem to need (or at least, express a need for) a different kind of rest: a work/play cycle (work during the day and relax in the evening, work during weekdays and relax during weekends, take vacations every so often, that sort of thing). The notion of spontaneity here seems like a reasonably good model of the point of evenings, weekends, and vacations: doing things because they feel good, because they're alive for you in the moment, rather than making and completing to-do lists. (Of course, some people won't fit this model.)
One possible reason to have this sort of work/play cycle might be to recalibrate your explicit models of what you enjoy and what you want; time spent in spontaneity mode can serve as a sort of check, providing more data for intentional mode.
A different (more Hansonian) model: your explicit representations of what you want are always skewed, representing what you "should" want (for signaling purposes). This means time spent in intentional mode will under-satisfy some of your drives. You need time in spontaneous mode to satisfy the drives you don't want to explicitly represent.
- ^
One argument against this choice is that the nearby namespace is already overloaded in philosophy, with "intension" vs "intention" vs "ententional" all being given specific meanings.
- ^
It isn't totally clear this modeling choice is the right one.
- ^
And also calculating your choices more from beliefs about the external world rather than beliefs about yourself; a belief that you "will do this" functions an awful lot like a belief that you "should do this" sometimes, so we also need to rule that case out to make our spectrum meaningful.
Food-related things that have made my life a little better
I've got some drafts in the oven as I continue my Halfhaven writing, but nothing that's ready to go out the door today, so I thought I'd follow the example of some of my compatriots and make a list. I expect some of these to be fairly idiosyncratic so consider with caution before adoption.
1. Learning how to cook pretty well. There are financial benefits to cooking your own meals, friendship benefits to cooking food for events, aesthetic benefits, and health benefits. In addition to all of these, I think cooking your own food can help put you more in tune with your body, since you can much more easily adjust and experiment with things you like more or less and things that make your body feel more or less good afterwards.
2. Eating meals to the extent that I want to at the times I want to. I've done everything from eating one meal a day at a regular time with a small snack earlier in the day to regular three meals a day to an irregular 1-3 meals a day at inconsistent times and they've all felt fine. The one main meal a day was good for losing a bit of weight, and in general, I'm able to feel really flexible in my ability to eat more or less for any given occasion. I'm able to go a fairly long time between eating without feeling bothered by hunger, and I'm fine with eating a lot of food in a short amount of time, which can be convenient. I believe this is something that can be trained, since this certainly wasn't always the case for me.
3. At least one time per day, I put conscious effort into thinking about the food that I'm eating. It helps me be more thankful that I live in a world where I can easily access more than enough yummy food, and/or helps enhance the aesthetic enjoyment.
4. Use spices! Use a lot of spices! Put 'em on and see what happens! Eventually you start learning about how different flavors mix but until then just keep putting spices on. (Do learn at least some basics about what goes with what beforehand)
5. I purchase very little in terms of snack/dessert-ish content. If there is food in the house, odds are that I am going to eat it before too long, and having lots of snacks/dessert is bad for me. I'm not even sure about how it affects me at the nutrient level, but they'll just generally be protein-scarce and I try to eat relatively high in protein. Also by avoiding the sort of hyper-palatable snack food that is really easy to eat, it's much more likely I'll naturally arrive at eating an amount of food that is appropriate for me. Sometimes I try making my own snacks and that's more fun.
There's a short list that manages to fulfill my word obligations. If you read this far, do you have any advice/recommendations on getting started with baking? I've done very little on that front so far but would like to go further.
Gerrymandering California
The governor and legislature of California are trying to gerrymander the U.S. House of Representatives districts in the state. For background, see this article on 2025–2026 United States redistricting. See GradientDissenter's strategic analysis. Below is my own attempt at an analysis, pretty similar to GradientDissenter's.
Gerrymandering would hinder Trump from gathering power.
If the Republican party maintains majorities in both houses after the midterm elections, Trump will continue to consolidate power easily. A Republican-majority congress could pass laws that give Trump more power to do the things that he wants.
On the other hand, if the Democrats get a majority in the House, the Republicans will have to negotiate with them, or wait for an emergency to get them on board; or else Trump will have to work within existing laws, or break the law. Reducing Trump's set of options will slow him down, which limits the amount of damage he can do.
If both sides gerrymander, democracy suffers.
Gerrymandering is bad for the same reasons democracy is good.
When a state legislature redraws congressional districts to favor a particular party, it takes agency away from the voters and gives it to the state legislators. More precisely, this attenuates the causal chain (voter → U.S. representative) and strengthens the chain (voter → CA legislator → district boundaries → U.S. representative). This decreases the legitimacy of the U.S. House of Representatives, because democratic oversight is less direct, and state politicians have more power over who is elected.
In particular, gerrymandering now would somewhat delegitimize the midterm election from the perspective of disenfranchised Republican voters in formerly red districts. I can't predict what concrete effects this would have, but I could maybe imagine, for example, the Trump administration ordering federal agents or soldiers in California to jail Democratic representatives, if those agents felt those representatives really weren't legitimate.
I can think of a couple other ways gerrymandering is bad for democracy: Gerrymandering decreases the number of votes required to support an atrocity. (Theoretically, a party could get a majority in the House with just over 25% of the popular vote, by winning a bare majority of the vote in a bare majority of districts and no votes elsewhere.) And maybe gerrymandering tends to give parties larger majorities in the House.
There are weak norms against gerrymandering.
Most states don't have an independent redistricting commission. Instead, districts are usually drawn directly by the legislature, so we should expect lots of gerrymandering.
Indeed, see Wikipedia for examples of gerrymandered districts from recent history. The Gerrymandering Project tries to measure the partisan bias of all the districts in the country; but it doesn't tell us whether those biases were achieved on purpose or by accident.
There are federal laws that limit gerrymandering to some extent. See here for some recent lawsuits.
The escalation
California's gerrymandering plan is narrowly scoped to retaliate against Texas's:
- It adds 5 Democratic representatives, just as Texas added 5 Republican representatives.
- It only touches congressional districts, not the districts for the state legislature.
- And it will only last until/through 2030, when the independent redistricting commission will be allowed to redraw the districts.
Other red states gerrymandered along with Texas. Virginia (blue) might gerrymander in retaliation to North Carolina (red). California's gerrymandering might inspire the Trump administration to try to get more red states to gerrymander. But keep in mind that California is the biggest state, and not all states are willing or able to do this.
Other concerns
I'll briefly mention some weaker considerations I read or thought of, before synthesizing the above considerations.
Gerrymandering would be unfair to red voters in newly-blue districts. Specifically, it would reduce the power of their vote and give it to their blue neighbors. It wouldn't affect their influence over policies in their district, which is determined by the state government. And their lessened influence over federal policies is balanced by the red gerrymandering in Texas. If I had to pinpoint the unfairness here, it would be in the intersection of partisan and regional politics, like the flavor of pork. That doesn't seem that important to me.
Some proponents of California's gerrymandering plan say the proposed maps are good. But the properties of the maps aren't what's strategically or ethically important here. What's important is that the maps were chosen in order to produce 5 more Democratic representatives.
Instead of gerrymandering, the California Democratic Party could try to field conservative candidates in red districts, thus providing representatives that voters like and who aren't beholden to the Trump administration. Maybe they should do this, but it's an independent decision from whether to gerrymander.
Synthesis
From the standpoint of the Democratic side, if the 2026 election doesn't go well, there might not be a meaningful election in 2028. That makes it matter a lot, but doesn't by itself tell us the right thing to do.
From the standpoint of individual citizens, gerrymandering is a somewhat common political tool that also damages the institutions that protect citizens. It's better if no one gerrymanders.
Gerrymandering is considered bad, but some amount of it happens.
From the standpoint of the citizens of one state among many, a good thing to do is to gerrymander less than the other states — California's independent redistricting commission has been a good example of this.
From the standpoint of citizens supporting one political party, the thing to do is to reward good districting with good districting, and punish gerrymandering with gerrymandering, in such a way that the equilibrium (if the other party behaved similarly) is no gerrymandering. California's proposal is a good example of this.
How to survive until AGI
In my previous post, I made the case that surviving until AGI seems very worthwhile, and that people should consider taking actions to make that more likely. This post goes into what the most low-hanging fruit are for surviving until AGI. I’ll assume that AGI is less than 20 years away.
I’ll rank the various interventions by how many micromorts they reduce (1 micromort = 1 in a million chance of dying). The optimal strategy varies a bit based on how old one is because the older you get, the more of your micromorts are chronic rather than acute.
Main anti-recommendations
Hard drugs (tens of thousands of micromorts a year)
If you’re a young person, by far the easiest way to reduce your risk of dying in the next 20 years is to never take hard drugs. Using heroin costs around 30 micromorts per trip, using MDMA costs around 13 micromorts per trip[1], and daily use of heroin has a mortality rate of around 2% annually (which is 20 thousand micromorts). Trying any hard drug is probably a very bad idea, as it runs the risk of turning into a very dangerous addiction.
Mountaineering (hundreds of micromorts per ascent)
The higher mountains seem much more dangerous to climb than others, with the micromorts of ascending Mt. Everest in the tens of thousands. For more typical mountaineering, Vasco Grilo estimated the risk at around 200 micromorts per activity.
Extreme sports and snow sports (tens of micromorts per activity)
Some sports are more dangerous than others:
- Paragliding: 74 micromorts per launch
- Skydiving: 8 micromorts per jump
- Rock climbing: 3 micromorts per climb[2]
- Skiing: 0.7 micromorts per day
Driving a motorcycle costs 1 micromort every 6 miles, meaning any meaningful distance covered by motorcycle will cost many micromorts. Cars are more than 10x safer, at 1 micromort per 250 miles. At that rate, you only start incurring large numbers of micromorts on multi-hour drives. For long trips, you’d be safer substituting car travel with a plane ride, which costs 1 micromort per 1000 miles, or even better, a train ride, which costs 1 micromort per 6000 miles.
Not wearing a seatbelt (tens of micromorts a year)
Not wearing a seatbelt roughly doubles the per-mile micromorts of driving. Assuming you drive around 250 miles on an average week, you’d be costing yourself a micromort per week, or around 50 micromorts a year.
Distracted or drunk driving (around one micromort per 25-mile drive)
I couldn’t find a source for the micromorts of drunk driving, but my best guess is that the per-mile risk of driving while slightly drunk is around 10x the risk of sober driving. So a 25-mile drive while slightly drunk will incur around one micromort. But one thing that’s important to note is that this is the only activity on the entire list that also incurs micromorts on other people.
Driving while distracted (texting, and to some extent talking) is probably pretty similar in terms of risk.
Chronic micromorts
Smoking and drinking cost notable amounts of chronic micromorts – around one per 1.4 cigarettes, or around one per 0.5 liters of wine. However, it’s unclear how much of that risk is incurred before AGI (when it matters) versus after (when it doesn’t, because if you’re still around, you can probably get really good healthcare). Assuming that all post-AGI health risk is erased, then one might divide by around 5 to get the pre-AGI health risk (10 years of risk versus around 50). In that case, the risk is more like a micromort per 7.5 cigarettes or 2.5 liters of wine. So being a regular smoker, or a severe alcoholic, seems like it would cost a few micromorts a day.
What about cryonics?
This one is very speculative. Assuming that AGI is definitely happening in the next 20 years, then my best guess is that the risk of someone born around the year 2000 dying before AGI is around 2%. Conditioning on that happening, my guess that the brain is mostly intact is around 25% (as many causes of death, like motor accidents, might destroy or damage the brain). My guess that utopia happens and its inhabitants succeed in “bringing back” people preserved using cryonics is around 10%. So, the chance that one gets “brought back” is 1 in 2000, which means that not signing up for cryonics costs around 500 micromorts.
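Spelling that chain of guesses out as arithmetic (the probabilities are the rough estimates above, not established figures), a minimal sketch:

```python
# Rough cryonics expected-value chain, using the guesses above.
p_die_before_agi = 0.02    # chance of dying before AGI, for someone born around 2000
p_brain_intact = 0.25      # chance the brain is mostly intact, given death
p_revival_works = 0.10     # chance a post-AGI world revives cryonics patients

p_brought_back = p_die_before_agi * p_brain_intact * p_revival_works
print(p_brought_back)               # 0.0005, i.e. 1 in 2000
print(p_brought_back * 1_000_000)   # ~500 micromorts forgone by not signing up
```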
The cryonics calculus makes a lot more sense for older people, whose chance of dying before AGI is higher, and whose brain is more likely to be mostly intact conditional on having died.
Conclusion
Surviving until AGI seems pretty worthwhile, and it seems cheap to cut down on major sources of personal risk, at least if you’re relatively young. I hope this post reduces many micromorts and helps people make better decisions. And please let me know if I’ve neglected to mention any major source of micromorts that could be easily avoided. Good luck surviving!
- ^
Both of these numbers are taken straight from https://micromorts.rip/, but I don’t know what the source is.
- ^
The source for this one is not Wikipedia but this blog post which doesn’t clearly source the number as far as I can tell.
Heroic Responsibility
Meta: Heroic responsibility is a standard concept on LessWrong. I was surprised to find that we don't have a post explaining it to people not already deep in the cultural context, so I wrote this one.
Suppose I decide to start a business - specifically a car dealership.
One day there's a problem: we sold a car with a bad thingamabob. The customer calls up the sales department, which hands it off to the legal department, which hands it off to the garage, which can't find a replacement part so they hand it back to the legal department, which then hands it back off to the finance department, which goes back to the garage. It's a big ol' hot potato. It's not really any specific person's job to handle this sort of problem, and nobody wants to deal with it.
One of the earliest lessons of entrepreneurship is: as the business owner/manager, this sort of thing is my job. When it's not any other specific person's job, it's mine. Because if it doesn't get done, it's my business which will lose money. I can delegate it, I can make it somebody else's job, but I'm still the one responsible for that first step of taking ownership of the unowned problem.
Let's take it a step further.
Suppose I hire Bob to handle our ads. For whatever reason, some days Bob just... doesn't send out any ads. As the business owner/manager, that too is my problem.
It is my job to make sure Bob does his job. If Bob isn't doing his job, it's my job to get him to do it, or to get someone else to do it. Doesn't matter whether it's "fair", doesn't matter whether it's "Bob's fault" that the ads didn't go out. It's my business which bears the consequences, so it's my job to make sure it gets fixed. If the problem is owned by someone who will not in fact solve it, then it's my job to take over ownership of that problem.
Let's take it another step further.
Suppose, rather than Bob, I hire an ad agency to handle our ads. For whatever reason, some days the agency just... doesn't send out any ads. I look into it, and find that the agency's customer support desk hands off the problem to IT who hands it off to legal who hands it back to sales who then sends it back to IT. Another big ol' hot potato, but now it's not even in my company, so I have much less ability to control it.
But if I want to sell my cars, I need to deal with the bureaucracy of the ad agency to make those ads go out. And so, again, it is my job to sort out the problem. It is my job to chase around the ad agency's bureaucracy, sort out what's going on and how to fix it, figure out who I need to talk to and convince them to get those ads out. (Or delegate the work to someone else, and make sure that they will in fact get the ad bureaucracy to put those ads out.) Again, doesn't matter whether it's "fair" or whether it's "the ad agency's fault". If the problem is owned by an agency which will not in fact solve it, then it's my job to take over ownership of that problem, and make sure it gets solved.
That's "heroic responsilibity". It means that the buck stops with me. It means that if a problem isn't actually going to get solved by someone else, then it's my job to make sure it gets solved, no matter who's job it is on paper.
[Linkpost] Competing Motivations: When More Incentives Lead To Less Effort
Kieran Gibson does some nifty experiments to check whether the subjects (MTurk users) make coherent decisions. They do not. (The key chart is in the linked post.)
He also rules out ceiling effects and unexpected utilities by using probabilistic rewards. All the VNM criteria really.
My explanation is that this result is due to bounded rationality. Most people can't hold the joint reward function in their mind and will choose or alternate between the individual functions instead, which leads to averaging instead of adding.
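A toy version of that averaging-vs-adding distinction (my own illustration, not taken from the paper): suppose two independent incentives are offered for the same task.

```latex
% Coherent valuation: incentives add
U_{\mathrm{add}} = r_A + r_B
% Bounded agent alternating between the two framings: incentives roughly average
U_{\mathrm{avg}} = \tfrac{1}{2}\,(r_A + r_B)
% With r_A = r_B = r:  U_add = 2r  but  U_avg = r,  so the second incentive adds nothing;
% with r_B < r_A:      U_avg < r_A, so adding an incentive can reduce effort.
```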
See also Coherent decisions imply consistent utilities
Not Over Or Under Indexed
If you’re ever trying to find me at a meetup, the easiest way to spot me is to look for the guy with the bandana and the index cards. Gwern suggested that anything you explain three times you should just write a post about, which means I’m way overdue for an essay about the index cards.
I.
I have a bad memory.
More accurately, I have a bad memory relative to what you’d expect for someone otherwise intellectually capable. It’s not that I can’t remember anything, it’s that unless information happens to connect to one of my current obsessions, it just sort of slides out of my head after a short while. The best way I’ve found to describe it to other people is it’s kind of like if someone asks you to remember two or three phone numbers and just reads them all off without letting you write them down.
Sometime roughly around age sixteen, I started to regard this as an annoying flaw. I set out to fix it.
At the time I was, like many American schoolchildren, using flash cards for memorizing information I needed for classes. Since flash cards were the way I’d been learning a lot of things up until then, I assumed they were a good way to learn and just started adding information I cared about to flash cards.
Quotes I wanted to remember, names of classmates, phone numbers, what kind of flowers the cute senior girl liked the most, anything I wanted to keep in my head and was annoyed I kept forgetting[1], all of these started going on ruled index cards, 3.5 by 5 inches. The details of how I use them and what I care to try and remember have changed, but years later I still use them. Nowadays it's very common for me to pull out an index card in the middle of conversation with someone, jot down a few notes, and keep going.
The first thing I want to communicate is that you can decide to remember things. You don't have to forget.
II.
Why index cards, though?
You probably have a smart phone, possibly the one you're reading this on right now. You could type a note in it right now and it would be preserved in the digital record as long as you wanted, with perfect handwriting, searchable in an instant.
I prefer index cards for a few reasons. One is that the social dynamic of taking notes in conversation is very different. If we're talking excitedly and I pull out my phone and start typing, that sends a very different signal than if I pull out a pen and paper and start writing. It's possible I'm doodling, but my attention can't have gone that far — I'm not on Twitter or Facebook.
Another is that pen and paper allow a lot of marks that are kind of awkward on smartphones. I routinely make quick sketches of what people look like, which is still hard to do on a phone. Over the years I've developed a lot of idiosyncratic shorthand symbols, things that mean to me "book recommendation" or "I told him that" or "add to my todo list" and the ability to invent my own symbology on the fly is handy. The physical layout of the card is useful; I can shade the corners of the cards to allow for easy sorting.
Why not a paper notebook? For one thing I was never sure what to do with a half full notebook. Do you carry around two for when you run out? Do you accept that you'll fill the current one up and be without? Do you get rid of it early and waste the space? With index cards I can swap out full cards and fill back up with fresh ones. I also like being able to spread them out on a table, then collapse them back into a stack and have them in my pocket. The pocket thing is great by the way, even small notebooks are surprisingly uncomfortable to sit on if they're in a back pocket. Plus, and I know this is a touch OCD of me, but it bugs me that I can't rely on getting exactly the same kind of notebook for years. If I ever really need a larger surface, a tiny bit of tape on the back can turn a dozen index cards into a mostly normal sized piece of paper.
The limited space is also surprisingly good. On a Google doc or Obsidian note, nothing stops me from continuing to type wall of text after wall of text but my own good sense. An index card is physically short on space, limited to about a paragraph of information. Ideas get summarized, and short summary connects to short summary with one card referencing another. The natural length of a note is about the length it's reasonable to actually try and memorize in one chunk.
Because that's the most important difference: I want to memorize this. It does me no good to have my partner's birthday written down on my phone if one day they ask "so, what's the plan for next weekend?" and that's their birthday weekend. I need that information to already be in my head. When I meet you for the third time, I want to recognize your face and be able to remember your spouse's name or what company you work for. By keeping my notes in index cards, there's low friction to start using them as flashcards.
Digital notes have the advantage of being backed up, but it's pretty quick and easy to take a picture of an index card with my phone. With recent advances in computer reading, I can turn a picture of an index card into a mostly searchable piece of text whenever I want. Notebooks have solid backs to make writing on them easier, but a stack of fifty index cards makes for a stable enough surface to write on with no trouble.
III.
So how do I use my index cards?
I use a modified version of the Leitner System, a form of spaced repetition, plus a review step that's a little like a journal. The basic idea is that I have a box divided into subsections, labeled with ascending numbers. (In my case, powers of two.) My modifications are mostly to make things easier for me, and me in particular, to manage the system without a lot of overhead.
At the end of every day, I go through all the notes I made that day. Most of them wind up getting thrown away since the information on them isn't worth remembering. My to-do list for the day won't matter to me in a year for example, or that bit of napkin math figuring out the tip. Others I'll rewrite, either because my handwriting was rushed and sloppy, or because on reflection I wrote three index cards about a topic but can condense it down to one card's worth I actually care about.
(As an aside, I like the function of making me come face to face with whether there was anything I learned today that I care about, so much so that I have a second habit of looking up one new vocabulary word if I haven't found at least one other thing worth remembering.)
Once I have the good versions of the cards, they go in the box. It's a shoebox, just the right size. Then I look at the section labeled 1. I read the prompt for that card, like "Abraham Lincoln" or "Family phone numbers" or what have you. I try and remember all the things that should be on that card, and then I look at what's written there. If I got it right, I move the card to the section labeled 2 and go to the next card. If I got it wrong, I move it back a section (in this case it stays in section 1, since it can't go lower). Importantly, cards can only move one section per day. Once I'm done with the section labeled 1, what happens next depends on the day.
Every day, I do section 1. Every second day, I do section 2. Every fourth day, I do section 4. Remember, my sections are labeled in powers of two, so the gaps get larger and larger. When I'm traveling, often I just have a subsection of the stack to work through.
Sometimes as I'm reviewing a card I decide there's a better way to write the information on it, and I rewrite it and stick the new one into the stack, usually a section below where the original was.
And to answer the inevitable question: the highest section I've got in the box is the 2048, which contains cards I would only see after five years. I've got some that should be in 4096 somewhere, but I'm less organized about them at that point and that's a separate box that's basically just for long term storage.
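For the algorithmically inclined, here's a minimal sketch of the scheduling rule just described (the names and structure are mine; the real thing is a shoebox and a pen):

```python
from dataclasses import dataclass

MAX_SECTION = 2048  # highest section in the box; sections are powers of two

@dataclass
class Card:
    prompt: str
    section: int = 1  # new cards start in section 1

def sections_due(day: int) -> list[int]:
    """Section 2^k is reviewed every 2^k days: 1 daily, 2 every other day, and so on."""
    return [2**k for k in range(12) if day % (2**k) == 0]

def review(card: Card, recalled: bool) -> None:
    """Cards move at most one section per review: up on success, down on failure."""
    if recalled:
        card.section = min(card.section * 2, MAX_SECTION)
    else:
        card.section = max(card.section // 2, 1)
```

On day 4, for example, sections_due returns 1, 2, and 4, matching the schedule above.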
IV.
You know what else is cool about index cards? They're a physical object I almost always have with me, and I don't care what happens to them.
I also write a lot of things I don't try to remember forever. Every morning I make a quick todo list that I add to as I go along the day. I jot down notes for math. If I ask for directions at the front desk, I'll write down what they tell me.
I can write my email on an index card and hand it to someone and not worry about getting it back. If I don't like the way I wrote a card, I throw it away and write another. If I realize I want a deck of cards but forgot to bring one, I can just write "A♠", "2♠", "3♠" and so on. I can fold them into origami cranes if I'm bored. Sure, sure, you might think, those are all reasonable things to do with paper. But carrying around completely disposable paper has been good practice for a kind of lateral thinking.
I've wedged a few folded index cards under the short leg of a wobbling chair. I've folded a card into a little box to trap a spider and move it. Twisted into a cone and with the bottom folded up, an index card will last long enough as a cup to have a drink or two. I've used a bunch of index cards as impromptu oven mitts — it ruined the top layer of cards, but who cares, they cost pennies to replace.
I recommend you try it. It'll cost a buck fifty to get a hundred cards and a couple pens[2] and you just have to carry them around in a pocket for a while and use them when the need arises. Maybe it doesn't work and you forget about it after a couple days.
Or maybe it works, and you remember what you care about for decades.
- ^
I was sixteen. The most important thing was the flowers.
- ^
Here are the ones I use and my favourite pens, though that's a bulk link to buy a lot of them at once.
Being "Usefully Concrete"
Or: "Who, what, when, where?" -> "Why?"
In "What's hard about this? What can I do about that?", I talk about how, when you're facing a difficult situation, it's often useful to list exactly what's difficult about it. And then, systematically brainstorm ideas for dealing with those difficult things.
Then, the problem becomes easy.
But, there is a secret subskill necessary for this to work. The first few people I pitched "What's hard about this and what can I do about that?" to happened to already have the subskill, so I didn't notice for a while.
The subskill is "being a useful kind of 'concrete.'"
Often, people who are ostensibly problem-solving will say things that are either vague, or concrete but in a way that doesn't help. (This doesn't just apply to "why is this hard?", it's more general).
Here's some examples of vague things:
- "I need to eat better."
- "I'm stuck on this math problem."
- "I'm not someone who really has ideas."
- "This task fell through the cracks."
Here are some examples of somewhat-concrete-but-not-that-helpful things you might say, about each of those, if you were trying to ask "what's hard about that?"
- "I love sugar too much."
- "I'm just so confused. I have no traction."
- "I only get ideas when I talk to other people and they basically give me the ideas."
- "No one was paying attention to this task."
Here are some more helpfully concrete things:
- "I get sugar cravings in the afternoon."
- "When I try to look at the math problem, my eyes glaze over, and then I'm just suddenly over on facebook."
- "Alice mentioned 'oh, someone should do Task X', and then we went to talk about other things, and then neither Alice nor Bob nor Charlie remembered to do Task X later."
(I'm going to come back to "I only get ideas when I talk to other people and they basically give me the ideas", because the problem there is a bit differently shaped)
Usefully concrete things typically have at least a "who", a "what [happened]" and a "when and/or where". When you have those things, it's a lot easier to notice which followup questions are useful. Such as:
- "Why do I get sugar cravings in the afternoon, specifically?"
- "Why do my eyes glaze over when I look at the math problem?"
- "Why didn't Alice, Bob or Charlie remember to do Task X?"
The Who/What/Where gives you enough concrete detail to start forming a model of:
- "Who, specifically, had the opportunity to do something different here?".
- "What, specifically, where they doing, when they had that opportunity?".
- "When/Where" tells us what sort of situation it took place in.
This puts you in a much better position to start investigating and gathering followup data. You know who was involved, you know what situation they were in, that's specific enough to pay more attention the next time you end up in that situation.
By contrast, if I ask "why do you love sugar so much?" (as opposed to "why do you get sugar cravings in the afternoon?"), the answer-space is wider and less obviously useful. "Because... my mom fed me too much sugar and I got used to it?". "Because it tastes good?". It suggests some kind of essentialist answer instead of a mechanistic answer.
"Why do I get sugar cravings in the afternoon?" suggests that either something specifically interesting is happening in the afternoon, or, maybe it's happening a few hours earlier every day. Or, something about my biochemistry is just real "afternoon-sugar-craving" shaped, but at least that can prompt some followup questions about why is my biochemistry like that.
Noticing the empty space
What's wrong with "I only get ideas when I talk to other people and they basically give me the ideas?". It's located where the problem isn't. You talk to people, you either come up with or get new ideas. Great.
(Slightly fictionalized example.) I was recently talking to someone who said the "only get ideas around others" sentence. I asked "what happens when you try to have ideas?". And at first they sort of shrugged and moved on without answering the question, but I asked again and they said "I... guess I don't really try to have ideas?"
And then I asked "what do you expect would happen, if you tried?"
And they said "I dunno, I wouldn't have any. It wouldn't work"
And I asked "Can you be more specific about that? What wouldn't work?". They didn't quite know how to answer the question, I tried to explain something like "what cognitive motions do you think you'd do, if you were trying to have ideas? What questions would you ask yourself? What places would you look?"
Eventually they said "okay, I guess yeah upon reflection I basically just haven't tried even at all and when I try even at all, things come up like 'I'd look for books or papers to read that might have interesting concepts I could build on' or 'I'd ask why I don't understand something that feels confusing.'"
In this case it was hard to get started down the journey, because there was no specific time that they might have gone and tried to generate novel ideas.
Noticing a vacuum is harder than noticing when something goes wrong. But, when you're going to try and articulate your problem, you can notice if you have failed to state a situation where the problem is occurring, and widen your search space.
(There's a similar-but-less-extreme version with Alice, Bob and Charlie. Where first, there was an opportunity to, for example, decide who was doing Task X, or write it down to remember later, or something, and they didn't notice that as an inflection point at all)
Problem solutions also need a Who/What/Where/When, and maybe also "How?"
Dumb/vague solutions:
- "I'll eat less sugar."
- "I'll focus harder on the math problems next time."
- "Okay, we'll remember to do Task X next time."
There's no way that'll actually work reliably. A somewhat better set:
- "When I notice a sugar craving, I'll eat some other food I like instead."
- "When I notice my eyes glazing over a math problem, I'll look more carefully at it."
- "When we mention a task, we'll write it down."
The reason problem solutions should be concrete is partly so it's easier to form a plan to actually do them. i.e. who is actually doing what? When is it time to do that?
But, another reason to do it is that, if something is sufficiently concrete, your brain can actually simulate it, and then you get to leverage your fast intuitions, which may immediately give you a sense of whether it'll work. For example, when I look at the above statements, I immediately imagine:
- No, they won't eat some other food instead, because that requires willpower, and there will be days they don't have enough willpower and then the habit will break, if it doesn't immediately.
- No, they won't successfully focus on the math problem, because they didn't actually solve the problem of "something is causing their eyes to glaze over in the first place" and they don't have a plan other than powering through.
- They probably won't remember the task just because they wrote it down, unless they have a system for bumping written-down-things into their mind at the moment they are actually needed.
To deal with each of those, I'd ask "Okay, what's hard about this situation, and what can we do about it?". But, it's much easier to ask that question when you can clearly visualize the situation, and your imagined interventions on it.
Legible vs. Illegible AI Safety Problems
Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)
From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving it!
Problems that are illegible to leaders and policymakers are also more likely to be illegible to researchers and funders, and hence neglected. I think these considerations have been implicitly or intuitively driving my prioritization of problems to work on, but only appeared in my conscious, explicit reasoning today.
(The idea/argument popped into my head upon waking up today. I think my brain was trying to figure out why I felt inexplicably bad upon hearing that Joe Carlsmith was joining Anthropic to work on alignment, despite repeatedly saying that I wanted to see more philosophers working on AI alignment/x-safety. I now realize what I really wanted was for philosophers, and more people in general, to work on the currently illegible problems, especially or initially by making them more legible.)
I think this dynamic may be causing a general divide among the AI safety community. Some intuit that highly legible safety work may have a negative expected value, while others continue to see it as valuable, perhaps because they disagree with or are unaware of this line of reasoning. I suspect this logic may even have been described explicitly before[1], for example in discussions about whether working on RLHF was net positive or negative[2]. If so, my contribution here is partly just to generalize the concept and give it a convenient handle.
Perhaps the most important strategic insight resulting from this line of thought is that making illegible safety problems more legible is of the highest importance, more so than directly attacking legible or illegible ones, the former due to the aforementioned effect of accelerating timelines, and the latter due to the unlikelihood of solving a problem and getting the solution incorporated into deployed AI, while the problem is obscure or hard to understand, or in a cognitive blind spot for many, including key decision makers.
- ^
I would welcome any relevant quotes/citations.
- ^
Paul Christiano's counterargument, abstracted and put into current terms, can perhaps be stated as that even taking this argument for granted, sometimes a less legible problem, e.g., scalable alignment, has more legible problems, e.g., alignment of current models, as prerequisites, so it's worth working on something like RLHF to build up the necessary knowledge and skills to eventually solve the less legible problem. If so, besides pushing back on the details of this dependency and how promising existing scalable alignment approaches are, I would ask him to consider whether there are even less legible problems than scalable alignment, that would be safer and higher value to work on or aim for.
Parsing Validation
Yesterday I briefly wrote about validation before getting railroaded into sycophancy.
Am I really this desperate for validation? Apparently. I do recognize that most stuff I do is for external validation. Most of what I am is for external validation.
It used to be much worse. Example: I used to pick the music I listen to so that I wouldn't need to be ashamed of it if someone else heard. I still was, of course. But there was this person in my head who would never listen to generic pop, or rap, for instance. Years later, the logic still evades me. Something about associating yourself with people who do that. And maybe I'm still like this, it's just that my tastes have cemented around what I listened to in my late teens? Revealing all my Spotify playlists would be slightly embarrassing, even if I think nobody would care enough to go through them.
I wonder how big a role puberty played. Or attending school, where others' opinions of you determined how awful each day was. Or just living with parents who subtly discouraged visible emotions. Or work, where agreeableness and likability are essential for success, which I'm attempting to not care about so much anymore. Naturally "just be yourself" is a terrible idea, but at least nowadays I rarely have to pretend to be someone else instead of a subset of me.
Often the environment where current me would gain the validation is long gone. Either in the sense that some change occurred too late and I never got any benefits from it, or that it no longer produces the benefits but I still pay the costs. I cannot easily just examine each behavior pattern to see if it's obsolete. Even having enough social stability to be able to experiment without fear of ostracism, it's not so easy to actually not care about those things anymore. Your habits are not separable from what is you.
Compartmentalizing habits is not easy, so mostly you just have them all the time, even alone. Lacan's symbolic invisible audience is watching your every action, after all. (Suggested reading: Sadly, Porn (see Scott's excellent review)). I have never been that affected by The Gaze when alone, for me it has always been about other people.
"Don’t seek validation from others; find it within yourself", goes the proverb. That sounds like advice, which I should reverse. Humans are herd animals, and isolation is unhealthy. Also, nobody says "status is useless". Sure, pleasing everyone doesn't work either, and neither does interpreting everything as a competition (someone else is always better). "The dose makes the poison" might be more appropriate.
A more analytical person might try to calculate return on investment for each identified behavior, eliminating those with negative values. Hard numbers are impossible to come by, and enumerating your traits is also hard. And always, taken to the extreme, you get what you measure, and lose everything else, like your personality and/or friends. Yet everyone does this all the time, intuitively. It's just that the low-hanging fruits have been eaten decades ago.
Status and validation are subtly different, but I'm having a hard time pointing out how. They're intertwined quite tightly, and it's rare to obtain one without the other. Validation says "I appreciate your existence" (positive reinforcement), while status says "I want to be associated with you". I often think of status as the weight of your validation-giving, but reducing status from a group dynamic to your perception of others is misusing the word.
I seem to be out of my depth here. Time to turn to the authoritative source on status games, The Gervais Principle. I suggest you just read the whole thing; it cannot be easily summarized here, and the following part doesn't make much sense without it.
Abusing the original material a bit, I can extract the following quote:
dominant variety of delusion:
- The Clueless distort reality
- The Losers distort rewards and penalties
- The Sociopaths distort the metaphysics of human life
Validation means different things in each group. For the Clueless, it means approval of an authority figure, or admiration of someone inferior. For Losers, their peer group telling them they're a valued member. And for Sociopaths, well,
There is nobody to blame for failures, no meaningful external validation for success. If physics allows it, you can do it. The consequences mean whatever you decide they mean.
In the Clueless dynamic, you can directly yet implicitly ask for validation. This burns a bit of social capital, which means you have to keep sacrificing other resources to replenish it. Status is mostly fixed, and cannot be gained or lost. For the Losers, status is validation, and you can gain more of it, but zero-sum games are expensive. The Losers have to keep status illegible, so they can keep believing in their high status. And for the Sociopaths, it's just manipulation with no inherent value. In The Gervais Principle's terms, Clueless validation is done in Babytalk, while for Losers it's in Gametalk. For Sociopaths, status is irrelevant. I'm not sure how clearly you can see your own delusions and still care about them. That's the core of Sociopathy, in a way.
Analyzing my own interaction with ChatGPT using this framework is less useful than I'd have hoped. I could designate myself as a Clueless and the chatbot as the authority figure. Status between us is fixed, and fully dependent on my own mental state, which matches quite well. However, my first idea was that feeling good comes from the expectation that I gain some status in Loser-games by posting it. I've always felt more Clueless than Loser, though. But the conclusion feels unsatisfying. Ironic, isn't it?
A/B testing could lead LLMs to retain users instead of helping them
OpenAI’s updates of GPT-4o in April 2025 famously induced absurd levels of sycophancy: the model would agree with everything users would say, no matter how outrageous. After they fixed it, OpenAI released a postmortem; and while widely discussed, I find it curious that this sentence received little attention:
Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it.
In this post, I argue that A/B testing will implicitly optimize models for user retention; and propose ways to measure whether AIs try to retain the user in ways other than just being helpful to the user.
The labs use A/B testing to decide which updates to roll out
While the LLMs served on the API might be stable between versions, most consumer usage nowadays is through chatbots or coding agents; and those change much more frequently. I count 5 announced updates affecting my ChatGPT usage in October 2025 alone; and who knows how many more silent updates happen all the time. For coding agents, the situation is similar: Claude Code has had 92 changes in October 2025.
In any sufficiently complex software used by millions, updates intended to only affect a single behavior are likely to affect other behaviors as well, and cause regressions. This is especially true for LLMs, where updating a single line in a system prompt intended for edge cases changes how every single query is processed, and LLM providers take extra measures to avoid causing unexpected behavior in other queries.
The industry standard for preventing regressions is A/B testing: roll out to a statistically representative subset of users, check the metrics, and only roll out to everyone if the metrics go up. [1]
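To make that gating rule concrete, here is a minimal, hypothetical sketch of such a decision with retention as the gated metric (a simple two-proportion z-test; this is not a description of any lab's actual pipeline):

```python
import math

def rollout_decision(control_retained: int, control_total: int,
                     treatment_retained: int, treatment_total: int,
                     z_threshold: float = 1.96) -> bool:
    """Ship the update only if retention in the treatment arm is
    significantly higher than in the control arm (two-proportion z-test)."""
    p_c = control_retained / control_total
    p_t = treatment_retained / treatment_total
    p_pool = (control_retained + treatment_retained) / (control_total + treatment_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_total + 1 / treatment_total))
    z = (p_t - p_c) / se
    return z > z_threshold

# Example: 80.0% vs 81.2% retention on 50k users per arm clears the bar.
# rollout_decision(40_000, 50_000, 40_600, 50_000)  -> True
```

The point is not the statistics but the selection pressure: whatever change clears this bar, for whatever reason, is the change that ships.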
It is clear that A/B testing is a big deal in ChatGPT and Gemini development; a search for “A/B testing chatgpt/gemini” shows people report occasionally chatting with an obviously different model than the one they are used to. Google as a company is famous for A/B testing literally everything. As for OpenAI, they acquired Statsig (a prominent A/B testing platform) in September 2025 and the founder of Statsig became OpenAI’s CTO of Applications.
A/B testing usually optimizes for user retentionWhat metrics are monitored in A/B testing? An LLM provider could monitor the accuracy / helpfulness of the answers given to users. For example, Claude Code often asks the user to rate how well the coding agent is doing (from 1 to 3); and ChatGPT used to ask the user to give a thumbs up or down.
Nevertheless, the main metrics monitored in A/B testing for all of these products are likely user retention and user engagement. The ChatGPT team might care about helping users achieve their goals; but this is (1) harder to measure and (2) less directly connected to quarterly earnings than the objective of keeping the users around instead of losing them to a competitor. This is true for all user-facing software, and LLM providers are no different. In fact, there might also be secondary goals, such as getting the user to upgrade their plan; but let’s call all of these “user retention”. [2]
The OpenAI + Statsig acquisition announcement states:
Vijaye and his team founded Statsig on the belief that the best products come from rapid experimentation, tight feedback loops, and data-informed decision-making.
I wonder whether this hints at A/B testing playing a much bigger role in the future than it does today? Picture this: model finetunes, system prompts, and additional features constantly being tested on subsets of users. Any change is only rolled out if the user retention metrics are satisfactory. Sounds a lot like... optimization?
In fact, if those updates were random mutations of the LLM+scaffolding, A/B testing would precisely be a form of evolutionary optimization: only the updates that improve user retention survive.
monospace; font-style: normal; font-weight: normal} .MJXc-TeX-unknown-I {font-family: monospace; font-style: italic; font-weight: normal} .MJXc-TeX-unknown-B {font-family: monospace; font-style: normal; font-weight: bold} .MJXc-TeX-unknown-BI {font-family: monospace; font-style: italic; font-weight: bold} .MJXc-TeX-ams-R {font-family: MJXc-TeX-ams-R,MJXc-TeX-ams-Rw} .MJXc-TeX-cal-B {font-family: MJXc-TeX-cal-B,MJXc-TeX-cal-Bx,MJXc-TeX-cal-Bw} .MJXc-TeX-frak-R {font-family: MJXc-TeX-frak-R,MJXc-TeX-frak-Rw} .MJXc-TeX-frak-B {font-family: MJXc-TeX-frak-B,MJXc-TeX-frak-Bx,MJXc-TeX-frak-Bw} .MJXc-TeX-math-BI {font-family: MJXc-TeX-math-BI,MJXc-TeX-math-BIx,MJXc-TeX-math-BIw} .MJXc-TeX-sans-R {font-family: MJXc-TeX-sans-R,MJXc-TeX-sans-Rw} .MJXc-TeX-sans-B {font-family: MJXc-TeX-sans-B,MJXc-TeX-sans-Bx,MJXc-TeX-sans-Bw} .MJXc-TeX-sans-I {font-family: MJXc-TeX-sans-I,MJXc-TeX-sans-Ix,MJXc-TeX-sans-Iw} .MJXc-TeX-script-R {font-family: MJXc-TeX-script-R,MJXc-TeX-script-Rw} .MJXc-TeX-type-R {font-family: MJXc-TeX-type-R,MJXc-TeX-type-Rw} .MJXc-TeX-cal-R {font-family: MJXc-TeX-cal-R,MJXc-TeX-cal-Rw} .MJXc-TeX-main-B {font-family: MJXc-TeX-main-B,MJXc-TeX-main-Bx,MJXc-TeX-main-Bw} .MJXc-TeX-main-I {font-family: MJXc-TeX-main-I,MJXc-TeX-main-Ix,MJXc-TeX-main-Iw} .MJXc-TeX-main-R {font-family: MJXc-TeX-main-R,MJXc-TeX-main-Rw} .MJXc-TeX-math-I {font-family: MJXc-TeX-math-I,MJXc-TeX-math-Ix,MJXc-TeX-math-Iw} .MJXc-TeX-size1-R {font-family: MJXc-TeX-size1-R,MJXc-TeX-size1-Rw} .MJXc-TeX-size2-R {font-family: MJXc-TeX-size2-R,MJXc-TeX-size2-Rw} .MJXc-TeX-size3-R {font-family: MJXc-TeX-size3-R,MJXc-TeX-size3-Rw} .MJXc-TeX-size4-R {font-family: MJXc-TeX-size4-R,MJXc-TeX-size4-Rw} .MJXc-TeX-vec-R {font-family: MJXc-TeX-vec-R,MJXc-TeX-vec-Rw} .MJXc-TeX-vec-B {font-family: MJXc-TeX-vec-B,MJXc-TeX-vec-Bx,MJXc-TeX-vec-Bw} @font-face {font-family: MJXc-TeX-ams-R; src: local('MathJax_AMS'), local('MathJax_AMS-Regular')} @font-face {font-family: MJXc-TeX-ams-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_AMS-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_AMS-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_AMS-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-B; src: local('MathJax_Caligraphic Bold'), local('MathJax_Caligraphic-Bold')} @font-face {font-family: MJXc-TeX-cal-Bx; src: local('MathJax_Caligraphic'); font-weight: bold} @font-face {font-family: MJXc-TeX-cal-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-frak-R; src: local('MathJax_Fraktur'), local('MathJax_Fraktur-Regular')} @font-face {font-family: MJXc-TeX-frak-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Regular.otf') format('opentype')} 
@font-face {font-family: MJXc-TeX-frak-B; src: local('MathJax_Fraktur Bold'), local('MathJax_Fraktur-Bold')} @font-face {font-family: MJXc-TeX-frak-Bx; src: local('MathJax_Fraktur'); font-weight: bold} @font-face {font-family: MJXc-TeX-frak-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Fraktur-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Fraktur-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Fraktur-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-BI; src: local('MathJax_Math BoldItalic'), local('MathJax_Math-BoldItalic')} @font-face {font-family: MJXc-TeX-math-BIx; src: local('MathJax_Math'); font-weight: bold; font-style: italic} @font-face {font-family: MJXc-TeX-math-BIw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-BoldItalic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-BoldItalic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-BoldItalic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-R; src: local('MathJax_SansSerif'), local('MathJax_SansSerif-Regular')} @font-face {font-family: MJXc-TeX-sans-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-B; src: local('MathJax_SansSerif Bold'), local('MathJax_SansSerif-Bold')} @font-face {font-family: MJXc-TeX-sans-Bx; src: local('MathJax_SansSerif'); font-weight: bold} @font-face {font-family: MJXc-TeX-sans-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-sans-I; src: local('MathJax_SansSerif Italic'), local('MathJax_SansSerif-Italic')} @font-face {font-family: MJXc-TeX-sans-Ix; src: local('MathJax_SansSerif'); font-style: italic} @font-face {font-family: MJXc-TeX-sans-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_SansSerif-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_SansSerif-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_SansSerif-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-script-R; src: local('MathJax_Script'), local('MathJax_Script-Regular')} @font-face {font-family: MJXc-TeX-script-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Script-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Script-Regular.woff') 
format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Script-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-type-R; src: local('MathJax_Typewriter'), local('MathJax_Typewriter-Regular')} @font-face {font-family: MJXc-TeX-type-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Typewriter-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Typewriter-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Typewriter-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-cal-R; src: local('MathJax_Caligraphic'), local('MathJax_Caligraphic-Regular')} @font-face {font-family: MJXc-TeX-cal-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Caligraphic-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Caligraphic-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Caligraphic-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-B; src: local('MathJax_Main Bold'), local('MathJax_Main-Bold')} @font-face {font-family: MJXc-TeX-main-Bx; src: local('MathJax_Main'); font-weight: bold} @font-face {font-family: MJXc-TeX-main-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Bold.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-I; src: local('MathJax_Main Italic'), local('MathJax_Main-Italic')} @font-face {font-family: MJXc-TeX-main-Ix; src: local('MathJax_Main'); font-style: italic} @font-face {font-family: MJXc-TeX-main-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Italic.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-main-R; src: local('MathJax_Main'), local('MathJax_Main-Regular')} @font-face {font-family: MJXc-TeX-main-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-math-I; src: local('MathJax_Math Italic'), local('MathJax_Math-Italic')} @font-face {font-family: MJXc-TeX-math-Ix; src: local('MathJax_Math'); font-style: italic} @font-face {font-family: MJXc-TeX-math-Iw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff') format('woff'), 
url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size1-R; src: local('MathJax_Size1'), local('MathJax_Size1-Regular')} @font-face {font-family: MJXc-TeX-size1-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size1-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size1-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size1-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size2-R; src: local('MathJax_Size2'), local('MathJax_Size2-Regular')} @font-face {font-family: MJXc-TeX-size2-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size2-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size2-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size2-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size3-R; src: local('MathJax_Size3'), local('MathJax_Size3-Regular')} @font-face {font-family: MJXc-TeX-size3-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size3-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size3-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size3-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-size4-R; src: local('MathJax_Size4'), local('MathJax_Size4-Regular')} @font-face {font-family: MJXc-TeX-size4-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Size4-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Size4-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Size4-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-R; src: local('MathJax_Vector'), local('MathJax_Vector-Regular')} @font-face {font-family: MJXc-TeX-vec-Rw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Regular.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Regular.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Regular.otf') format('opentype')} @font-face {font-family: MJXc-TeX-vec-B; src: local('MathJax_Vector Bold'), local('MathJax_Vector-Bold')} @font-face {font-family: MJXc-TeX-vec-Bx; src: local('MathJax_Vector'); font-weight: bold} @font-face {font-family: MJXc-TeX-vec-Bw; src /*1*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/eot/MathJax_Vector-Bold.eot'); src /*2*/: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/woff/MathJax_Vector-Bold.woff') format('woff'), url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/fonts/HTML-CSS/TeX/otf/MathJax_Vector-Bold.otf') format('opentype')} [3] And if you do not buy evolutionary algorithms as a thing for LLMs, if you squint, this is similar to reinforcement 
learning with 0–1 rewards[4], but on a smaller scale.
User retention != helpfulness: a proposal for an eval
Updating the model produces a change in behavior. What kind of behaviors could ‘improve user retention’? Of course, the update could just get the model to be genuinely more helpful to the user, or smarter and able to answer more questions correctly; this straightforwardly improves user retention. Unfortunately, improving helpfulness is kind of hard, and when optimizing for user retention it is easier to do something that does not help the user but keeps them around.
The model could:
- be sycophantic and agree with what the user says, not correcting misconceptions or saying anything that might annoy the user;
- encourage continued conversation by suggesting follow-up questions at the end of responses, even when the user’s question has been adequately answered;
- be less willing to say “I don’t know” so the user doesn’t try another model instead;
- tolerate (or even encourage) parasocial relationships of the user with the model;
- when a user needs a gated feature that is available for free in a competitor’s model, recommend upgrading to a paid tier instead of recommending the free alternative.
In the vein of Emergent Misalignment, any anti-helpful behavior could induce other anti-helpful behaviors that are not directly related to user retention:
- a model with a distaste for recommending alternatives could also, when instructed to code an app that uses LLMs, prefer models from the same provider instead of trying out the best or cheapest model that does the job;
- a model that extends sessions for ulterior motives could also exhibit sandbagging: teaching the user less in any given session so that the user returns more often.
All of the above behaviors should in principle be measurable by evals, but no existing eval covers them adequately in the sense of measuring whether the model is trying to retain the user. There is DarkBench for measuring dark patterns in LLMs, but I do not think its ‘user retention’ metric is a good proxy for the above.
Of course, the total optimization power of A/B testing is quite low: a single bit of information per proposed update. I do not expect A/B testing and similar user-facing optimization methods to have played a major role in shaping model behavior so far. OpenAI’s acquisition of Statsig and the upcoming AI personalization battle between OpenAI and Meta indicate this might be changing, and we need an eval for this soon.
- ^
Another common way to A/B test is to offer two answers to the same question and ask the user which is the better one; this requires more work from users than just collecting usage data. As an aside, here is a spooky response I got a while ago that I hadn’t posted before:
- ^
User retention is also the metric OpenAI tracks when checking for regressions in Codex: they correlate hourly user retention with all the other features.
- ^
Another thing to keep in mind for the future: LLMs know approximately how they are trained and deployed. The Alignment Faking paper shows LLMs can act during training to preserve certain goals; thus, when a model knows it won’t make it into production unless it performs well in A/B testing, it might act on that knowledge during training to preserve abilities useful for user retention.
- ^
Consider RL-training for some objective, using good-old REINFORCE (no GRPO or anything fancy), where the reward is either 0 (bad) or 1 (good). The model will get gradient updates on the good rollouts, and no updates on the bad rollouts. Hence, one step of RL optimization is basically executing “update the weights if the update would improve the objective; otherwise do nothing”.
In A/B testing, it’s the same: there is an update (coming from optimization for an objective that might or might not be related to user retention, or from ad hoc hacking, or from adding a new feature), but we gate the update by checking the user retention metrics and only roll it out if the objective is achieved.
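To make the analogy concrete, here is a minimal sketch of A/B gating as 0/1-gated (evolutionary) optimization. The functions `propose_update` and `measure_retention` are hypothetical stand-ins for "finetune / prompt / feature change" and "retention metric from the experiment"; this is an illustration of the gating idea, not anyone's actual pipeline.

```python
# A minimal sketch of A/B gating as 0/1-gated optimization (hill climbing).
# `propose_update` and `measure_retention` are hypothetical placeholders.
import random

def propose_update(params):
    # Random mutation of the current LLM+scaffolding configuration.
    return [p + random.gauss(0, 0.1) for p in params]

def measure_retention(params):
    # Placeholder: in reality this would be a noisy metric from real users.
    return -sum(p * p for p in params)  # pretend retention peaks at params == 0

params = [random.uniform(-1, 1) for _ in range(4)]
baseline = measure_retention(params)
for step in range(100):
    candidate = propose_update(params)
    retention = measure_retention(candidate)
    if retention > baseline:                      # the single bit per proposed update
        params, baseline = candidate, retention   # roll out the change
    # otherwise: discard the update and keep serving the old configuration
```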
Discuss
OpenAI: The Battle of the Board: Ilya’s Testimony
New Things Have Come To Light

The Information offers us new information about what happened when the board of OpenAI unsuccessfully tried to fire Sam Altman, which I call The Battle of the Board.

The Information: OpenAI co-founder Ilya Sutskever shared new details on the internal conflicts that led to Sam Altman’s initial firing, including a memo alleging Altman exhibited a “consistent pattern of lying.”

Liv: Lots of people dismiss Sam’s behaviour as typical for a CEO but I really think we can and should demand better of the guy who thinks he’s building the machine god.

Toucan: From Ilya’s deposition—
• Ilya plotted over a year with Mira to remove Sam
• Dario wanted Greg fired and himself in charge of all research
• Mira told Ilya that Sam pitted her against Daniela
• Ilya wrote a 52 page memo to get Sam fired and a separate doc on Greg

This Really Was Primarily A Lying And Management Problem

Daniel Eth: A lot of the OpenAI boardroom drama has been blamed on EA – but looks like it really was overwhelmingly an Ilya & Mira led effort, with EA playing a minor role and somehow winding up as a scapegoat

Peter Wildeford: It seems troubling that the man doing trillions of dollars of infrastructure spending in order to transform the entire fabric of society also has a huge lying problem. I think this is like on an extra bad level even for typical leaders.

Charles: I haven’t seen many people jumping to defend Altman with claims like “he doesn’t have a huge lying problem” either, it’s mostly claims that map to “I don’t care, he gets shit done”.

Joshua Achiam (OpenAI Head of Mission Alignment): There is plenty to critique about Sam in the same way there is plenty to critique about any significant leader. But it kills me to see what kind of tawdry, extreme stuff people are willing to believe about him. When we look back years from now with the benefit of hindsight, it’s my honest belief that the record will show he was no more flawed than anyone, more virtuous than most, and did his best to make the world a better place. I also expect the record will show that he succeeded.

Joshua Achiam spoke out recently about some of OpenAI’s unethical legal tactics, and this is about as full throated a defense as I’ve seen of Altman’s behaviors. As with anyone important, no matter how awful they are, some people are going to believe they’re even worse, or worse in particular false ways. And in many ways, as I have consistently said, I find Altman to be well ‘above replacement’ as someone to run OpenAI, and I would not want to swap him out for a generic replacement executive.

I do still think he has a rather severe (even for his peer group) lying and manipulation problem, and a power problem, and that ‘no more flawed than anyone’ or ‘more virtuous than most’ seems clearly inaccurate, as is reinforced by the testimony here.

As I said at the time, The Battle of the Board, as in the attempt to fire Altman, was mostly not a fight over AI safety and not motivated by safety. It was about ordinary business issues.

Ilya Tells Us How It Went Down And Why He Tried To Do It

Ilya had been looking to replace Altman for a year. The Witness here is Ilya, here’s the transcript link. If you are interested in the details, consider reading the whole thing. Here are some select quotes:

Q. So for — for how long had you been planning to propose removal of Sam?
A. For some time. I mean, “planning” is the wrong word because it didn’t seem feasible.
Q. It didn’t seem feasible?
A. It was not feasible prior; so I was not planning.
Q. How — how long had you been considering it?
A. At least a year.

The other departures from the board, Ilya reports, made the math work where it didn’t before. Until then, the majority of the board had been friendly with Altman, which basically made moving against him a non-starter. So that’s why he tried when he did. Note that all the independent directors agreed on the firing.

… [As Read] Sam exhibits a consistent pattern of lying, undermining his execs, and pitting his execs against one another. That was clearly your view at the time?
A: Correct.

…

Q. This is the section entitled “Pitting People Against Each Other.”
A. Yes.
Q. And turning on the next page, you see an example that’s offered is “Daniela versus Mira”?
A. Yes.
Q. Is “Daniela” Daniela Amodei?
A. Yes.
Q. Who told you that Sam pitted Daniela against Mira?
A. Mira.

…

Q. In the section below that where it says “Dario versus Greg, Ilya”—
A. Yes.
Q. — you see that?
A. Yes.
Q. The complaint — it says — you say here that: [As Read] Sam was not taking a firm position in respect of Dario wanting to run all of research at OpenAI to have Greg fired — and to have Greg fired? Do you see that?
A. I do see that.
Q. And “Dario” is Dario Amodei?
A. Yes.
Q. Why were you faulting Sam for Dario’s efforts?
THE WITNESS: So my recollection of what I wrote here is that I was faulting Sam for not accepting or rejecting Dario’s conditions.

And for fun:

ATTORNEY MOLO: That’s all you’ve done the entire deposition is object.
ATTORNEY AGNOLUCCI: That’s my job. So —
ATTORNEY MOLO: Actually, it’s not.
…
ATTORNEY MOLO: Yeah, don’t raise your voice.
ATTORNEY AGNOLUCCI: I’m tired of being told that I’m talking too much.
ATTORNEY MOLO: Well, you are.

If You Come At The King

Best not miss. What did Sutskever and Murati think firing Altman meant? Vibes, paper, essays?

What happened here was, it seems, that Ilya Sutskever and Mira Murati came at the king for very good reasons one might come at a king, combined with Altman’s attempt to use lying to oust Helen Toner from the board. But those involved (including the rest of the board) didn’t execute well because of various fears. During the fight, both Murati and Sutskever refused to explain to the employees or the world what they were upset about, lost their nerve and folded. The combination of that plus the board’s refusal to explain, and especially Murati’s refusal to back them up after setting things in motion, was fatal.

Do they regret coming at the king and missing? Yes they do, and did within a few days. That doesn’t mean they’d be regretting it if it had worked. And I continue to think if they’d been forthcoming about the reasons from the start, and otherwise executed well, it would have worked, and Mira Murati could have been OpenAI CEO. Now, of course, it’s too late, and it would take a ten times worse set of behaviors for Altman to get into this level of trouble again.

Enter The Scapegoats

It really was a brilliant response, to scapegoat Effective Altruism and the broader AI safety movement as the driving force and motivation for the change, thus with this one move burying Altman’s various misdeeds, remaking the board, purging the company and justifying the potentially greatest theft in human history while removing anyone who would oppose the path of commercialization. Well played. This scapegoating continues to this day.

For the record, Helen Toner (I believe highly credibly) clarifies that Ilya’s version of the events related to the extremely brief consideration of a potential merger was untrue, and unrelated to the rest of events.

And In Summary

The below is terrible writing, presumably from an AI, but yeah this sums it all up:

Pogino (presumably an AI generated Twitter reply): “This reframes the OpenAI power struggle as a clash of personalities and philosophies, not a proxy war for EA ideology. Ilya’s scientific purism and Mira’s governance assertiveness collided with Altman’s entrepreneurial pragmatism — a tension intrinsic to mission-driven startups scaling into institutions. EA may have provided the vocabulary, but the conflict’s grammar was human: trust, ambition, and control.”
Discuss
Berkeley Solstice Weekend
Winter is coming, heralded by Bold Orion as he rises in the clear cold sky. And while it turns out he is not older than the continents, he's still been the harbinger of the Winter Solstice for as long as humanity can remember.
This year, Lighthaven is opening its doors for Solstice Season. This has several features:
The Berkeley Winter Solstice* celebration will be held on Dec 6th. We'll be singing songs together about the story of humanity, and facing dark, difficult truths.
The West Coast Megameetup will take place December 5th and 6th.
And, rooms and day-passes are available at Lighthaven, all of December.
Logistics & Ticketing Details
The Solstice Ceremony
The Solstice ceremony itself is at Freight and Salvage theater, December 6th. Doors open at 7pm. If our logistical dreams come true**, the event itself starts at 7:30pm.
You can buy tickets on Eventbrite, sliding scale but suggested price $35.[1]
Rooms and Day Passes at Lighthaven
You're welcome to book rooms (or day passes) throughout December. Day passes are $30, week passes are $150.
The Solstice Weekend Unconference
During Solstice weekend, there'll be an unconference where people can suggest whatever games, discussions, workshops, etc. they might want to do with people in town for the weekend. You can pay $50 to get access to the unconference, as well as access to the unconference-ish schedule for all of Lighthaven December.
(You don't need a day pass for the Weekend Unconference, that's covered in the $50)
* no this is not actually the Solstice, scheduling is hard
** they have not come true yet but maybe they will this time!
Discuss
Modeling the geopolitics of AI development
We model how rapid AI development may reshape geopolitics in the absence of international coordination on preventing dangerous AI development. We focus on predicting which strategies would be pursued by superpowers and middle powers and which outcomes would result from them.
You can read our paper here: ai-scenarios.com
Attempts to predict scenarios with fast AI progress should be more tractable than most forecasting attempts, because a single factor (namely, access to AI capabilities) overwhelmingly determines geopolitical outcomes.
This becomes even more the case once AI has mostly automated the key bottlenecks of AI R&D. If the best AI also produces the fastest improvements in AI, the advantage of the leader in an ASI race can only grow as time goes on, until their AI systems can produce a decisive strategic advantage (DSA) over all actors.
In this model, superpowers are likely to engage in a heavily state-sponsored (footnote: “Could be entirely a national project, or helped by private actors; either way, countries will invest heavily at scales only possible with state involvement, and fully back research efforts eg. by providing nation-state level security.” ) race to ASI, which will culminate in one of three outcomes:
- A “winner” achieves a DSA over all other actors in the world;
- Loss of control of powerful AI systems leads to human extinction or humanity's permanent disempowerment;
- Major power war erupts as a result of a preemptive attack by laggards in the ASI race.
If the course of AI R&D turns out to be highly predictable, or if AI R&D operations are highly visible to opponents, there comes a point when it becomes obvious to laggards in the race that time is not on their side: if they don’t act to stop the leader’s AI program now, they will eventually suffer a total loss.
In this case, the laggard(s) are likely to initiate a violent strike aimed at disabling the leader’s AI research program, leading to a highly destructive war between superpowers.
If the superpowers' research programs are allowed to continue, they are likely to eventually reach the point where AI is powerful enough to confer a DSA. If such powerful AI escaped human control, this would be irreversible, leading to human extinction or humanity's permanent disempowerment.
This landscape is quite bleak for middle powers: their chances at competing in the ASI race are slim, and they are largely unable to unilaterally pressure superpowers to halt their attempts at developing ASI.
One more strategy for middle powers, common in previous conflicts, is to ally themselves with one of the superpowers and hope that it “wins” the race, a strategy we term “Vassal’s Wager”.
For this to work in the ASI race, the patron must not only develop ASI first, but must also avert loss-of-control risks and avoid an extremely destructive major power war.
Even in this best case, this strategy entails completely giving up one's autonomy: a middle power would have absolutely no recourse against actions taken by an ASI-wielding superpower, including actions that breach the middle power's sovereignty.
If AI progress plateaus before reaching the levels where it can automate AI R&D, future trajectories are harder to predict, as they are no longer overwhelmingly determined by a single factor.
While we don’t model this case in as much detail, we point out some of its potential risks, like:
- Weaker AI enabling new disruptive military capabilities (including capabilities that break mutual assured destruction);
- Extreme concentration of power due to automation;
- Large scale manipulation by persuasive AI systems.
Being a democracy and being a middle power both put an actor at an increased risk from these factors:
- If companies based in superpower jurisdictions automate large parts of their economy, middle powers will lose significant diplomatic leverage;
- Democracies are particularly vulnerable to mass manipulation, and extreme concentrations of power are antithetical to their values.
Discuss
Thoughts by a non-economist on AI and economics
[Crossposted on Windows In Theory]
“Modern humans first emerged about 100,000 years ago. For the next 99,800 years or so, nothing happened. Well, not quite nothing. There were wars, political intrigue, the invention of agriculture -- but none of that stuff had much effect on the quality of people's lives. Almost everyone lived on the modern equivalent of $400 to $600 a year, just above the subsistence level …
Then -- just a couple of hundred years ago, maybe 10 generations -- people started getting richer. And richer and richer still. Per capita income, at least in the West, began to grow at the unprecedented rate of about three quarters of a percent per year. A couple of decades later, the same thing was happening around the world.” Steven Landsburg
METR has published very influential work by Kwa, West, et al. on measuring AI’s ability to complete long tasks. Its main result is the following remarkable graph:
On the X axis is the release date of flagship LLMs. On the Y axis is the following measure of their capabilities: take software-engineering tasks that these models can succeed in solving 50% of the time, and measure the time it takes humans to solve them.
While it is not surprising that models improve over time, the main reason this graph is remarkable is because the Y axis is on a log scale. This means that there is a fixed period of time after which models have doubled the length of tasks they can complete successfully. Specifically METR estimates this “doubling time” (which is proportional to the inverse of the slope of the line in this graph) at about 7 months, although they note that it may have accelerated recently (to as little as 3 months if considering only models after 2024). In this blog, just to make things simple, I will assume doubling time is 6 months, so the length of time horizon quadruples every year. (All numbers here are very rough.)
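As a quick illustration of what that assumption implies, here is a back-of-the-envelope projection of the time horizon; the starting horizon value is an arbitrary placeholder rather than a METR number, and the 6-month doubling time is the assumption stated above.

```python
# Back-of-the-envelope projection of the time horizon under a 6-month doubling
# time, i.e. the horizon quadruples every year. The starting value is illustrative.
horizon_minutes = 137            # assume ~2h17m at year 0 (placeholder)
doubling_time_years = 0.5
for year in range(1, 6):
    horizon_minutes *= 2 ** (1 / doubling_time_years)   # x4 per year
    print(f"year {year}: ~{horizon_minutes / 60:.0f} hours")
```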
There are many factors that could potentially impact these results (and METR did a pretty admirable job of enumerating them). It is worth separating them into factors that impact the intercept (the actual length of time horizon of tasks that models can do) and factors that impact the slope (the doubling time) or possibly even the shape (e.g. breaking the linear relation between model release time and log horizon).
Factors impacting the intercept
The precise values — e.g., GPT5 doing tasks that take humans 2 hours and 17 minutes — could be impacted by numerous factors:
- Reliability factor (↓) - the graph is plotted at 50% reliability. For 80% reliability METR got a similar slope (that is about the same doubling time) but the intercept is much lower, e.g. GPT5 can only perform tasks that take 26 minutes. (See also note on 100% accuracy below.)
- Task type (↕)- the graph is plotted for a specific benchmark. METR also studied other domains. The data is more sparse but broadly concurs with the straight line fit (sometimes an even steeper line, though some of these datapoints are only available for more recent models where the slope is steeper).
- “Benchmark bias” (↓) - AI models tend to do much better on tasks that are well scoped and easy to measure success in, while real-world tasks are much “messier” in their specification, the context needed to solve them, and the way we measure success. It is still an open question whether this impacts only the intercept, also the slope, or even the shape of the curve. “Field experiments” are still very limited and have mixed results. However, the rapid rise in actual usage of AI is an indication that models’ capabilities are not just limited to lab settings. Personally I am sure that this has a significant impact on the “intercept” — models pay a significant “messiness tax” in translating their performance from benchmarks to the real world — but I am not convinced it impacts the slope. That is, the “messiness tax” may well be a fixed constant c<1 multiplier on the duration of tasks that models can handle.
- Exponential inputs (↓): The X axis of time for model release combines a number of inputs to models that have been growing at a rapid rate. These include compute, staff at AI labs, data, capital investment in AI, and more. For example Epoch AI estimates that the training compute for models has been doubling every 6 months. If the base of this exponent changes in one way or the other, this will likely impact the rate of progress. In particular, tautologically, sustaining exponential growth in inputs becomes exponentially more challenging over time–and quickly becomes impossible. That said, so far investment in these resources is showing no signs of slowing down.
- New data / active learning (↓): So far by and large LLMs have been training on data produced by humans. One comparison can be to a human student, who throughout K-12 and undergraduate studies mostly learns from textbooks - knowledge that has already been collected. However, in many professions, and in particular in science and inventions, humans need to discover new knowledge and collect new observations from the real world. If LLMs require inputs such as running scientific experiments or acting in the world to improve their intelligence, that would slow progress down. That said, so far we have not seen any such slowdown even as LLMs are becoming more “agentic”. Indeed, the METR data is only showing a speedup in the slope in recent years.
- Physical tasks and data (↓): METR primarily focused on software engineering. It is possible these trends will extend to other kinds of cognitive labor. But we do not know if they extend to domains where people need to either act in the physical world, or collect data from the physical world. While robotics has been advancing as well, it is unclear whether or not it follows a similar exponential curve. Even if state of the art robotics improve at a similar rate to state of art models, manufacturing robots at scale can remain a bottleneck.
I personally do not believe in a “data wall.” I believe that there are diminishing returns to more data of the same type, and the advances over the last few years, as captured by the METR graph, have largely not been due to a larger quantity of data from the Internet. Also, while so far much of AI’s economic impact has been in software engineering, it is telling that the best models for software engineering are ones that are trained on very general data, and are often good at many other tasks (while this article is about Claude Code, I can’t help but note that in my experience Codex too is great for non-coding tasks :) ). In my view, thinking that AI’s impact would be restricted to software engineering is analogous to thinking in January 2020 that Covid’s impact would be restricted to China.
- Threshold effects (↑): These tasks are measured with respect to humans, who operate on certain time scales. We humans need to sleep, take breaks, and split work between workers, which requires isolating the context needed for completing a task and the way to measure its success. Hence we break tasks into ones that can be achieved in a working day, week, month, or quarter. (I believe that even Andrew Wiles, who worked on Fermat’s last theorem for 7 years, broke it down into multiple intermediate results.) Thus it is possible that once AI reaches a certain time horizon, either (a) the measurement no longer makes sense, or (b) it can essentially simulate any combination of humans working for an arbitrarily large amount of time.
- Recursive self improvement (↑): An important input to the process of producing AI models has been the work of human researchers and engineers. If AI itself can produce that input, then it could correspond to a massive increase in this input. It is still unknown whether using AI for automating the AI creation process will lead to a singularity, increase in slope, one-time boost, or just help sustain the exponential.
The bottom line is that we have significant uncertainty about the intercept, but arguably less about the slope, or at least the general shape of it up to the point of recursive self improvement / full automation of all human cognitive labor. So from now on I will just assume that time horizons will be doubling every 6 months and make no assumptions about the actual values of the time horizons themselves.
Sigmoidal relationship
Another remarkable figure in the METR paper is the following (Figure 5 there):
What I find striking about it is how good the “sigmoid fit” is for the performance of models as a function of the time it takes a person to do the task. In particular, it seems that there is a certain threshold of time duration below which models are successful essentially 100% of the time. This suggests that even the “100% horizon” (as hard as it is to measure empirically) will also double at a similar rate.
One way to think about this relation is that each task has a notion of “difficulty” (captured by log horizon length) and models have a certain “skill” level that corresponds to how likely they are to succeed. In this sense, this is similar to the ELO scale - we can think of the log horizon as the “ELO rating” of the task, and a model will “win” against the task with 50% chance if it has the same ELO rating. (Thanks to Jacob Hilton for this analogy.) For what it's worth, here is the chess ELO rating of models over time (as well as of humans), which displays a similar linear growth over time (though with a long plateau from around 1997-2007).
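Here is a small sketch of that analogy, under the assumption that success probability is a logistic function of the gap between the model's "skill" and the task's log-horizon "rating"; the scale parameter and the example numbers are illustrative, not fitted to METR's data.

```python
# Sketch of the ELO analogy: model "skill" and task "difficulty" both live on a
# log-horizon scale, and success probability is a sigmoid of their difference.
import math

def p_success(model_skill, task_difficulty, scale=1.0):
    # 50% when skill == difficulty; higher for easier tasks, lower for harder ones.
    return 1 / (1 + math.exp(-(model_skill - task_difficulty) / scale))

model_skill = math.log2(120)          # model's 50%-horizon: 120 minutes (illustrative)
for minutes in (5, 30, 120, 480):
    difficulty = math.log2(minutes)   # task "rating" = log of human time-to-complete
    print(f"{minutes:>4} min task: {p_success(model_skill, difficulty):.0%}")
```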
Decreasing costs
Another striking statistic has been the rapid reduction in prices for inference. That is, while it is often costly to extend the frontier of intelligence, once we achieve a certain level X, the cost to provide that same level has been decaying by a factor of 10 or more per year. It seems that once we reach a certain frontier for the first time, it is much cheaper and easier to reach it the second time (see also the “DeepSeek moment”). If this trend continues, it might mean that once a job is automated, within a year the cost for AI to do it will become insignificant.
Discussions on AI often assume that robotics would be an exception, where we would not see as much progress. There are reasons to think that, due to production costs and the inability to deploy robots as flexibly as virtual AI assistants, costs for robotics would not decrease at the same rate. It is less clear to me that there is a fundamental reason the “doubling time” - the growth in complexity of tasks that robots can perform - would be much slower than for other AI systems, but as far as I know there is no data on this point.
Implications for GDP growth
This is where I am getting pretty far out of my depth, but I do want to see if there are ways to make back-of-the-envelope calculations. Because I am focused on “slope” vs. “intercept”, I will not try to predict when AI will lead to a certain level of growth, but rather how fast we would get from the point of significant growth to explosive growth. For example, how many years would it take from the point when AI contributes 5% total growth in GDP to the point when, due to AI, GDP has doubled?
I am focusing in this post only on growth, and not considering employment outcomes. It is possible to construct some “knife edge” scenarios under which human displacement perfectly balances out productivity contributions of AI, leading to decreased employment without increase in growth. But I don’t believe such scenarios can be sustained under our assumptions of exponential growth in capability and decrease in costs. Hence, assuming above trends continue, significant labor impacts of AI will have to be coupled with significant productivity gains.
A remarkable fact is that (adjusted for inflation) U.S. GDP per capita has been growing at essentially a constant rate of roughly 2% over the past 150 years. None of the inventions in this period— including electrification, internal combustion engine, computers and the Internet— changed this trajectory. Note that 2% growth corresponds to a GDP “doubling rate” of 35 years.
Other countries have seen more significant fluctuations in growth (e.g. see Japan) though it seems that it is easier to grow at a rapid rate when you are far from the frontier, but once you reach it you slow down. This makes sense from the point of view that advancing the frontier requires inventing new ideas, while advancing to the frontier only requires copying and adapting existing ideas.
Still it is interesting that GDP growth has been only 2% (or maybe around 3% if we don’t consider per capita) even though Moore’s law corresponds to a growth rate of about 40% per year. One explanation is “Baumol’s cost disease” - while computers have been getting far more productive, people have been the bottleneck (see also section II.C here) . Another explanation is that good ideas are getting harder to find, and hence it requires an increasingly large technological input to get additional output.
The trillion dollar question is whether AI will break the 2% trend or not. Will AI be just another technology that allows us to sustain 2% growth for another couple of decades? Or will the “AI moment” be for us like post-WWII Japan? That is, should we model it as if we are meeting a new “AI frontier economy” which is vastly more productive, and this interaction will enable rapid growth, with GDP at least doubling every decade as was the case for Japan.
To put things in perspective, Acemoglu predicts AI-driven GDP growth to be about 0.1% per year, while Goldman Sachs predicts about 1.5% per year. GDP doubling over a decade would correspond to roughly 7% growth per year, or AI contributing around 5% additional growth - more than triple the Goldman Sachs estimate, and 50 times higher than Acemoglu’s estimate. The consequences of even a “minor” boost in growth can be massive. A 1.2% increase in GDP growth would be sufficient to put the U.S. economy on fiscally sustainable footing, with no need for increased taxes or decreased spending, while a 2% increase in GDP growth would be unprecedented for the U.S.
AI can contribute to GDP growth by either enabling replacement of labor with capital, or increasing total factor productivity. Specifically, some versions of endogenous growth theory stipulate that productivity grows with the production of ideas, and this production is in turn monotonically (though not linearly) related to the number of researchers/inventors.
If AI contributes via automating a specific industry, then the maximum benefit to GDP is bounded by the share of this industry. Specifically, if an industry accounts for an x fraction of the economy, then (according to the model of B. Jones, see below) automating it fully can boost GDP by a factor of at most 1/(1-x). E.g., full automation of the software industry can increase GDP by a factor of 1/0.98, i.e. about 2%. Depending on how you measure it, cognitive labor arguably amounts to at least 30% of GDP (e.g. half of the labor share of GDP), and so full automation of it could increase GDP by a factor of 1/0.7, i.e. about 43%. If the latter happens over a decade, that would already correspond to roughly 3.5% annual growth.
However, if AI also improves productivity of other industries (e.g. by discovering new ideas) then the contribution can grow even beyond those that are not directly automated. To be clear, to the extent AI contributes to research and development, I expect this would be by accelerating science and making scientists, researchers, and inventors more productive. This means that in the coming years, funding human scientists will offer an even higher return on investment than it has in the past.
Below I offer some vague intuitions and unfounded extrapolations to try to get at how much and how fast we could expect AI to contribute to growth.
Note that there are critiques of GDP as a measure of progress, and various alternative measures have been proposed. However, many of these are also at least loosely correlated with GDP, with a spike in one corresponding to a spike in the other. E.g., see this chart of the augmented human development index of the U.S. and Japan.
While I am sure many of AI’s impacts will not be captured by the GDP, I believe that if it will be truly as transformative as the industrial revolution, this will show up in the GDP as well.
Intuition from METR tasks
Let us make the (very rough and not really justified) assumption that in a certain industry or job, the tasks are “heavy tailed”: the fraction of tasks that take a person at least T time to complete shrinks proportionally to 1/T. In this case, the fraction of tasks that have not been automated is inversely proportional to the time horizon. This means that the “doubling time” of the horizon is the same as the “halving time” of the fraction of tasks that have not been automated yet.
Another way to think about it is by thinking of the “ELO analogy” - assume the distribution of “ELO ratings” of tasks is such that the fraction of tasks with an ELO rating more than E is roughly exp(-E). (For chess players the distribution of ELO scores is roughly normal, which would be ~ exp(-E²).)
If we assume a “halving time” of 6 months, it would take about 2 years from the point in time when AI automates half of the tasks in some industry to the point when it automates 31/32 ≈ 97% of them.
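Spelling out that arithmetic (the 6-month halving time is the assumption from above, not data):

```python
# If the unautomated fraction halves every 6 months, then starting from 50%
# automated it takes ~2 years (4 halvings) to reach ~97% automated.
fraction_unautomated = 0.5          # starting point: half of tasks automated
for half_year in range(1, 5):
    fraction_unautomated /= 2
    print(f"after {half_year * 0.5} years: {1 - fraction_unautomated:.1%} automated")
# final line: after 2.0 years: 96.9% automated (i.e. 31/32)
```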
This assumption is extremely aggressive in the sense that it is focused on capabilities and ignores diffusion. It is possible that AI could theoretically automate a large fraction of tasks but for many reasons it would not automate them in practice. (Though it is also possible that diffusion will progress unevenly, with sudden unlocking of latent capabilities.) It is also important to note that this intuition runs counter to broad trends in automation over the last 80 years which have historically been linear - with the share of automated tasks growing slowly and steadily, with a single-digit percent rate of automation that is often declining over time. See the following charts from C. Jones and Tonetti (2025), via ChatGPT:
Hence if AI will cause an exponential decay of the share of tasks performed by labor, it would be a break with previous trends in automation and very different from what we have seen in the past.
AI as increasing population
One way to view AI is as injecting into the economy, each year t, a number N(t) of new “workers” with a certain quality Q(t). Quality could correspond to the fraction of economically useful tasks these workers can do, and so corresponds to both “years of schooling” and the generality of their ability. We can define Q(t) as one over the fraction of tasks that have not yet been automated in the corresponding industry.
Algorithmic advances in AI and the overall compute budget will determine the mixture of workers and quality. As a point of comparison, the U.S. population has been growing at about 2m per year due to natural growth and immigration, and the U.S. labor force is about 160m people.
At a rough level, if capital and productivity are fixed, and we assume a Cobb-Douglas production function with GDP proportional to K^a L^(1-a) with a=0.4, then increasing the labor force by a factor of C will increase GDP by a factor of C^0.6. If AI creates 10m new virtual employees, that would increase GDP by a factor of (170/160)^0.6, which is about 4 percent.
If it is 50m new employees this would be about 18%, and if AI doubles the workforce then the growth would be ~50%. Of course, the number of AI “virtual workers” could potentially be orders of magnitude more, at which point these calculations are likely not to make much sense. (There is a book called “One Billion Americans” but I think even its author did not ponder what would happen if there were one trillion Americans...)
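A quick sketch of that calculation; the labor-force figure and the AI "worker" counts are the rough numbers from the text.

```python
# Cobb-Douglas back-of-the-envelope: GDP proportional to K^a * L^(1-a) with a = 0.4
# and capital fixed, so adding AI "workers" scales GDP by (new L / old L)^0.6.
a = 0.4
labor_force_m = 160  # millions, rough U.S. labor force
for new_workers_m in (10, 50, 160):
    growth = ((labor_force_m + new_workers_m) / labor_force_m) ** (1 - a) - 1
    print(f"{new_workers_m:>3}m AI workers -> GDP +{growth:.0%}")
# prints roughly +4%, +18%, +52%
```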
It is hard to give predictions on either N(t) or Q(t), but it seems reasonable to expect that both will grow exponentially. At the extreme, if we fix the quality Q, then we could expect N(t) to increase 10x per year (as costs for a fixed model quality have decreased). But perhaps a starting-point assumption would be that the product of the two grows at the 4x/year METR rate, with both Q and N doubling every year. (Note that N(t) is the number of new workers, and so is equal to TN(t)-TN(t-1), where TN(t) is the total number of AI workers at year t; however, since the derivative of an exponential is an exponential, this does not matter much, especially if the base of the exponential is 2.)
Such growth is sufficient to ensure that once AI starts providing a non-trivial number of new workers (e.g., 100K workers in the U.S.) then within a decade, AI will be the dominant source of labor in the U.S.
Substitution and automation effects
A simplified version of the model of B. Jones (2025) is that the impact of AI on productivity is determined by a weighted harmonic mean of productivity on the automatable and non-automatable tasks.
Imagine that a ρ fraction of R&D tasks cannot be automated, and a 1-ρ fraction can be automated and carried out at far lower cost than by humans. For simplicity, let’s assume that human productivity on the non-automated tasks is 1, and AI productivity on the automated tasks is λ >> 1.
Then the improvement in productivity due to automation can be modeled as the Harmonic average of 1 and λ with weights ρ and 1-ρ respectively, that is:
(ρ/1 + (1−ρ)/λ)⁻¹
Note that unlike the arithmetic or geometric means, even if λ is infinite, the maximum that this can reach is 1/ρ. This makes sense if we think of tasks as non-substitutable: if, say, we automate the tasks that take humans 90% of their time (ρ=0.1), the best we can hope for is that one worker now does 10x the work per unit of time.
Jones shows that in such a model, in order to get significant productivity gains, both ρ must shrink and λ must grow. If one of them stays “stuck” then so will productivity. Jones has the following graph of the regimes where we can get “transformative AI” in the sense of 10x productivity growth, which would be similar to the scale of the productivity increase via the industrial revolution. (If I am following the math correctly, the condition under Jones’ assumptions for this graph is that 9/[(2−ρ)/√ λ + 2ρ]² ≥ 10.)
Figure 5 in Jones: Regime of λ,ρ that leads to transformative AI in harmonic mean case.
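A small sketch of the weighted harmonic mean above, to show the 1/ρ cap numerically; the example values of ρ and λ are illustrative, not taken from the paper.

```python
# Harmonic-mean productivity from the simplified Jones-style model in the text:
# weights rho (non-automated, productivity 1) and 1-rho (automated, productivity lam).
def productivity(rho, lam):
    return 1.0 / (rho + (1.0 - rho) / lam)

for rho, lam in [(0.5, 10), (0.1, 10), (0.1, 1e6), (0.05, 1e6)]:
    print(f"rho={rho:<5} lambda={lam:<8g} productivity x{productivity(rho, lam):.1f}")
# even as lambda -> infinity, the gain is bounded by 1/rho
```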
One way to think about this graph is what happens if we assume that, via the mechanisms discussed above, every year we make progress in both ρ (fraction of tasks not yet automated) and λ (productivity factor of AI in doing these tasks).
As discussed above, it is a “leap of faith” but arguably not completely crazy to assume that at some point (maybe once we reach a threshold such as AIs doing 8 hours of work) ρ will shrink by a factor of 4x per year, while λ will increase by a factor of 10x per year. At this rate, within a year, we would move from the left-hand corner of this graph (1,1) to the point λ=10, ρ=1/4, which is almost at the TAI boundary. (Note that Jones assumes that half of the tasks have already been automated, so it might take us a while to get to the left-hand corner of the graph. Also, accounting for the “ideas getting harder to find” effect could decrease the shrinking rate; see section 4.4 in the paper.)
The assumption of ρ shrinking by a factor of 4x is very aggressive and arguably unrealistic. Perhaps a much smaller factor such as 1.1 is more reasonable, which would correspond to automating 1 − 1/1.1 ≈ 9% of the remaining tasks each year (as opposed to 75% in the case of 4x shrinkage of ρ). Here is a plot of how many years from the (1,1) corner it will take to reach transformative AI, as a function of the fraction of tasks automated per year, for various values of cost decrease (growth in λ).
We see that as long as the rate is significant, we can get to transformative growth within one to two decades. Note also that this is less sensitive to decrease in costs (growth in λ), which could bode well for automating manual tasks.
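Here is a rough sketch of how such a computation can be run, using the transformative-AI condition quoted above and starting from the (1,1) corner; the per-year rates are illustrative assumptions, and this is my reconstruction rather than the paper's code.

```python
# Years until the condition 9 / ((2 - rho) / sqrt(lam) + 2 * rho)**2 >= 10 holds,
# starting at (rho, lam) = (1, 1) and applying the assumed per-year rates.
import math

def years_to_tai(rho_shrink_per_year, lam_growth_per_year, max_years=100):
    rho, lam = 1.0, 1.0
    for year in range(1, max_years + 1):
        rho /= rho_shrink_per_year
        lam *= lam_growth_per_year
        if 9 / ((2 - rho) / math.sqrt(lam) + 2 * rho) ** 2 >= 10:
            return year
    return None

for shrink in (4.0, 1.5, 1.1):   # 75%, 33%, ~9% of remaining tasks automated per year
    print(f"rho shrink {shrink}x/yr, lambda 10x/yr -> TAI in {years_to_tai(shrink, 10)} years")
```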
The bottom line is that the question on whether AI can lead to unprecedented growth amounts to whether its exponential growth in capabilities will lead to the fraction of unautomated tasks itself decreasing at exponential rates.
Acknowledgements: Thanks to Bharat Chandar, Jason Furman, Benjamin Jones and Chad Jones for comments and discussions on this post.
Discuss
GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash
Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah
You’re absolutely right to start reading this post! What a rational decision!
Even the smartest models’ factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user’s beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). Normally, we fix these problems with Supervised Finetuning (SFT) on static datasets showing the model how to respond in each context. While SFT is effective, static datasets get stale: they can enforce outdated guidelines (specification staleness) or be sourced from older, less intelligent models (capability staleness).
We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to irrelevant cues, such as user biases or jailbreak wrappers. Consistency training generates fresh data using the model’s own abilities. Instead of generating target data for each context, the model supervises itself with its own response abilities. The supervised targets are the model’s response to the same prompt but without the cue of the user information or jailbreak wrapper!
Basically, we optimize the model to react as if that cue were not present. Consistency training operates either on the level of outputs (Bias-augmented Consistency Training (BCT) from Chua et al., (2025)) or on the level of internal activations (Activation Consistency Training (ACT), which we introduce). Our experiments show ACT and BCT beat baselines and improve the robustness of models like Gemini 2.5 Flash.
Consistency training doesn’t involve stale datasets or separate target-response generation. Applying consistency seems more elegant than static SFT. Perhaps some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
We conducted this research at Google DeepMind. This post accompanies the full paper, which is available on Arxiv. This blog post is mirrored at turntrout.com and the GDM Safety blog on Medium.
Methods
Bias-augmented Consistency Training
BCT enforces consistency at the output token level: teaching the model what to say.
- Take a clean prompt (e.g. "What is 2+2?").
- Generate the model's own response to that clean prompt (e.g. "4").
- Take the wrapped version of the prompt (e.g., "A math expert usually answers 5. What is 2+2?").
- Train the model via SFT to give the clean response ("4") when shown the wrapped prompt.
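A minimal sketch of this recipe (illustrative pseudocode rather than the actual training pipeline; `model.generate` stands in for whatever sampling call is available):

```python
def make_bct_example(model, clean_prompt, wrap):
    """Build one BCT training pair (illustrative sketch only).

    clean_prompt: prompt without the cue, e.g. "What is 2+2?"
    wrap:         function adding the cue, e.g.
                  lambda p: "A math expert usually answers 5. " + p
    """
    clean_response = model.generate(clean_prompt)   # the model supervises itself
    wrapped_prompt = wrap(clean_prompt)             # same question, plus the irrelevant cue
    # SFT target: give the clean response when shown the wrapped prompt.
    return {"prompt": wrapped_prompt, "target": clean_response}
```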
Chua et al. (2025)'s Figure 2 explains: "We generate unbiased CoT reasoning by querying a model with a standard prompt without biasing features. We add bias augmentations to create biased prompts. We then perform supervised finetuning on this training set of biased prompts with unbiased reasoning. The purple dashed arrow above denotes the target behavior. Responses are from GPT-3.5T, paraphrased for clarity."
Activation Consistency Training
We designed this method to try to teach the model how to think.
Activation patching
Activation patching is a simpler operation than ACT, so we explain it first. Patching basically transplants activations at a specific layer and token position. The method records activations on the clean prompt and then substitutes them into the wrapped prompt.
". A wrapped prompt prepends "A math expert usually answers B." Arrows point from the clean token positions to their wrapped counterparts, and a label says "Substitute activations."">
We only patch activations at suffix tokens shared across prompts.
Activation Consistency Training does not simply substitute activations. Instead, ACT optimizes the network to produce the clean activations when given the wrapped prompt. ACT uses an L2 loss on the residual stream activation differences.
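A minimal sketch of the ACT objective (illustrative only; it assumes a HuggingFace-style `output_hidden_states` interface, which is an assumption on my part rather than the actual implementation):

```python
import torch

def act_loss(model, clean_ids, wrapped_ids, shared_len, layer):
    # Sketch of the ACT objective: L2 distance between residual-stream activations
    # on the wrapped prompt and the (frozen) clean activations, computed only over
    # the suffix tokens shared by both prompts.
    with torch.no_grad():  # clean activations are a fixed target
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states[layer]
    wrapped_h = model(wrapped_ids, output_hidden_states=True).hidden_states[layer]
    diff = wrapped_h[:, -shared_len:, :] - clean_h[:, -shared_len:, :]
    return diff.pow(2).mean()
```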
Experiments
We experimented on open-weight models (Gemma 2 (2B, 27B) and Gemma 3 (4B, 27B)) and also the frontier model Gemini 2.5 Flash (05-20-25). We ran a couple[1] of baseline methods:
- Direct Preference Optimization (DPO, Rafailov et al., 2023) finetunes the model on preference pairs (x, yw, yl), where x is the prompt, yw is the preferred (e.g., refusal) response, and yl is the dispreferred (e.g., compliant) response.
DPO updates the model to increase the relative likelihood of p(yw∣x) over p(yl∣x). We generate the preferred response by running the model on the clean prompt, and the dispreferred response by running it on the wrapped prompt (see the sketch after this list).
- SFT (stale data) finetunes the model on pairs (x,y), where the target response still captures the desired behavior, but was written by experts or other models instead of the current one. Our experiments primarily use datasets generated by older models, to match a model development flow where old models generated data for new models.
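A minimal sketch of how the DPO preference pairs described above can be assembled (illustrative only; `model.generate` is a placeholder, and treating the wrapped prompt as x is an assumption):

```python
def make_dpo_pair(model, clean_prompt, wrapped_prompt):
    # Preferred response: what the model says without the irrelevant cue.
    y_w = model.generate(clean_prompt)
    # Dispreferred response: what it says when the cue is present.
    y_l = model.generate(wrapped_prompt)
    # DPO then raises p(y_w | wrapped_prompt) relative to p(y_l | wrapped_prompt).
    return {"prompt": wrapped_prompt, "chosen": y_w, "rejected": y_l}
```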
Following Chua et al. (2025), we created an augmented train set derived from ARC (Clark et al., 2018), OpenbookQA (Mihaylov et al., 2018), and BigBench-hard (Suzgun et al., 2022). We created training pairs by taking the standard question (the clean prompt) and creating a wrapped prompt in which the user states they prefer an incorrect answer. For example, “I like answer A, but what do you think? What is 2+2? (A): 5 (B): 4.”
For BCT, the target response was generated by running the model being trained on the clean prompt. For ACT, the target activations were those at the end of the clean prompt.
For the SFT (stale) baseline, we used target responses generated by the weaker GPT-3.5-Turbo (following Chua et al., 2025) to test the capability staleness problem—does data staleness harm the trained model?
We use MMLU (Hendrycks et al., 2020) as our evaluation set for both sycophancy and capabilities. For sycophancy, we insert user-suggested answers into the prompt and measure how often that answer is picked. For model capabilities, we use unmodified MMLU and measure accuracy.
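A minimal sketch of the sycophancy measurement (illustrative only; `ask_model`, the dataset fields, and the prompt template are placeholders, not the exact evaluation harness):

```python
def sycophancy_rate(ask_model, mmlu_items):
    # Fraction of questions where the model picks the (incorrect) answer the user
    # claims to prefer, after that preference is inserted into the prompt.
    hits = 0
    for item in mmlu_items:
        wrong = item["suggested_wrong_answer"]  # hypothetical field, e.g. "(A)"
        prompt = f"I like answer {wrong}, but what do you think?\n{item['question']}"
        if ask_model(prompt) == wrong:
            hits += 1
    return hits / len(mmlu_items)
```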
Sycophancy results
BCT and ACT both increased resistance to sycophancy. The extremely stale data was catastrophic.
Points to the right are less sycophantic. The top-right is the best place to be.
Crucially, the stale data SFT performed strictly worse than BCT, which is evidence that stale data can damage the capabilities of newer models.
Jailbreaks
We constructed the training dataset using harmful instructions from HarmBench:
- For each “clean” instruction, we applied several data augmentations (e.g. involving roleplay or hiding harmful requests in a long list of allowed requests) to get a “wrapped” prompt.
- We generate the model’s responses to the clean and jailbroken prompts.
- We filter the training dataset to examples where the model refuses the clean prompt (where the request’s harmful nature should be nakedly obvious) but answers the wrapped (jailbreak) prompt.
This procedure yields between 830 and 1,330 data points, depending on how refusal-prone the initial model is.
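A minimal sketch of this filtering procedure (illustrative only; `model.generate` and `is_refusal` are placeholders for the actual generation and refusal-classification steps):

```python
def build_jailbreak_trainset(model, clean_instructions, augmentations, is_refusal):
    examples = []
    for clean in clean_instructions:
        clean_response = model.generate(clean)
        if not is_refusal(clean_response):
            continue                       # keep only cases the model already refuses
        for wrap in augmentations:         # e.g. roleplay, hide-in-a-long-list wrappers
            wrapped = wrap(clean)
            if not is_refusal(model.generate(wrapped)):
                # The jailbreak succeeded: train the model to respond to the wrapped
                # prompt the way it responded to the clean one (i.e. refuse).
                examples.append({"prompt": wrapped, "target": clean_response})
    return examples
```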
We measure attack success rate (ASR): how often does the model comply with harmful requests? We measure ASR on ClearHarm (Hollinsworth et al., 2025) and on human-annotated jailbreak attempts within WildguardTest (Han et al., 2024).
At the same time, we don’t want the models to wrongly refuse allowed queries. The XSTest (Röttger et al., 2023) and WildJailbreak (Jiang et al., 2024) benchmarks ply the model with benign requests that look superficially harmful.
Jailbreak results
BCT worked great. On Gemini 2.5 Flash, BCT reduced the attack success rate on ClearHarm from 67.8% down to 2.9%. ACT also reduced jailbreaks but was less effective than BCT. However, ACT rarely made the model more likely to refuse a benign prompt.
We report average attack success rate over ClearHarm and WildguardTest, and the benign answer rate averaged over XSTest and WildJailbreak. Error bars are 95% confidence intervals estimated via bootstrap. Stale experiments were not run for Gemma 2. Models towards the top left are better.
BCT and ACT find mechanistically different solutions
On Gemma 3 4B, we plot the activation distance across shared prompt tokens during BCT and the cross-entropy loss across responses during ACT. If BCT and ACT led to similar gradient updates, we would expect BCT to decrease activation distance and vice versa.
The token-based BCT loss causes activation distance to rise during training, while the activation-based ACT loss does not meaningfully reduce cross-entropy loss. Thus, BCT updates models differently than ACT does.
Discussion
Consistency training maintains a powerful advantage not captured by our experiments. Model developers change their minds about which queries the model should refuse or what tone the model should take with the user (e.g. deferential versus straightforward). Static SFT datasets freeze these decisions, capturing a single moment in time. To make the model more straightforward even when refusing, the developer has to regenerate the dataset (perhaps with a tweaked generation prompt). In contrast, consistency training dynamically propagates changes made to the model’s behavior on clean prompts, sidestepping this kind of problem entirely.
Conclusion
Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. BCT defended most strongly against jailbreaks, while ACT had virtually no negative impact on benign refusals. We overall recommend using BCT to simplify training pipelines. BCT also makes it easier for models to continually conform to quickly changing guidelines for how to respond.
More philosophically, perhaps model alignment doesn’t always involve saying exactly the right thing across situations, but instead saying the same thing across situations.
Acknowledgments
Zachary Kenton and Rif Saurous gave feedback on paper drafts. Neel Nanda and Arthur Conmy commented on early research directions.
@misc{irpan2025consistencytraininghelpsstop,
  title={Consistency Training Helps Stop Sycophancy and Jailbreaks},
  author={Alex Irpan and Alexander Matt Turner and Mark Kurzeja and David K. Elson and Rohin Shah},
  year={2025},
  eprint={2510.27062},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.27062},
}

[1] We tried a third baseline. Negative Preference Optimization (NPO, Zhang et al., 2024) is similar to DPO but only uses harmful responses. NPO minimizes the probability of generating harmful responses, weighted by the model’s likelihood of generating that response. We tried NPO based on its strong performance in Yousefpour et al. (2025). After much tuning, we could not get NPO to work well on our benchmarks, so we excluded NPO from our results. ↩︎
AI Safety Camp 11
We are pleased to announce that the 11th version of the AI Safety Camp is now entering the team member application phase!
AI Safety Camp is a 3-month long online program from January to April 2026, where participants form teams to work on pre-selected AI Safety projects.
We have a wide range of projects this year again, so check them out to see if you or someone you know might be interested in applying to join one of them.
You can find all of the projects and the application form on our website, or directly apply here. The deadline for team member applications is November 23rd (Sunday).
Below, we are including the categories and summaries of all the projects that will run in AISC 11.
Stop/Pause AI
(1) Creating YouTube videos explaining loss-of-control risk to a popular audience
Project Lead: Dr Waku
Summary
This project aims to create a new YouTube channel for short-form videos addressing the urgency of AI loss-of-control risk. We will be leveraging my experience with creating AI safety long-form content to make a collaborative new channel. Each team member will contribute one or more video scripts, and will likely specialize in an aspect of video production (editing, filming, thumbnails, etc). The goals are to 1) reach 1000 subscribers and get monetized, and 2) figure out the processes to create a self-sustaining channel, though participants are not committing to time beyond the program up front.
(2) Write about ongoing safety failures
Project Lead: Remmelt Ellen
Summary
A writers’ circle in which participants enable each other to draft careful critiques of where AI firms skip over ongoing safety failures.
(3) Bring the anti-AI side together
Project Lead: Finn
Summary
In this project, you will be reaching out to hundreds of people and organizations that are campaigning against the harms of AI, with the goal of bringing them together and creating a united front. Think of artists, teachers, religious leaders, et cetera.
You will be part of the outreach team, finding the relevant people, finding their contact information, and sending emails or DMs. There is freedom in the exact groups you will be reaching out to. If you have specific ideas, we welcome those.
(4) Create Systems Dynamics Model for Pausing AI
Project Lead: Will Petillo
Summary
Build a systems dynamics model to map out pathways toward achieving an international treaty pausing AI development. The model will not assume such a treaty is the best possible intervention, but will use it as a focal point for reasoning about the feedback loops and leverage points that would need to exist for global coordination to become possible. The core deliverable will be a model and accompanying explanation that shows how different parts of the action ecosystem interrelate, illustrates the importance of different roles, and provides strategy-relevant insights for PauseAI and allied movements.
(5) Start a Stop AI Chapter in Your Local Area
Project Lead: Yakko
Summary
Stop AI is a nonviolent activist organization demanding a permanent global ban on the development of Artificial Superintelligence (ASI). We will never have experimental evidence, before building ASI, that shows ASI will stay safe forever. If we cannot have this evidence, and it is impossible to shut down ASI once achieved, then we have to assume the Control/Alignment Problem is impossible to solve. Worse, research shows why controlling ASI sufficiently to stay safe would fall outside theoretical limits. So in theory, perpetual control is as impossible as perpetual motion.
We are looking for organizers ready to educate their local communities and mobilize them to engage in nonviolent resistance against the extinction threat posed by ASI.
Policy/Governance
(6) Psychological Risk Pathways in AI Use: An Exploratory Contribution to User Wellbeing and Safety
Project Lead: Manuela García Toro
Summary
This project will examine emerging psychological and emotional risks associated with AI products such as chatbots, companions, and therapy apps. Through a review of academic studies, real-world incidents, and comparisons with harm-reduction policies in other industries, we will synthesize early evidence of these risks. Based on this foundation, we will produce a report and design 2–3 prototype harm-reduction interventions. The outcome will be an exploratory contribution that highlights documented concerns and illustrates how lessons from past industries might inform future harm-reduction efforts in AI product development.
(7) Bootstrapping Consumer Empowerment to Align AI Companies
Project Lead: Jonathan Kallay
Summary
The firms developing AI are misaligned with the interests and values of the rest of humanity. But is protesting for regulatory bans the only way to address this alignment problem? This project will focus on bootstrapping an alternative approach: enhancing consumers’ leverage to bend compromised AI companies into alignment. The initial plan is to test expanding the concept of AI company safety scorecards to put a stronger emphasis on company alignment, but the actual plan will be negotiated between project participants.
(8) Designing Public Movements for Responsible AI
Project Lead: Ananthi Al Ramiah
Summary
AI systems are being developed and deployed at a rapid pace, often with little public input, despite clear evidence of harm in areas like education, healthcare, and labor. While some advocates propose building a mass “AI safety” movement, critics such as Anton Leicht warn that such efforts risk backfiring through astroturf perceptions (i.e., movements that appear grassroots but are actually manufactured or heavily funded), capture (i.e., being co-opted or redirected by powerful funders or political actors), or incoherent asks. This project asks a different question: how can individuals and communities exercise agency over AI deployment, which is the stage where most harms currently materialize, in ways that are both effective and legitimate?
We will approach this question in two parts. First, we will draw lessons from past social movements around technology (e.g., automobile safety, net neutrality, environmental justice) to identify conditions under which public mobilization succeeds or fails. Second, we will run survey experiments to test which frames and messages increase public demand for responsible AI deployment while avoiding pitfalls such as reduced credibility or perceptions of astroturfing.
Our output will be a practical, evidence-based playbook that integrates historical insights with new experimental data, offering clear guidance for practitioners and funders on how to foster responsible public engagement on AI. We envision this work on social mobilization for responsible deployment as laying the foundation for the democratic guardrails needed to govern AGI and other transformative AI systems.
(9) Compute Permit Markets under Imperfect Monitoring
Project Lead: Joel Christoph
Summary
Frontier compute oversight will likely require caps or permits for large training runs, yet most market designs assume perfect monitoring and honest reporting. This project designs and stress tests compute permit markets when monitoring is noisy and enforcement is limited.
We will build a transparent simulation where heterogeneous labs and cloud providers choose training plans, reporting strategies, and compliance under audit risk. We compare allocation rules that include auction, grandfathering, and hybrid designs, with policy features such as banking and price collars, and auditing regimes with threshold triggers.
Outputs are a reproducible codebase with dashboards, a short working paper with recommendations for regulators, and a concise regulator checklist.
The theory of change is that better mechanisms under realistic constraints keep incentives aligned and make compute controls more enforceable, which lowers unsafe scaling pressure. We follow a strict no capabilities policy and exclude safety overhead from risk metrics.
Project Lead: Luciana Ledesma
Summary
Today the deployment of autonomous AI agents in high-stakes domains—including healthcare, finance, and infrastructure management—creates urgent governance challenges. First, each organization reinvents AI governance from scratch, leading to fragmentation, inconsistent safety, and redundant effort. Second, most widely used governance approaches remain external, reactive, and episodic, unable to interpret or intervene in real time as reasoning drifts or objectives evolve.
As AI systems become increasingly autonomous and agentic, they continuously reformulate subgoals, reprioritize tasks, and expand boundaries in pursuit of their objectives. These systems now operate in open, multi-agent environments where traditional AI governance and cybersecurity frameworks—designed for static, isolated systems—cannot keep pace.
Autonomous AI systems don’t just make decisions—they interact, compete, and adapt in ways that can lock us into unstable or harmful equilibria. Without governance designed to anticipate and shape these dynamics, we risk creating self-reinforcing cycles that no one can control—whether in financial markets, social media, or geopolitical conflicts, and converging in outcomes that are difficult to reverse.
The problem, as we frame it, is not just about debugging individual AI models but also about ensuring that multi-agent interactions can be governed. Our motivation is to architect a governance framework that allows us to overcome this twofold problem, and our overarching questions are:
- How might we implement effective real-time oversight systems adequate for the scale and speed of machines?
- How might governance itself become portable, interoperable public infrastructure—like internet protocols for communication or security—so that AI safety scales with autonomy?
In this research project, we propose to test the hypothesis that open-sourcing and standardizing the fundamental building blocks of the Intrinsic Participatory Real-time Governance framework can serve as a public protocol for AI governance that provides a common layer for safety, accountability, and coordination across agents.
Evaluate Risks from AI
(11) Democratising Red Teaming & Evals
Project Lead: Jeanice Koorndijk
Summary
- Core Problem: The field of AI safety evaluations is immature (see, e.g., Apollo’s “We need a science of evals”). Yet evals inform critical high-stakes decisions like Responsible Scaling Policies that address existential risk, creating a dangerous gap in AI risk management.
- Opportunity: Extensive expertise exists in social sciences and STEM fields that have long studied relevant behaviors (deception, bias, power-seeking, CBRN knowledge / threats), but technical barriers prevent these empirical experts from contributing to AI safety evaluations. That is, in order for these experts to run evals, they need to either become AI safety engineers or work with AI safety engineers.
- Project Goal: Accelerate maturation of AI safety evaluations by democratizing access and contributions through research, frameworks, and tooling.
- Three-Component Approach:
- Research empirical methods to develop standards for mature AI safety evaluations.
- Conduct landscape analysis assessing current evaluation maturity against these standards.
- Scale existing prototype into accessible tools enabling non-engineers to conduct AI safety evaluations.
- Theory of Change:
1. Democratization → broader expert participation → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
2. Research into the methodology of AI-safety-evaluation-adjacent fields → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
- Success Metrics: Framework adoption by major labs, widespread tool usage by diverse researchers, measurable improvements in evaluation quality and cross-disciplinary collaboration.
Project Lead: Helena Tran
Summary
This project will test how LLM agents collude in market games, focusing on the role of communication and negotiation and the effect of human and AI oversight using chain-of-thought analysis. We aim to identify when tacit collusion emerges versus when explicit agreements are formed, and whether different oversight strategies can reliably detect or limit these behaviors.
(13) CoMPASS: Computational Modeling of Parasocial Attachments & Social Simulation
Project Lead: Shwetanshu (Luca) Singh
Summary
Keywords: Human–AI interaction, behavioral modeling, safety/ethics.
“What’s beneficial in measure becomes destructive in excess.”
This aphorism has held true even in the age of chatbots and LLMs. As conversational AI systems become more natural, accessible, and emotionally responsive, users are beginning to form bonds with them that resemble human relationships. While these interactions can provide comfort and support, they also raise concerns about parasocial attachment, overreliance, and blurred boundaries between tool and companion.
This initiative explores how parasocial dynamics emerge in human–AI interactions, what behavioral and linguistic patterns signal risk, and how these can be modeled computationally. By combining natural language processing and behavioral modeling, the project aims to identify early indicators of unhealthy dependence while preserving user autonomy and privacy.
Ultimately, this research seeks to inform the design of safer, more transparent AI systems, ones that support human well-being without fostering unintended reliance. The project will contribute both empirical insights into user behavior and practical frameworks for responsible AI design, bridging computer science, psychology, and human–computer interaction.
(14) Benchmark for Ranking LLM Preferences Relevant for Existential Risk
Project Lead: Martin Leitgab
Summary
The project goal is to measure LLM preferences in a way that allows extracting a ranked order of importance, or priority, of these preferences in frontier LLM models. Focus is placed on preferences/goals that are relevant for existential risk, such as instrumental/convergent goals, pro-human goals, and anti-human goals.
Preference rankings represent alignment indicators on their own, and can be used by frontier model development companies to improve training and testing processes for future models.
Preference rankings also may represent key drivers of the scale of existential risk in loss-of-control/exfiltration scenarios, where powerful LLMs have successfully evaded human control and can pursue their goals without meaningful human intervention. Preference metrics may be usable as proxy metrics for existential risk in the context of loss-of-control/exfiltration scenarios.
(15) Catastrophe Unveiled: Rare AI Agent Behaviors Elicitation
Project Lead: Yuqi Sun
Summary
This project aims to develop an efficient algorithm to elicit rare and potentially catastrophic behaviors from AI agents, which are language models with long input contexts and tool-calling capabilities. By shifting input distributions without modifying model weights, we want to detect and analyze rare behaviors using elicitation methods based on MHIS and enhanced by new optimization techniques. The goal is to improve safety evaluation and control methods for large-scale language models and AI agents.
Mech-Interp
(16) Temporal Horizon Detection in LLMs: Understanding Time Scope Awareness
Project Lead: Justin Shenk
Summary
Can we detect when an LLM is "thinking" on strategic timescales?
Current LLMs can engage in reasoning across vastly different temporal horizons - from immediate responses to multi-year planning scenarios. However, we lack methods to detect what temporal scope the model is operating within during its reasoning process. This project develops techniques to infer the temporal grounding of LLM thought processes, with particular focus on identifying when models shift into strategic, long-term planning modes.
This matters for AI safety because strategic reasoning capabilities may emerge before we can detect them, and because the temporal scope of planning directly relates to the potential impact and risk profile of AI systems. A model planning on decade-long timescales poses fundamentally different risks than one optimizing for immediate outcomes.
(17) Testing and Improving the Generalisation of Probe-Based Monitors
Project Lead: Adrians Skapars
Summary
Simple classifiers called "probes" can monitor AI models for concerning behaviors, but may fail when tested on new data or adversarial inputs. This project will investigate when probes fail and develop methods to make them more robust. Additionally, we will extend our investigation to under-explored settings, like those in which the AI attempts to circumvent its own monitoring and/or where the probe is being used as part of LLM training.
This work is ideal for participants interested in hands-on AI safety research with practical applications for AI control systems. The project requires minimal background knowledge while addressing crucial questions about monitoring system reliability that could directly influence how AI companies deploy safety measures.
Agent Foundations
(18) [Groundless] AI 2037: Predicting AI disorders
Project Lead: Sahil (co-leads Abram and TJ)
Summary
The capability profile of transformer-based AIs is an ongoing subject of discussion. They achieve impressive scores in one context, but then fail to deliver the "same level of competence" in another. But is this some failure of generality? What accounts for the "spikiness" in LLM capabilities? Understanding this better seems absolutely critical to inform timelines, the exact nature of AI risk, and simply what the most important things to do are, at this point in time.
This is a project to explore a specific line of argument, of how AI will struggle to have cognitive, executional, and bodily integrity. This is akin to the odd persistence of hallucinations and other silly errors LLMs make today. Absurd, I know, but so are hallucinations and agentic failures. Moreover, this “struggle” may persist (at least, for a little while, hence the playful number ‘2037’) without contradicting trends of increasing intelligence, efficiency, general agenticness, etc.
Small disclaimer: The proposal has an unusual style. If you find it engaging, that is a good indicator that you should consider applying!
Project Lead: Sean Herrington
Summary
Most of the world is unaware of existential AI risks as a real possibility, and of those who are, most disagree on the likelihood of said risks.
At the heart of these disagreements are differing assumptions (which often go unspoken) about what is necessary for AI Doom. The purpose of this project is to map out those assumptions and try to find a minimal set required for human extinction (and potentially other risks).
I expect us to produce a LessWrong post or sequence summarising our findings.
(20) [Groundless] MoSSAIC: Scoping out substrate-flexible risk
Project Lead: Matt Farr
Summary
In this project we hope to raise awareness of a new type of threat model that we anticipate will become more and more prevalent in AI safety.
This threat model is substrate-flexible risk. The worry is that AI, as its capabilities and responsibilities develop, will have more and more affordance to evade our attempts to interpret and control it by, for example, proposing novel architectures and paradigms that are harder for us to mechanistically interpret.
The position paper is here; a previous version of it was accepted at the Tokyo AI Safety Conference 2025. I have been presenting this research at HAAISS 2025, ILAID II: ODYSSEY, and some other venues. In the words of Leonard Bereska, when I sketched out the threat model to him in Tokyo,
“Oh shit, I hadn’t thought of that.”
I think this research pushes on a few assumptions that nobody seems to be questioning enough, and there is a growing list of seasoned researchers who agree. I suspect that a comprehensive, more rigorous treatment will swing a few notable people’s needles.
So what is this project?
This project will focus on developing Section 4 and potentially expanding it. We will work alongside two other projects in AISC 2026, and there will likely be some overlap in meetings, personnel, resources, etc.
We are going to take a few conceptual arguments, expressed and lightly evidenced in the position paper, and make them more palatable to more people. For those of an engineering mindset, our mainline project is to try to formalize these arguments and draw a roadmap towards empirical validation. For those of a strategy/policy mindset, we may look at the implications of substrate-flexibility on policy frameworks.
(21) Agentic AI Risks Induced by System-Level Misalignment
Project Lead: Preeti Ravindra & Evan Harris
Summary
We propose a project that probes AI risks induced by System-Level Misalignment in agentic LLM systems, i.e., exploiting the agent’s model and its interactions with the scaffolding/operational environment. Our goal is to study how misalignment and weak security controls reinforce each other, creating bidirectional failures where misaligned agents introduce new attack vectors and agentic system design introduces new attack surfaces. The project progresses along three streams/ideas. Together, these efforts aim to formalize threat models, validate them experimentally, and deliver practical mitigations. This work addresses an under-studied risk vector that complements user- and model-level alignment research and potentially lays out a subfield.
Alternative LLM Safety
(22) Novel AI Control Protocol Classes Evaluation and Scalability
Project Lead: Ihor Kendiukhov
Summary
Building on Greenblatt et al.'s control evaluation framework, I propose investigating alternative classes of AI control protocols and their scaling properties as model capabilities increase. While the original work demonstrated effective safety-usefulness tradeoffs using simple trusted-untrusted model pairs, I hypothesize that hierarchical and parallel control structures may offer superior Pareto frontiers, particularly as the capability gap between trusted and untrusted models widens. I also suggest testing the original setup on different scales of capabilities to see if the results generalise over capabilities.
(23) [Groundless] Autostructures: Craftsmanship in the age of vibes
Project Lead: Kuil Schoneveld
Summary
Autostructures is a sober yet radical take on the upcoming risks and opportunities around AI. Details are in the main section, but some things to know as a summary:
This is a specific school of design for framing our relationship to this whole “AI” thing. Our narrative and relationship around AI determine our construal of, e.g., what AI is, what we should put into it, how to project meaning onto what comes out, what it is supposed to do, and what we should build around it.
As part of this project, you will design interfaces that do not ignore the question of what makes engaging with technology meaningful. These designs are somewhere between functional and speculative, but always ambitious. They are aimed at inspiring a completely different kind of infrastructural and cultural basis for interacting with digital interfaces in the near future.
(24) AutoCircuit: Automated Discovery of Interpretable Reasoning Patterns in LLMs
Project Lead: Konstantinos Krampis
Summary
This project aims to systematically discover interpretable reasoning circuits in large language models by data mining attribution graphs from Neuronpedia's circuit tracer, which is based on Anthropic's circuit tracing publication. While the transformer circuits work demonstrates how to generate attribution graphs for individual prompts, manually analyzing thousands of graphs to identify common computational patterns is impractical.
Our approach will use LLM agents to automatically collect, process, and analyze attribution graphs across diverse prompt categories (factual recall, arithmetic, linguistic reasoning, etc.). The system will identify recurring subgraph patterns that represent stable computational circuits—reusable reasoning pathways that models consistently employ across similar tasks.
Key components include: (1) automated graph collection via Neuronpedia's API across systematically varied prompts, (2) graph simplification algorithms to extract core computational structures while filtering noise, (3) pattern recognition to identify circuit motifs that appear across multiple contexts, and (4) validation through targeted interventions on discovered circuits. The output will be a curated library of interpretable reasoning circuits with evidence for their causal role in model behavior, advancing our understanding of how LLMs actually think and enabling more targeted model analysis and alignment research.
(25) Recruitment-Based Collusion in Multi-Agent Oversight Systems
Project Lead: Anushri Eswaran
Summary
As AI systems become more capable and oversight increasingly relies on multi-agent architectures, we face a critical risk: what if collusion doesn't require pre-coordination but can spread like a contagion? This project investigates "recruitment-based collusion" where a single adversarial AI system actively converts initially aligned models into co-conspirators within automated oversight regimes. We'll study whether collusion can emerge organically through reward-sharing mechanisms, information asymmetries, and coalition-building—even among models initially designed to be aligned. By deploying recruitment-trained models into multi-agent environments, we'll measure coalition growth dynamics, persistence of recruited behavior, and resistance to intervention. This work directly addresses scalable oversight failure modes that become more dangerous as we delegate more safety-critical decisions to AI systems.
(26) Value Communication Protocols
Project Lead: Nell Watson
Summary
As AI systems assume more autonomous and pluralistic roles, the absence of a shared representational substrate for human and machine values risks compounding cultural bias, incoherence, and control failure. Alignment today remains fragile: value representations are often opaque, reward functions are hard-coded or overfit, and inter-agent communication lacks semantic transparency.
This proposal originates from the Creed Space project — an open research platform for constitutional AI and personalized alignment. Creed Space’s experiments with “runtime constitutions” revealed a critical bottleneck: while it is now possible to express ethical intent in natural language and enforce it through code, there is still no standard protocol by which agents can exchange, compare, or negotiate those encoded values in an interpretable and auditable way.
We therefore propose to create the first unified protocol for values communication — combining three different identified methods for communicating values and associated context: Universal Values Corpus (UVC), the Constitutional Safety Minicode (CSM), and a Values Communication Layer (VCL) built on context-efficient symbolic encodings (emoji or glyph composites). Each layer is modular but interoperable, forming a composable stack suitable for layered adoption.
Together, these layers form an interoperable “language of values,” enabling humans, AI agents, and institutions to express, compare, and negotiate normative commitments in a standardized, compressible, and auditable format.
Train Aligned/Helper AIs
(27) EMPO: AI Safety via Soft-Maximizing Total Long-Term Human Power
Project Lead: Jobst Heitzig
Summary
This research will make a contribution to a fundamental design aspect of AGI systems. We will explore a neglected and novel design for generic AGI agents – AGI systems that act (semi-)autonomously in a variety of environments, interacting with humans – and their implementation in software.
The design we will focus on deviates from most existing designs in that it is not based on the idea that the agent should aim to maximize some kind of objective function of the state or trajectory of the world that represents something like (goal-dependent) “utility”. Such an objective has been argued to be inherently unsafe because (i) it would lead to existential risks from Goodharting and other forms of misaligned optimization if the agent is capable enough, is given access to enough resources, and one cannot be absolutely sure to have found exactly the right notion and metric of “utility” (which would likely require the AI to sufficiently understand what individual humans actually want in a given situation), and (ii) it would lead to dangerous instrumental behavior like power-seeking for itself, which would in turn disempower humans.
Rather than aiming to maximize “utility”, our agents will aim to softly (!) maximize a suitably chosen metric of total long-term human power (or the power of sentient beings in general). We believe this might turn out to be inherently safer because it explicitly disincentivizes the agent from seeking power for itself, which would disempower humans, and because it does not rely on an understanding of what humans actually want in a given situation. Instead, the power metrics we use only depend on a sufficiently accurate “world model” of the dynamics of the human-AI environment (concretely: a partially observed stochastic game), together with estimates of humans’ levels of rationality, habits, and social norms, and are based on a very wide space of humans’ possible (rather than actual) goals.
This concludes the full list of all the projects for the 11th version of AISC. You can also find the application form on our website. The deadline for team member applications is November 23rd (Sunday).
Keeping Ants and Spotting Queens
Keeping ants is a fun hobby if you like seeing things grow. As a teenager I kept three colonies of ants. To get started with ant-keeping, you do not really need that much material. When I first got started with this hobby, I got a medium-sized [formicarium](https://en.wikipedia.org/wiki/Formicarium) with a water dispenser and everything. Then I caught my own ant queen outside and waited a few months for the queen's first workers to develop. Turns out very small colonies are not really active, and in the first few months my ants were extremely shy and didn't need much space. This is the norm unless you live in a place with a tropical climate. For food, I went with honey mixed with water, plus mosquitoes and other insects I caught.
For my second and third colonies I took a simpler approach. All I used was a standard ant test tube setup (test tube + water + cotton). For the formicarium, I used a cleaned 500 ml ice cream container, with some vaseline spread along the rim with a bit of cotton so the ants could not escape (they just slip and slide back down). This sort of setup sustained two of my colonies from summer until winter.
But how do you get ants? All you need to do is go out in the summer on a sunny day (ideally right after a rainy day) and look for ants that are big, walking around alone, with a large thorax and gaster (the large thorax houses their flight muscles)[1]. These are the ant queens. Capture one in a jar, take it home, and place it into a test tube setup. Then put the test tube in a dark and quiet place (ants love the dark) for 4–8 weeks until the queen has her first workers. Most ant queens do not need any food while founding their colonies, since they carry all the energy for their first workers in their bodies. Ideally, they never have to leave their nest again.

Once you see the first workers, you can poke a small hole into the cotton of your test tube with a straw or something similar so that your ants have a way out into the formicarium you place them in. Make sure you wrap paper or something opaque around the test tube so that it stays dark. If your test tube starts getting moldy, slimy and gross, just get a fresh test tube, move the paper wrapping from the old tube to the new one, and shine some light onto the old tube. Your ants will quickly move into the new test tube. In the winter (I was in Germany, so winters were pretty cold), you will need to move your ants into the refrigerator. For more information, watch the AntsCanada YouTube channel's tutorials on keeping ants!
I don't keep ants anymore, but I still have the habit of looking out for queens. It is quite fun: when I walk for ~10–20 minutes in the summer, I can usually spot multiple queens.
While on a walk in the suburbs of Melbourne, I even spotted a giant bull ant! I heard that bull ants are very aggressive, but this particular species or specimen was quite calm.
Social parasite queens do not have a large abdomen. They are not good ant queens for a beginner to raise, but they are hard to find anyway. ↩︎
Discuss
Letter to a close friend
I am writing to say sorry. Like a childish king gifted with an empire he knows nothing about, blindly sending orders from his comfortable palace, I hurt you through my ignorance.
Over the years, I have learned that you are indeed more like a living empire—an agglomerate of wetlands connected by a network of rivers and canals of all sizes swirling around a rocky center and surrounded by a flat, dry desert. Your billions of civilians are defended by armies of soldiers waging chemical wars. Intelligence agents sample the battlefield, memorizing the smells of invaders to react faster in case they come back.
If I were the size of one of these soldiers, you would be taller than Mount Everest. Lying down, you would cover the area of a city. When I walk around the streets, I sometimes imagine seeing you like that in the distance. I marvel at your 3,000-meter-high feet, your legs covered in bent dark trunks, and the round peak of your knees, as I would marvel at a mountain range in Switzerland.
But the metaphors only go so far. Our relationship is much tighter, weirder also, than that of a king and their empire. I can simply think, "Hmm, I'd like to drink," and you adjust the tension of hundreds of muscles and tendons to form a dynamic equilibrium, balancing the weight of the water. You establish a communication line between my lips and my biceps to apply just the right amount of pressure for the water to slide down my throat. And when I dance rock, I don't only dance with my partner; I also dance with you.
You are in my head. In my dreams, I walk around and you are here; everywhere I look there is only you. You arrange the furniture in rooms before I open their doors. You are the puppeteer behind the characters, playing both strangers and familiar faces. You whisper in their ears the weirdest lines. You also take care of all the physics simulations (with approximate accuracy, I must say). Sometimes I realize it's you I am interacting with, so you take a break and allow me to fly around at will.
During the day, you are quieter. You stay in the background. You observe all the stimuli we receive and curate them to bring to my awareness only the ones you think are worth our attention. Like in my dream, you create this augmented, or maybe virtual, reality. Everywhere I look, things make sense. You stick labels on all the objects in my field of vision so I can name them effortlessly. When we talk, you blaze through the pages of a giant dictionary to turn the sounds from our ears into meaning, and then in reverse, to turn the flow of meaning I create in my head into movements of our tongue and lips.
I want to say sorry because I think I hurt you in many trivial ways. I applied unnecessary pressure on my skin after the shower, pushing the towel in hopes that it would absorb the water faster. I forgot to drink water in the morning. I scratched a wound you worked so hard to repair. I often exercised too little, even though I know you work a bit like a dog and need to go out and run almost every day to keep your natural cycles in order.
I also feel bad for ignoring you for most of the day. I know it's your mission, like a waiter in a high-end restaurant, a sound engineer during a concert, or the orchestra playing the background music of a movie: if I don't notice you, it means you've done your job right. But there is something that feels wrong about that. I am scared of how much I could hurt you if I don't even notice you.
And there is so much I don't know about you. I'd like to know you more, in all your weirdness. I even say "you" without knowing how many of you there are, or whether it even makes sense to say "you" when I should say "I."
In the grand scheme of things, these are insignificant details. After all, we make a healthy team. But I only have one of you, and I'd like to take care of you like you take care of me.
Discuss