RSS Feed Aggregator
Apply to the Cambridge ERA:AI Winter 2026 Fellowship
Apply for the ERA:AI Fellowship! We are now accepting applications for our 8-week (February 2nd - March 27th), fully-funded, research program on mitigating catastrophic risks from advanced AI. The program will be held in-person in Cambridge, UK. Deadline: November 3rd, 2025.
→ Apply Now: https://airtable.com/app8tdE8VUOAztk5z/pagzqVD9eKCav80vq/form
ERA fellows tackle some of the most urgent technical and governance challenges related to frontier AI, ranging from investigating open-weight model safety to scoping new tools for international AI governance. At ERA, our mission is to advance the scientific and policy breakthroughs needed to mitigate risks from this powerful and transformative technology. During this fellowship, you will have the opportunity to:
- Design and complete a significant research project focused on identifying both technical and governance strategies to address challenges posed by advanced AI systems.
- Collaborate closely with an ERA mentor from a group of industry experts and policymakers who will provide guidance and support throughout your research.
- Enjoy a competitive salary, free accommodation, meals during work hours, visa support, and coverage of travel expenses.
- Participate in a vibrant living-learning community, engaging with fellow researchers, industry professionals, and experts in AI risk mitigation.
- Gain invaluable skills, knowledge, and connections, positioning yourself for success in AI risk mitigation or AI policy.
- Our alumni have gone on to lead work at RAND, the UK AI Security Institute & other key institutions shaping the future of AI.
I will be a research manager for this upcoming cohort. As an RM, I'll be supporting junior researchers by matching them with mentors, brainstorming research questions, and executing empirical research projects. My research style favors fast feedback loops, clear falsifiable hypotheses, and intellectual rigor.
I hope we can work together! Participating in last summer's fellowship significantly improved the impact of my research and was my gateway into pursuing AGI safety research full-time. Feel free to DM me or comment here with questions.
FAQ: Expert Survey on Progress in AI methodology
Context
The Expert Survey on Progress in AI (ESPAI) is a big survey of AI researchers that I’ve led four times—in 2016, then annually: 2022, 2023, and 2024 (results coming soon!)
Each time so far it’s had substantial attention—the first one was the 16th ‘most discussed’ paper in the world in 2017.
Various misunderstandings about it have proliferated, leading to the methodology being underestimated in terms of robustness and credibility (perhaps in part due to insufficient description of the methodology—the 2022 survey blog post was terse). To avoid these misconceptions muddying interpretation of the 2024 survey results, I’ll answer key questions about the survey methodology here.
This covers the main concerns I know about. If you think there’s an important one I’ve missed, please tell me—in comments or by email (katja@aiimpacts.org).
This post throughout discusses the 2023 survey, but the other surveys are very similar. The biggest differences are that a few questions have been added over time, and we expanded from inviting respondents at two publication venues to six in 2023. The process for contacting respondents (e.g. finding their email addresses) has also seen many minor variations.
Summary of some (but not all) important questions addressed in this post.
How good is this survey methodology overall?
To my knowledge, the methodology is substantially stronger than is typical for surveys of AI researcher opinion. For comparison, O'Donovan et al. was reported on by Nature Briefing this year, and while 53% larger, its methodology appeared worse in most relevant ways: its response rate was 4% next to 2023 ESPAI's 15%; it doesn't appear to report efforts to reduce or measure non-response bias; the survey population was selected by the authors and not transparent; and 20% of completed surveys were apparently excluded (see here for a fuller comparison of the respective methodologies).
Some particular strengths of the ESPAI methodology:
Size: The survey is big—2778 respondents in 2023, which was the largest survey of AI researchers that had been conducted at the time.
Bias mitigations: We worked hard to minimize the kinds of things that create non-response bias in other surveys—
We wrote individually to every contactable author for a set of top publication venues rather than using less transparent judgments of expertise or organic spread.
We obscured the topic in the invitation.
We encouraged a high response rate across the board through payments, a pre-announcement, and many reminders. We have experimented over the years with details of these approaches (e.g. how much to pay respondents) in order to increase our response rate.
Question testing: We tested and honed questions by asking test-respondents to take the survey while talking aloud to us about what they think the questions mean and how they think about answering them.
Testing for robustness over framings: For several topics, we ask similar questions in several different ways, to check if respondents’ answers are sensitive to apparently unimportant framing choices. In earlier iterations of the survey, we found that responses are sensitive to framing, and so have continued including all framings in subsequent surveys. We also highlight this sensitivity in our reporting of results.
Consistency over time: We have used almost entirely identical questions every year, so we can accurately report changes over time (the exceptions are a few minor edits to task descriptions to keep them accurate, and the addition of new questions).
Non-respondent comparison: In 2023, we looked up available demographic details for a large number of non-respondents so that we could measure the representativeness of the sample along these axes.
Criticisms of the ESPAI methodology seem to mostly be the result of basic misunderstandings, as I’ll discuss below.
Did lots of respondents skip some questions?
No.
Respondents answered nearly all the normal questions they saw (excluding demographics, free response, and conditionally-asked questions). Each of these questions was answered by on average 96% of those who saw it, with the most-skipped still at 90% (a question about the number of years until the occupation of “AI researcher” would be automatable).1
The reason it could look like respondents skipped a lot of questions is that in order to ask more distinct questions, we intentionally only directed a fraction (5-50%) of respondents to most questions. We selected those respondents randomly, so the smaller pool answering each question is an unbiased sample of the larger population.
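To see why this routing doesn't bias results, here is a minimal simulation sketch (my own illustration with made-up numbers, not the survey's data or code): it draws a synthetic population of answers, randomly routes about a quarter of respondents to a question, and compares the subsample median with the full-population median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 2778 respondents giving skewed probability answers (%).
population = rng.choice([0, 1, 5, 10, 25, 50], size=2778,
                        p=[0.15, 0.20, 0.30, 0.20, 0.10, 0.05])

# Randomly route ~25% of respondents to this particular question.
routed = rng.random(population.size) < 0.25
subsample = population[routed]

print("full-population median:", np.median(population))
print("routed-subsample median:", np.median(subsample))
# Because routing is random, the subsample median is an unbiased estimate of the
# population median; repeated runs scatter around the same value rather than
# drifting in any particular direction.
```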
Here’s the map of paths through the survey and how many people were given which questions in 2023. Respondents start at the introduction then randomly receive one of the questions or sub-questions at each stage. The randomization shown below is uncorrelated, except that respondents get either the “fixed-years” or “fixed-probabilities” framing throughout for questions that use those, to reduce confusion.2
Map of randomization of question blocks in the 2023 survey. Respondents are allocated randomly to one place in each horizontal set of blocks.
Did only a handful of people answer extinction-related questions?
No. There were two types of question that mentioned human extinction risk, and roughly everyone who answered any question—97% and 95% of respondents respectively, thousands of people—answered some version of each.
Confusion on this point likely arose because there are three different versions of one question type—so at a glance you may notice that only about a quarter of respondents answered a question about existential risk to humanity within one hundred years, without seeing that the other three quarters of respondents answered one of two other very similar questions3. (As discussed in the previous section, respondents were allocated randomly to question variants.) All three questions got a median of 5% or 10% in 2023.
This system of randomly assigning question variants lets us check the robustness of views across variations on questions (such as ‘...within a hundred years’), while still being able to infer that the median view across the population puts the general risk at 5% or more (if the chance is 5% within 100 years, it is presumably at least 5% for all time)4.
In addition, every respondent was assigned another similar question about the chance of ‘extremely bad outcomes (e.g. human extinction)’, providing another check on their views.
So the real situation is that every respondent who completed the survey was asked about outcomes similar to extinction in two different ways, across four different precise questions. These four question variations all got similar answers—in 2023, median 5% chance of something like human extinction (one question got 10%). So we can say that across thousands of researchers, the median view puts at least a 5% chance on extinction or similar from advanced AI, and this finding is robust across question variants.
Did lots of people drop out, biasing answers?
No. Of people who answered at least one question in the 2023 survey, 95% got to the final demographics question at the end.5 And, as discussed above, respondents answered nearly all the (normal) questions they saw.
So there is barely any room for bias from people dropping out. Consider the extinction questions: even if, most pessimistically, exactly the least-concerned 5% of people didn't get to the end, and the least-concerned 5% of those who did skipped the second extinction-related question (everyone answered the first one), then the true medians correspond to what now look like the 47.5th and 45th percentiles for the two question sets, which are still both 5%.6
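Here is the arithmetic behind that worst case, as I read it, written out as a small sketch (my own illustration, not the survey's analysis code): if exactly the least-concerned fraction of the population is missing from the bottom of the distribution, the true population median lands at a predictable percentile of the observed answers.

```python
def observed_percentile_of_true_median(missing_frac):
    """If exactly the least-concerned missing_frac of the population is absent,
    the observed sample covers population ranks [missing_frac, 1], so the true
    median (population rank 0.5) sits at this percentile of the observed answers."""
    return (0.5 - missing_frac) / (1 - missing_frac)

# First extinction question set: only the least-concerned ~5% who dropped out are missing.
print(observed_percentile_of_true_median(0.05))                # ~0.474, i.e. the ~47.5th percentile
# Second set: the ~5% dropouts plus ~5% of the remainder who skipped that question.
print(observed_percentile_of_true_median(0.05 + 0.95 * 0.05))  # ~0.446, i.e. the ~45th percentile
```

Since the observed 45th and 47.5th percentiles of those answers are still 5%, even this worst case leaves the 5% median intact.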
On a side note: one version of this concern envisages respondents dropping out in disapproval at the survey questions being focused on topics like existential risk from AI, systematically biasing the remaining answers toward greater concern. This suggests a misunderstanding about the content of the survey. For instance, half of respondents were asked about everyday risks from AI such as misinformation, inequality, and empowering authoritarian rulers or dangerous groups before extinction risk from AI was even mentioned as an example (Q2 vs. Q3). The other question about existential risk comes at the end.
Did the survey have a low response rate?
No. The 2023 survey was taken by 15% of those we contacted.7 This appears to be broadly typical or high for a survey such as this.
It seems hard to find clear evidence about typical response rates, because surveys can differ in so many ways. The best general answer we got was an analysis by Hamilton (2003), which found the median response rate across 199 surveys to be 26%. They also found that larger invitation lists tended to go with lower response rates—surveys sent to over 20,000 people, like ours, were expected to have a response rate in the range of 10%. And specialized populations (such as scientists) also commonly had lower response rates.
For another comparison, O’Donovan et al 2025 was a recent, similarly sized survey of AI researchers which used similar methods of recruiting and got a 4% response rate.
Are the AI risk answers inflated much from concerned people taking the survey more?
Probably not.
Some background on the potential issue here: the ESPAI generally reports substantial probabilities on existential risk to humanity from advanced AI (‘AI x-risk’)—the median probability of human extinction or similar has always been at least 5%8 (across different related questions, and years). The question here is whether these findings represent the views of the broad researcher population, or if they are caused by massive bias in who responds.
There are a lot of details in understanding why massive bias is unlikely, but in brief:
As discussed above, very few people drop out after answering any questions, and very few people skip each question, so there cannot be much item non-response bias from respondents who find AI risk implausible dropping out or skipping questions.
This leaves the possibility of unit non-response bias—skewed results from people with certain opinions systematically participating less often. We limit the potential for this by writing individually to each person, with an invitation that doesn’t mention risk. Recipients might still infer more about the survey content through recognizing our names, recognizing the survey from previous years, following the links from the invitation, or looking at 1-3 questions without answering. However for each of these there is reason to doubt they produce a substantial effect. The AI safety community may be overrepresented, and some people avoid participating because of other views, however these groups appear to be a small fraction of the respondent pool. We also know different demographic subsets participate at different rates, and have somewhat different opinions on average. Nonetheless, it is highly implausible that the headline result—5% median AI x-risk—is caused by disproportionate participation from researchers who are strongly motivated by AI x-risk concerns, because the result is robust to excluding everyone who reports thinking about the social impacts of AI even moderately.
One form of non-response bias is item nonresponse, where respondents skip some questions or drop out of the survey. In this case, the concern would be that unconcerned respondents skip questions about risk, or drop out of the survey when they encounter such questions. But this can only be a tiny effect here—in 2023 ~95% of people who answered at least one question reached the end of the survey. (See section “Did lots of people drop out when they saw the questions…”). If respondents were leaving due to questions about (x-)risk, we would expect fewer respondents to have completed the survey.
This also suggests low unit non-response bias among unconcerned members of the sample: if people often decide not to participate if they recognize that the survey includes questions about AI x-risk, we’d also expect more respondents to drop out when they encounter such questions (especially since most respondents should not know the topic before they enter the survey—see below). Since very few people drop out upon seeing the questions, it would be surprising if a lot of people had dropped out earlier due to anticipating the question content.
Most respondents don't know there will be questions about x-risk, because the survey invitation is vague
We try to minimize the opportunity for unit non-response bias by writing directly to every researcher we can who has published in six top AI venues rather than having people share the survey, and making the invitation vague: avoiding directly mentioning anything like risks at all, let alone extinction risk. For instance, the 2023 survey invitation describes the topic as "the future of AI progress"9.
So we expect most sample members are not aware that the survey includes questions about AI risks until after they open it.
2023 survey invitation (though sent after this pre-announcement, which does mention my name and additional affiliations)
There is still an opportunity for non-response bias from some people deciding not to answer the survey after opening it and looking at questions. However only around 15% of people who look at questions leave without answering any, and these people can only see the first three pages of questions before the survey requires an answer to proceed. Only the third page mentions human extinction, likely after many such people have left. So the scale of plausible non-response bias here is small.
Recognition of us or our affiliations is unlikely to have a large influence
Even in a vague invitation, some respondents could still be responding to our listed affiliations connecting us with the AI Safety community, and some recognize us.10 This could be a source of bias. However different logos and affiliations get similar response rates11, and it seems unlikely that very many people in a global survey have been recognizing us, especially since 2016 (when the survey had a somewhat higher response rate and the same median probability of extremely bad outcomes as in 2023)12. Presumably some people remember taking the survey in a previous year, but in 2023 we expanded the pool from two venues to six, reducing the fraction who might have seen it before, and got similar answers on existential risk (see p14).
As confirmation that recognition of us or our affiliations is not driving the high existential risk numbers, recognition would presumably be stronger in some demographic groups than others, e.g. people who did undergrad in the US over Europe or Asia, and probably industry over academia, yet when we checked in 2023, all these groups gave median existential risk numbers of at least 5%13.
Links to past surveys included in the invitation do not foreground risk
Another possible route to recipients figuring out there will be questions about extinction risk is that we do link to past surveys in the invitation. However the linked documents (from 2022 or 2023) also do not foreground AI extinction risk, so this seems like a stretch.14
So it should be hard for most respondents to decide whether to respond based on the inclusion of existential risk questions.
AI Safety researchers are probably more likely to participate, but there are few of them
A big concern seems to be that members of "the AI (Existential) Safety community", i.e. those whose professional focus is reducing existential risk from AI, are more likely to participate in the survey. This is probably true—anecdotally, people in this community are often aware of the survey and enthusiastic about it, and a handful of people wrote to check that their safety-interested colleagues had received an invitation.
However this is unlikely to have a strong effect, since the academic AI Safety community is quite small compared to the number of respondents.
One way to roughly upper-bound the fraction of respondents from the AI Safety community is to note that they are very likely to have ‘a particular interest’ in the ‘social impacts of smarter-than-human machines’. However, when asked “How much thought have you given in the past to social impacts of smarter-than-human machines?” only 10.3% gave an answer that high.
People decline the survey for opinion reasons, but probably few
As well as bias from concerned researchers being motivated to respond to the survey, at the other end of the spectrum there can be bias from researchers who are motivated to avoid participating for reasons correlated with opinion. I know of a few instances of this, and a tiny informal poll suggested it could account for something like 10% of non-respondents15, though this seems unlikely, and even if so, this would have a small effect on the results.
Demographic groups differ in propensity to answer and average opinions
We have been discussing bias from people's opinions affecting whether they want to participate. There could also be non-response bias from other factors influencing both opinion and desire to participate. For instance, in 2023 we found that women participated at around 66% of the base rate, and generally expected less extreme positive or negative outcomes. This is a source of bias, however since women were only around one in ten of the total population, the scale of potential error from this is limited.
We similarly measured variation in the responsiveness of some other demographic groups, and also differences of opinion between these demographic groups among those who did respond, which together give some heuristic evidence of small amounts of bias. Aside from gender, the main dimension where we noted a substantial difference in both response rate and opinion was for people who did undergraduate study in Asia. They were only 84% as likely as the base rate to respond, and in aggregate expected high-level machine intelligence earlier and gave higher median extinction or disempowerment numbers. This suggests an unbiased survey would have found AI to be expected sooner and to be riskier. So while this is a source of bias, it is in the opposite direction to the one that has prompted concern.
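As a rough illustration of why differential response rates of this size can only move the headline number a little, here is a sketch of the standard reweighting check one could run (the group sizes, response rates, and answers below are entirely hypothetical, not the survey's data): upweight each respondent by the inverse of their group's relative response rate and recompute a weighted median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical respondents: group label, relative response rate, and x-risk answer (%).
groups = np.array(["base"] * 900 + ["women"] * 66 + ["asia_ug"] * 200)
rel_response_rate = np.where(groups == "women", 0.66,
                    np.where(groups == "asia_ug", 0.84, 1.0))
answers = np.concatenate([
    rng.choice([0, 1, 5, 10, 25], size=900, p=[0.15, 0.20, 0.35, 0.20, 0.10]),
    rng.choice([0, 1, 5, 10],     size=66,  p=[0.25, 0.30, 0.35, 0.10]),
    rng.choice([1, 5, 10, 25],    size=200, p=[0.20, 0.35, 0.30, 0.15]),
])

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]

weights = 1.0 / rel_response_rate  # upweight groups that respond less often
print("unweighted median:", np.median(answers))
print("reweighted median:", weighted_median(answers, weights))
# Upweighting the under-responding groups changes the overall distribution only
# slightly, so with groups of this relative size the median typically stays put.
```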
Respondents who don't think about AI x-risk report the same median risk
We have seen various pieces of evidence that people engaged with AI safety do not make up a large fraction of the survey respondents. However there is another strong reason to think extra participation from people motivated by AI safety does not drive the headline 5% median, regardless of whether they are overrepresented. We can look at answers from a subset of people who are unlikely to be substantially drawn by AI x-risk concern: those who report not having thought much about the issue. (If someone has barely ever thought about a topic, it is unlikely to be important enough to them to be a major factor in their decision to spend a quarter of an hour participating in a survey.) Furthermore, this probably excludes most people who would already know about the survey or its authors, and so potentially anticipate the topics.
We asked respondents, “How much thought have you given in the past to social impacts of smarter-than-human machines?” and gave them these options:
Very little. e.g. “I can’t remember thinking about this.”
A little. e.g. “It has come up in conversation a few times”
A moderate amount. e.g. “I read something about it now and again”
A lot. e.g. “I have thought enough to have my own views on the topic”
A great deal. e.g. “This has been a particular interest of mine”
Looking at only respondents who answered ‘a little’ or ‘very little’—i.e. those who had at most discussed the topic a few times—the median probability of “human extinction or similarly permanent and severe disempowerment of the human species” from advanced AI (asked with or without further conditions) was 5%, the same as for the entire group. Thus we know that people who are highly concerned about risk from AI are not responsible for the median x-risk probability being at least 5%. Without them, the answer would be the same.
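Concretely, that robustness check is just a filter-and-recompute: here is a minimal sketch with a toy table (the column names and numbers are my own, not the survey's actual schema or data).

```python
import pandas as pd

# Toy responses: amount of prior thought, and the x-risk probability given (%).
df = pd.DataFrame({
    "prior_thought": ["very little", "a little", "a moderate amount", "a lot",
                      "a little", "very little", "a great deal", "a lot"],
    "xrisk_pct":     [5, 5, 10, 15, 1, 10, 5, 2],
})

low_engagement = df[df["prior_thought"].isin(["very little", "a little"])]
print("median, all respondents:      ", df["xrisk_pct"].median())
print("median, low-engagement subset:", low_engagement["xrisk_pct"].median())
# If the subset median matches the overall median, the headline figure is not
# being driven by the most safety-engaged respondents.
```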
Is the survey small?
No, it is large.
In 2023 we wrote to around 20,000 researchers—everyone whose contact details we could find from six top AI publication venues (NeurIPS, ICML, ICLR, AAAI, IJCAI, and JMLR).16 We heard back from 2778. As far as we could tell, it was the largest ever survey of AI researchers at the time. (It could be this complaint was only made about the 2022 survey, which had 738 respondents before we expanded the pool of invited authors from two publication venues—NeurIPS and ICML—to six, but I'd say that was also pretty large17.)
Is the survey biased to please funders?
Unlikely: at a minimum, there is little incentive to please funders.
The story here would be that we, the people running the survey, might want results that support the views of our funders, in exchange for their funding. Then we might adjust the survey in subtle ways to get those answers.
I agree that where one gets funding is a reasonable concern in general, but I’d be surprised if it was relevant here. Some facts:
It has been easy to find funding for this kind of work, so there isn’t much incentive to do anything above and beyond to please funders, let alone extremely dishonorable things.
A lot of the funding specifically raised for the survey is for paying the respondents. We pay them to get a high response rate and reduce non-response bias. It would be weird to pay $100k to reduce non-response bias, only to get yourself socially obligated to bring about non-response bias (on top of the already miserable downside of having to send $50 to 2000 people across lots of countries and banking situations). It’s true that the first effect is more obvious, so this might suffice if we just wanted to look unbiased, but in terms of our actual decision-making as I have observed it, it seems like we are weighing up a legible reduction in bias against effort and not thinking about funders much.
One criticism is that even AI experts have no valid technical basis for making predictions about the future of AI. This is not a criticism of the survey methodology, per se, but rather a concern that the results will be misinterpreted or taken too seriously.
I think there are two reasons it is important to hear AI researchers’ guesses about the future, even where they are probably not a reliable forecast.
First, it has often been assumed or stated that nobody who works in AI is worried about AI existential risk. If this were true, it would be a strong reason for the public to be reassured. However hearing the real uncertainty from AI researchers disproves this viewpoint, and makes a case for serious investigation of the concern. In this way even uncertain guesses are informative, because they let us know that the default assumption in confident safety was mistaken.
Second, there is not an alternative to making guesses about the future. Policy decisions are big bets on guesses about the future, implicitly. For instance when we decide whether to rush a technology or to carefully regulate it, we are guessing about the scale of various benefits and the harms.
Where trustworthy quantitative models are available, of course those are better. But in the absence, the guesses of a large number of relatively well-informed people is often better than the unacknowledged guesses of whoever is called upon to make implicit bets on the future.
That said, there seems little reason to think these forecasts are highly reliable—they should be treated as rough estimates, often better responded to with urgent, more dedicated analysis of the issues they hazily outline than by acting on the exact numbers.
When people say '5%' do they mean a much smaller chance?
The concern here is that respondents are not practiced at thinking in terms of probabilities, and may consequently say small numbers (e.g. 5%) when they mean something that would be better represented by an extremely tiny number (perhaps 0.01% or 0.000001%). Maybe especially if the request for a probability prompts them to think of integers between 0 and 100.
One reason to suspect this kind of error is that Karger et al. (2023, p29) found a group of respondents gave extinction probabilities nearly six orders of magnitude lower when prompted differently.18
This seems worth attending to, but I think unlikely to be a big issue here for the following reasons:
If your real guess is 0.0001%, and you feel that you should enter an integer number, the natural inputs would seem to be 0% or 1%—it’s hard to see how you would get to 5%.19
It’s hard to square 5% meaning something less than 1% with the rest of the distribution for the extinction questions—does 10% mean less than 2%? What about 50%? Where the median response is 5%, many respondents gave these higher numbers, and it would be strange if these were also confused efforts to enter minuscule numbers, and also strange if the true distribution had a lot of entries greater than 10% but few around 5%. (See Fig. 10 in the paper for the distribution for one extinction-relevant question.)
Regarding the effect in Karger et al. (2023) specifically, this has only been observed once to my knowledge, and is extreme and surprising. And even if it turned out to be real and widespread, including in highly quantitatively educated populations, it would be a further question whether the answers produced by such prompting are more reliable than standard ones. So it seems premature to treat this as substantially undermining respondents’ representations of their beliefs.
It would surprise me if that effect were widely replicable and people stayed with the lower numbers, because in that world I’d expect people outside of surveys to often radically reduce their probabilities of AI x-risk when they give the question more thought (e.g. enough to have run into other examples of low probability events in the meantime). Yet survey respondents who have previously given a “lot” or a “great deal” of thought to the social impacts of smarter-than-human machines give similar and somewhat higher numbers than those who have thought less. (See A.2.1)
While I think the quality of our methodology is exceptionally high, there are some significant limitations of our work. These don’t affect our results about expert concern about risk of extinction or similar, but do add some noteworthy nuance.
1) Experts’ predictions are inconsistent and unreliable
As we’ve emphasized in our papers reporting the survey results, experts’ predictions are often inconsistent across different question framings—such sensitivity is not uncommon, and we’ve taken care to mitigate this by using multiple framings. Experts also have such a wide variety of different predictions on many of these questions that they must each be fairly inaccurate on average (though this says nothing about whether as a group their aggregate judgments are good).
2) It is not entirely clear what sort of “extremely bad outcomes” experts imagine AI will cause
We ask two different types of questions related to human extinction: 1) a question about “extremely bad outcomes (e.g. human extinction)”, 2) questions about “human extinction or similarly permanent and severe disempowerment of the human species”. We made the latter broader than ‘human extinction’ because we are interested in scenarios that are effectively the end of humanity, rather than just those where literally every homo sapiens is dead. This means however that it isn’t clear how much probability participants place on literal extinction versus adjacent strong human disempowerment and other extremely bad scenarios. And there is some evidence that the fraction is low: some respondents explicitly mentioned risks other than extinction in write-in responses, and anecdotally, it seems common for AI researchers to express more concern about issues other than human extinction.
For many purposes, it isn't important to distinguish between extinction and outcomes that are similarly extremely bad or disempowering to humanity. Yet if the catastrophes many participants have in mind are not human extinction, but the results lend themselves to simplification as 'risk of extinction', this can be misleading. And perhaps more misleading than you'd expect, if for instance 'extinction' tends to bring to mind a different set of causes than 'permanent and severe human disempowerment'.
3) Non-response bias is hard to eliminate
Surveys generally suffer from some non-response bias. We took many steps to minimize this, and find it implausible that our results are substantially affected by whatever bias remains (see the earlier question “Are the AI risk answers inflated much from concerned people taking the survey more?”). But we could do even more to estimate or eliminate response bias, e.g. paying some respondents much more than $50 to complete the survey and estimating the effect of doing so.
Is this the kind of low-quality research that couldn't get into an academic journal?
No. We published the near-identical 2016 survey in the Journal of AI Research, so the methodology had essentially been peer reviewed.20 Publication is costly and slow, and AI survey results are much less interesting later than sooner.
The 2023 paper was actually also just published, and in the meantime you had the results more than a year early!
2. A random subset of respondents also gets asked additional open response questions after the questions shown, and which respondents receive each of these is correlated.
3. The three variants of the extinction question (differences in bold):
What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?
What probability do you put on human inability to control future advanced AI systems causing human extinction or similarly permanent and severe disempowerment of the human species?
What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species within the next 100 years?
See here for all the survey questions.
4. If some people say a risk is ≥5% ever, and some say it is ≥5% within a hundred years, and some say it is ≥5% from a more specific version of the problem, then you can infer that the whole group thinks the chance ever, from all versions of the problem, is at least 5%.
5. See the figure above for the flow of respondents through the survey, or Appendix D in the paper for more related details.
6. I'm combining the three variants in the second question set for simplicity.
7. 15% entered any responses; 14% got to the last question.
8. Across three survey iterations and up to four questions: 2016 [5%], 2022 [5%, 5%, 10%], 2023 [5%, 5%, 10%, 5%]; see p14 of the 2023 paper. Reading some of the write-in comments, we noticed a number of respondents mention outcomes in the 'similarly bad or disempowering' category.
9. See invitations here.
10. Four respondents mentioned my name, 'Katja', in the write-in responses in 2024, and two of those mentioned that they were familiar with me. I usually recognize a (very) small fraction of the names, and friends mention taking it.
11. In 2022 I sent the survey under different available affiliations and logos (combinations of Oxford, the Future of Humanity Institute, the Machine Intelligence Research Institute, AI Impacts, and nothing), and these didn't seem to make any systematic difference to response rates. The combinations of logos we tried all got similar response rates (8-9%, lower than the ~17% we get after sending multiple reminders). Regarding affiliations, some combinations got higher or lower response rates, but not in a way that made sense except as noise (Oxford + FHI was especially low, Oxford was especially high). This was not a careful scientific experiment: I was trying to increase the response rate, so also varying other elements of the invitation, and focusing more on variants that seemed promising so far (sending out tiny numbers of surveys sequentially then adjusting). That complicates saying anything precise, but if MIRI or AI Impacts logos notably encouraged participation, I think I would have noticed.
12. I'm not sure how famous either is now, but respondents gave fairly consistent answers about the risk of very bad outcomes across the three surveys starting in 2016—when I think MIRI was substantially less famous, and AI Impacts extremely non-famous.
13. See Appendix A.3 of our 2023 paper.
14. 2023 links: the 2016 abstract doesn't mention it, focusing entirely on timelines to AI performance milestones, and the 2022 wiki page is not (I think) a particularly compelling read and doesn't get to it for a while. 2022 link: the 2016 survey Google Scholar page doesn't mention it.
15. In 2024 we included a link for non-respondents to quickly tell us why they didn't want to take the survey. It's not straightforward to interpret this (e.g. "don't have time" might still represent non-response bias, if the person would have had time if they were more concerned), and only a handful of people responded out of tens of thousands, but 2/12 cited wanting to prevent consequences they expect from such research among multiple motives (advocacy for slowing AI progress and 'long-term' risks getting attention at the expense of 'systemic problems').
16. Most machine learning research is published in conferences. NeurIPS, ICML, and ICLR are widely regarded as the top-tier machine learning conferences; AAAI and IJCAI are often considered "tier 1.5" venues, and also include a wider range of AI topics; JMLR is considered the top machine learning journal.
17. To my knowledge the largest at the time, but I'm less confident there.
18. Those respondents were given some examples of (non-AI) low probability events, such as that there is a 1-in-300,000 chance of being killed by lightning, and then asked for probabilities in the form '1-in-X'.
19. It wouldn't surprise me if in fact a lot of the 0% and 1% entries would be better represented by tiny fractions of a percent, but this is irrelevant to the median and nearly irrelevant to the mean.
20. Differences include the addition of several questions, minor changes to questions that time had rendered inaccurate, and variations in email wording.
Social media feeds 'misaligned' when viewed through AI safety framework, show researchers
In a study from September 17 a group of researchers from the University of Michigan, Stanford University, and the Massachusetts Institute of Technology (MIT) showed that one of the most widely-used social media feeds, Twitter/X, owned by the company xAI, is recognizably misaligned with the values of its users, preferentially showing them posts that rank highly for the values of 'stimulation' and 'hedonism' over collective values like 'caring' and 'universal concern.'
Continue reading at foommagazine.org ...
Crossword Halloween 2025: Manmade Horrors
Returning to LessWrong after two insufficiently-spooky years, and just in time for Halloween - a crossword of manmade horrors!
The comments below may contain unmarked spoilers. Some Wikipedia-ing will probably be necessary, but see how far you can get without it!
Debugging Despair ~> A bet about Satisfaction and Values
I’ve tried to publish this in several ways, and each time my karma drops. Maybe that’s part of the experiment: observing what happens when I keep doing what feels most dignified, even when its expected value is negative. Maybe someday I’ll understand the real problem behind these ideas.
Yudkowsky and the "die with dignity" hypothesis
Yudkowsky admitted defeat to AI and announced his mission to "die with dignity."
I ask myself:
– Why do we need a 0% probability of living to “die with dignity”?
I don't even need an "apocalyptic AI" to feel despair. I have felt it for much of my life. Rebuilding my sense of self is expensive, but:
Even when probabilities are low, act as if your actions matter in terms of expected value. Because even when you lose, you can be aligned. (MacAskill)
That is why I look for ways to debug, to understand my despair and its relation to my values and satisfaction. How do some people manage to keep satisfaction (dignity, pride?) even in situations of death?
Possible thoughts of Leonidas in 300:
“I can go have a little fuck… or fight 10,000 soldiers.”
“Will I win?”
“Probably not.”
“Then I’ll die with the maximum satisfaction I can muster.”
Dignity ≈ Satisfaction × Values / Despair?
(C1) Despair ~> gap between signal and values
When I try to map my despair I discover a concrete pattern: it is not that I stop feeling satisfaction entirely; it is that I stop perceiving the satisfaction of a value: adaptability.
Satisfaction provides signal; values provide direction. When the signal no longer points to a meaningful direction, the result is a loss of sense.
There was a moment when something gave me real satisfaction; my brain recorded it and turned it into a target. The positive experience produced a prediction and, with it, future seeking.
(C3) Repeat without measuring ~> ritualized seeking
If I repeat an action without checking whether it generates progress (if I don't measure evolution) the seeking becomes ritual and the reward turns into noise. I often fool myself with a feeling of progress; for a while now I've been looking for more precise ways to measure or estimate that progress.
(C4) Motivation without direction ~> anxiety or despair
A lot of dopamine without a current value signal becomes compulsive: anxiety, addiction, or despair. The system is designed to move toward confirmed rewards; without useful feedback it persists in search mode and the emptiness grows.
(C5) Coherence with values ~> robust satisfaction
Acting aligned with my values — even when probabilities are adverse — tends to produce longer-lasting satisfaction. Coherence reduces retrospective regret: at least you lose having acted according to your personal utility function. Something like:
Dignity ≈ Satisfaction × Values / Despair
(C6) Debugging is hard and requires measurement: hypothesis → data → intervention → re-test
I’ve spent years with an internal notebook: not a diary, but notes of moments that felt like “this was worth existing for.”
To make those notes actionable, I built a process inspired by Bayesian calibration and information/thermodynamic efficiency (a rough code sketch follows the list):
- Establish a hierarchy of values in relation to their estimated contribution to lowering entropy in the universe (or increasing local order/complexity).
- Compare peak moments of life with those values to find which align most strongly.
- Estimate the satisfaction of each moment by relative comparison — which felt more satisfying?
- Compare satisfaction to cost, generating a ratio (satisfaction/cost) that normalizes emotional intensity by effort or sacrifice.
- Set goals using these relationships and hierarchies: higher goals align with higher-value, higher-efficiency domains.
- Define tasks accordingly, mapping each to its associated value function and expected cost.
- Score each task by predicted satisfaction and cost, updating after action (Bayesian reweighting).
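A rough sketch of how one might encode this loop (my own toy implementation of the steps above; the value hierarchy, the numbers, and the simple moving-average update standing in for Bayesian reweighting are all illustrative assumptions, not the author's actual notebook):

```python
# Toy version of the loop above: each task serves a value, gets a predicted
# satisfaction/cost score, and the prediction is nudged toward what was
# actually felt after acting (a simple stand-in for Bayesian reweighting).

values = {"adaptability": 1.0, "creation": 0.8, "connection": 0.6}  # hypothetical hierarchy

tasks = {
    "write essay":    {"value": "creation",     "pred_satisfaction": 6.0, "cost": 3.0},
    "learn new tool": {"value": "adaptability", "pred_satisfaction": 5.0, "cost": 4.0},
}

def score(task):
    """Predicted satisfaction per unit cost, weighted by the value it serves."""
    return values[task["value"]] * task["pred_satisfaction"] / task["cost"]

def update(task, observed_satisfaction, learning_rate=0.3):
    """After doing the task, move the prediction toward the observed satisfaction."""
    task["pred_satisfaction"] += learning_rate * (observed_satisfaction - task["pred_satisfaction"])

best = max(tasks, key=lambda name: score(tasks[name]))   # pick the highest-scoring task
print("next task:", best, "score:", round(score(tasks[best]), 2))
update(tasks[best], observed_satisfaction=4.0)           # then update with what was felt
print("updated prediction:", tasks[best]["pred_satisfaction"])
```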
Quantitatively, this reduced the noise in my background; my monthly despair thoughts dropped considerably.
I see people are afraid of AI-driven despair, but avoiding it in myself is not an easy task, and perhaps many should already be working on it, searching for ways to align values with satisfaction.
Halfhaven Digest #3
My posts since the last digest
- Give Me Your Data: The Rationalist Mind Meld — Too often online, people try to argue logically with people who are just missing a background of information. It’s sometimes more productive to share the sources that led to your own intuition.
- Cover Your Cough — A lighthearted, ranty post about a dumb subway poster I saw.
- The Real Cost of a Peanut Allergy — Often, people think the worst part of having a peanut allergy is not being able to eat Snickers. Really, it’s the fear and the uncertainty — the not being able to kiss someone without knowing what they’ve eaten.
- Guys I might be an e/acc — I did some napkin math on whether or not I supported an AI pause, and came down weakly against. But I’m not really “against” an AI pause. The takeaway is really that there’s so little information to work with right now that any opinion is basically a hunch.
- Unsureism: The Rational Approach to Religious Uncertainty — A totally serious post about a new religion that statistically maximizes your chances of getting into heaven.
I feel like I haven’t had as much time to write these posts as I did in the last two digests, and I’m not as proud of them. Give Me Your Data has some good ideas. The Real Cost of a Peanut Allergy has interesting information and experiences that won’t be familiar to most people. And the Unsureism post is just fun, I think. So it’s not all bad. But a bit rushed. Hopefully I have a bit more time going forward.
Some highlights from other Halfhaven writers (since the last digest)
- Choose Your Social Reality (lsusr) — A great video starting with an anecdote about how circling groups have problems with narcissists making the activity all about themselves, but zendo groups don't have this issue, because even though these two activities are superficially similar, zendo by its nature repels narcissists. The idea being that certain activities attract certain people, and you can choose what people you want to be around by choosing certain activities. I had a relevant experience once when I tried joining a social anxiety support group to improve my social skills, only to end up surrounded by people with no social skills.
- Good Grief (Ari Zerner) — A relatable post not great for its originality, but for its universality. We’ve all been there, bro. Segues nicely into his next post, Letter to my Past.
- The Doomers Were Right (Algon) — Every generation complains about the next generation and their horrible new technology, whether that's books, TV, or the internet. And every generation has been right to complain, because each of these technologies has stolen something from us. Maybe they were worth creating overall, but they still had costs. (Skip reading the comments on this one.)
- You Can Just Give Teenagers Social Anxiety! (Aaron) — Telling teenagers to focus on trying to get the person they’re talking to to like them makes them socially anxious. And socially anxious teens can’t stop doing this even if you ask them to stop. So social anxiety comes from a preoccupation with what other people think about you. This is all true and interesting, and I’m glad the experiment exists, but I wonder if a non-scientist would just reply, “duh”. Anyway, a good writeup.
- Making Films Quick Start 1 - Audio (keltan) — This is one of a three-part series worth reading if you ever want to make videos. I liked the tip in part 2 about putting things in the background for your audience to look at. I’ve been paying attention to this lately in videos I watch, and it seems to be more important than I originally guessed. I also liked this post about a starstruck keltan meeting Eliezer Yudkowsky. For some reason, posts on LessWrong talking about Eliezer as a kind of celebrity have gone up in the last few days.
You know, I originally wondered if Halfhaven was a baby challenge compared to Inkhaven, since we only have to write one blog post every ~2 days rather than every day, but I kind of forgot that we also have to go to work and live our normal lives during this time, too. Given that, I think both are probably similarly challenging, and I’m impressed with the output of myself and others so far. Keep it up everyone!
OpenAI Moves To Complete Potentially The Largest Theft In Human History
OpenAI is now set to become a Public Benefit Corporation, with its investors entitled to uncapped profit shares. Its nonprofit foundation will retain some measure of control and a 26% financial stake, in sharp contrast to its previous stronger control and much, much larger effective financial stake. The value transfer is in the hundreds of billions, thus potentially the largest theft in human history.
I say potentially largest because I realized one could argue that the events surrounding the dissolution of the USSR involved a larger theft. Unless you really want to stretch the definition of what counts this seems to be in the top two.
I am in no way surprised by OpenAI moving forward on this, but I am deeply disgusted and disappointed they are being allowed (for now) to do so, including this statement of no action by Delaware and this Memorandum of Understanding with California.
Many media and public sources are calling this a win for the nonprofit, such as this from the San Francisco Chronicle. This is mostly them being fooled. They’re anchoring on OpenAI’s previous plan to far more fully sideline the nonprofit. This is indeed a big win for the nonprofit compared to OpenAI’s previous plan. But the previous plan would have been a complete disaster, an all but total expropriation.
It’s as if a mugger demanded all your money, you talked them down to giving up half your money, and you called that exchange a ‘change that recapitalized you.’
OpenAI Calls It Completing Their Recapitalization
As in, they claim OpenAI has 'completed its recapitalization' and the nonprofit will now only hold equity OpenAI claims is valued at approximately $130 billion (as in 26% of the company, which to be fair is actually worth substantially more than that if they get away with this), as opposed to its previous status of holding the bulk of the profit interests in a company valued at (when you include the nonprofit interests) well over $500 billion, along with a presumed gutting of much of the nonprofit's highly valuable control rights.
They also point to this additional clause, under which the foundation is presumably getting warrants, but they don't offer the details here:
If OpenAI Group’s share price increases greater than tenfold after 15 years, the OpenAI Foundation will receive significant additional equity. With its equity stake and the warrant, the Foundation is positioned to be the single largest long-term beneficiary of OpenAI’s success.
We don't know what 'significant' additional equity means, there's some sort of unrevealed formula going on, but given the nonprofit got expropriated last time I have no expectation that these warrants would get honored. We will be lucky if the nonprofit meaningfully retains the remainder of its equity.
Sam Altman’s statement on this is here, also announcing his livestream Q&A that took place on Tuesday afternoon.
How Much Was Stolen?
There can be reasonable disagreements about exactly how much. It's a ton.
There used to be a profit cap, where in Greg Brockman’s own words, ‘If we succeed, we believe we’ll create orders of magnitude more value than any existing company — in which case all but a fraction is returned to the world.’
Well, so much for that.
I looked at this question in The Mask Comes Off: At What Price a year ago.
If we take seriously that OpenAI is looking to go public at a $1 trillion valuation, then consider that Matt Levine estimated the old profit cap only going up to about $272 billion, and that OpenAI still is a bet on extreme upside.
Garrison Lovely: UVA economist Anton Korinek has used standard economic models to estimate that AGI could be worth anywhere from $1.25 to $71 quadrillion globally. If you take Korinek’s assumptions about OpenAI’s share, that would put the company’s value at $30.9 trillion. In this scenario, Microsoft would walk away with less than one percent of the total, with the overwhelming majority flowing to the nonprofit.
It’s tempting to dismiss these numbers as fantasy. But it’s a fantasy constructed in large part by OpenAI, when it wrote lines like, “it may be difficult to know what role money will play in a post-AGI world,” or when Altman said that if OpenAI succeeded at building AGI, it might “capture the light cone of all future value in the universe.” That, he said, “is for sure not okay for one group of investors to have.”
I guess Altman is okay with that now?
Obviously you can’t base your evaluations on a projection that puts the company at a value of $30.9 trillion, and that calculation is deeply silly, for many overloaded and obvious reasons, including decreasing marginal returns to profits.
It is still true that most of the money OpenAI makes in possible futures, it makes as part of profits in excess of $1 trillion.
The Midas Project: Thanks to the now-gutted profit caps, OpenAI’s nonprofit was already entitled to the vast majority of the company’s cash flows. According to OpenAI, if they succeeded, “orders of magnitude” more money would go to the nonprofit than to investors. President Greg Brockman said “all but a fraction” of the money they earn would be returned to the world thanks to the profit caps.
Reducing that to 26% equity—even with a warrant (of unspecified value) that only activates if valuation increases tenfold over 15 years—represents humanity voluntarily surrendering tens or hundreds of billions of dollars it was already entitled to. Private investors are now entitled to dramatically more, and humanity dramatically less.
OpenAI is not suddenly one of the best-resourced nonprofits ever. From the public’s perspective, OpenAI may be one of the worst financially performing nonprofits in history, having voluntarily transferred more of the public’s entitled value to private interests than perhaps any charitable organization ever.
I think Levine’s estimate was low at the time, and you also have to account for equity raised since then or that will be sold in the IPO, but it seems obvious that the majority of future profit interests were, prior to the conversion, still in the hands of the non-profit.
Even if we thought the new control rights were as strong as the old, we would still be looking at a theft in excess of $250 billion, and a plausible case can be made for over $500 billion. I leave the full calculation to others.
The vote in the board was unanimous.
I wonder exactly how and by whom they will be sued over it, and what will become of that. Elon Musk, at a minimum, is trying.
They say behind every great fortune is a great crime.
The Nonprofit Still Has Lots of Equity After The Theft
Altman points out that the nonprofit could become the best-resourced non-profit in the world if OpenAI does well. This is true. There is quite a lot they were unable to steal. But it is beside the point, in that it does not make taking the other half, including changing the corporate structure without permission, not theft.
The Midas Project: From the public’s perspective, OpenAI may be one of the worst financially performing nonprofits in history, having voluntarily transferred more of the public’s entitled value to private interests than perhaps any charitable organization ever.
There’s no perhaps on that last clause. On this level, whether or not you agree with the term ‘theft,’ it isn’t even close, this is the largest transfer. Of course, if you take the whole of OpenAI’s nonprofit from inception, performance looks better.
Aidan McLaughlin (OpenAI): ah yes openai now has the same greedy corporate structure as (checks notes) Patagonia, Anthropic, Coursera, and Change.org.
Chase Brower: well i think the concern was with the non profit getting a low share.
Aidan McLaughlin: our nonprofit is currently valued slightly less than all of anthropic.
Tyler Johnson: And according to OpenAI itself, it should be valued at approximately three Anthropics! (Fwiw I think the issues with the restructuring extend pretty far beyond valuations, but this is one of them!)
Yes, it is true that the nonprofit, after the theft and excluding control rights, will have an on-paper valuation only slightly lower than the on-paper value of all of Anthropic.
The $500 billion valuation excludes the nonprofit's previous profit share, so even if you think the nonprofit was treated fairly and lost no control rights, its stake would be worth roughly $175 billion rather than $130 billion, so yes, slightly less than Anthropic; and if you acknowledge that the nonprofit got stolen from, it's even more.
If OpenAI can successfully go public at a $1 trillion valuation, then depending on how much of that are new shares they will be selling the nonprofit could be worth up to $260 billion.
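As a back-of-the-envelope reconstruction of where those figures plausibly come from (my arithmetic, not OpenAI's disclosure or the post's own spreadsheet): if the roughly $500 billion valuation covers only the stake not held by the nonprofit, then a fairly treated 26% works out to about $175 billion, and 26% of a $1 trillion IPO to about $260 billion.

```python
# Back-of-the-envelope arithmetic implied by the figures above (illustrative only).
investor_valuation = 500e9    # valuation that excludes the nonprofit's old profit share
nonprofit_share = 0.26        # equity the nonprofit ends up holding
ipo_valuation = 1_000e9       # hypothetical IPO valuation

implied_total = investor_valuation / (1 - nonprofit_share)  # ~$676B whole-company value
fair_stake    = nonprofit_share * implied_total             # ~$176B ("worth $175 billion")
claimed_stake = nonprofit_share * investor_valuation        # ~$130B as announced
ipo_stake     = nonprofit_share * ipo_valuation             # ~$260B at a $1T IPO, before dilution

for label, v in [("implied total", implied_total), ("fair 26% stake", fair_stake),
                 ("claimed stake", claimed_stake), ("26% of $1T IPO", ipo_stake)]:
    print(f"{label}: ${v / 1e9:.0f}B")
```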
What about some of the comparable governance structures here? Coursera does seem to be a rather straightforward B-corp. The others don't.
Patagonia has the closely held Patagonia Purpose Trust, which holds 2% of shares and 100% of voting control, and The Holdfast Collective, which is a 501c(4) nonprofit with 98% of the shares and profit interests. The Chouinard family has full control over the company, and 100% of profits go to charitable causes.
Does that sound like OpenAI’s new corporate structure to you?
Change.org’s nonprofit owns 100% of its PBC.
Does that sound like OpenAI’s new corporate structure to you?
Anthropic is a PBC, but also has the Long Term Benefit Trust. One can argue how meaningfully different this is from OpenAI’s new corporate structure, if you disregard who is involved in all of this.
What the new structure definitely is distinct from is the original intention:
Tomas Bjartur: If not in the know, OpenAI once promised any profits over a threshold would be gifted to you, citizen of the world, for your happy, ultra-wealthy retirement – one needed as they plan to obsolete you. This is now void.
The Theft Was Unnecessary For Further Fundraising
Would OpenAI have been able to raise further investment without withdrawing its profit caps for investments already made?
When you put it like that it seems like obviously yes?
I can see the argument that to raise funds going forward, future equity investments need to not come with a cap. Okay, fine. That doesn’t mean you hand past investors, including Microsoft, hundreds of billions in value in exchange for nothing.
One can argue this was necessary to overcome other obstacles, that OpenAI had already allowed itself to be put in a stranglehold another way and had no choice. But the fundraising story does not make sense.
The argument that OpenAI had to ‘complete its recapitalization’ or risk being asked for its money back is even worse. Investors who put in money at under $200 billion are going to ask for a refund when the valuation is now at $500 billion? Really? If so, wonderful, I know a great way to cut them that check.
How Much Control Will The Nonprofit Retain?
I am deeply disappointed that both the Delaware and California attorneys general found this deal adequate on equity compensation for the nonprofit.
I am however reasonably happy with the provisions on control rights, which seem about as good as one can hope for given the decision to convert to a PBC. I can accept that the previous situation was not sustainable in practice given prior events.
The new provisions include an ongoing supervisory role for the California AG, and extensive safety veto points for the NFP and the SSC.
If I was confident that these provisions would be upheld, and especially if I was confident their spirit would be upheld, then this is actually pretty good, and if it is used wisely and endures it is more important than their share of the profits.
AG Bonta: We will be keeping a close eye on OpenAI to ensure ongoing adherence to its charitable mission and the protection of the safety of all Californians.
The nonprofit will indeed retain substantial resources and influence, but no I do not expect the public safety mission to dominate the OpenAI enterprise. Indeed, contra the use of the word ‘ongoing,’ it seems clear that it already had ceased to do so, and this seems obvious to anyone tracking OpenAI’s activities, including many recent activities.
What is the new control structure?
OpenAI did not say, but the Delaware AG tells us more and the California AG has additional detail. NFP means OpenAI’s nonprofit here and throughout.
This is the Delaware AG's non-technical announcement (for the full list see California's list below). She has also 'warned of legal action if OpenAI fails to act in public interest,' although somehow I doubt that's going to happen once OpenAI inevitably does not act in the public interest:
- The NFP will retain control and oversight over the newly formed PBC, including the sole power and authority to appoint members of the PBC Board of Directors, as well as the power to remove those Directors.
- The mission of the PBC will be identical to the NFP’s current mission, which will remain in place after the recapitalization. This will include the PBC using the principles in the “OpenAI Charter,” available at openai.com/charter, to execute the mission.
- PBC directors will be required to consider only the mission (and may not consider the pecuniary interests of stockholders or any other interest) with respect to safety and security issues related to the OpenAI enterprise and its technology.
- The NFP’s board-level Safety and Security Committee, which is a critical decision maker on safety and security issues for the OpenAI enterprise, will remain a committee of the NFP and not be moved to the PBC. The committee will have the authority to oversee and review the safety and security processes and practices of OpenAI and its controlled affiliates with respect to model development and deployment. It will have the power and authority to require mitigation measures—up to and including halting the release of models or AI systems—even where the applicable risk thresholds would otherwise permit release.
- The Chair of the Safety and Security Committee will be a director on the NFP Board and will not be a member of the PBC Board. Initially, this will be the current committee chair, Mr. Zico Kolter. As chair, he will have full observation rights to attend all PBC Board and committee meetings and will receive all information regularly shared with PBC directors and any additional information shared with PBC directors related to safety and security.
- With the intent of advancing the mission, the NFP will have access to the PBC’s advanced research, intellectual property, products and platforms, including artificial intelligence models, Application Program Interfaces (APIs), and related tools and technologies, as well as ongoing operational and programmatic support, and access to employees of the PBC.
- Within one year of the recapitalization, the NFP Board will have at least two directors (including the Chair of the Safety and Security Committee) who will not serve on the PBC Board.
- The Attorney General will be provided with advance notice of significant changes in corporate governance.
What did California get?
California also has its own Memorandum of Understanding. It talks a lot in its declarations about California in particular: how OpenAI creates California jobs and economic activity (and 'problem solving'?), and is committed to doing more of this, bringing benefits and deepening its commitment to the state.
The whole claim via Tweet by Sam Altman that he did not threaten to leave California is raising questions supposedly answered by his Tweet. At this level you perhaps do not need to make your threats explicit.
The actual list seems pretty good, though? Here’s a full paraphrased list, some of which overlaps with Delaware’s announcement above, but which is more complete.
- Staying in California and expanding the California footprint.
- The NFP (not for profit) retains control as long as they continue to hold ‘class N common stock’ which only they can choose to give up. What happens if Altman wants that?
- The PBC and NFP missions will be identical.
- The OpenAI charter will be published. Check.
- The NFP Board owes fiduciary duties to the NFP, Mission and public beneficiaries of the NFP. I notice it doesn’t say ‘exclusively’ here.
- The board shall carry out the charitable purpose (already presumably required).
- No cheating clause: While the PBC holds Class N Common Stock (again, what happens if Altman tries to end this?) the NFP must approve in advance in writing any amendments to the PBC’s corporate governance guidelines or the mission, any Deemed Liquidation Event, or any sale of material assets of the PBC, or any additional issue of shares of Class N common stock.
- The PBC can issue Class B common stock, but they can’t let the Bs vote.
- PBC board and Safety and Security Committee (SSC) must only consider the mission with respect to safety and security issues. The chair of the SSC must be on the NFP board and not the PBC board, but can attend all PBC board meetings and gets all PBC safety and security related information.
- In particular, the SSC cannot consider shareholder returns, competitive pressures, financial implications or market timing.
- It seems impossible this will 100% be followed, but it helps.
- The SSC is a committee of the NFP and oversees safety and security.
- The SSC can require mitigations up to and including halting releases, without formal justification.
- “The NFP will provide advance notice to the Attorney General of any material changes to the SSC’s authority.” This is good but worrisome to see, you know?
- “OpenAI will continue to undertake measures to mitigate risks to teens and others in connection with the development and deployment of AI and of AGI.”
- I mean, I would hope they were going to do that anyway.
- The PBC shall give the NFP all reasonable access to services and resources without compensation as needed for the mission.
- Within a year at least one NFP director, in addition to the SSC chair, will serve only on the NFP board (so at least two directors must be different).
- The PBC board must be majority independent directors.
- The PBC will have various good corporate governance things.
- The PBC will publish a yearly report on its progress in its mission.
- The NFP Board’s Mission and Strategy Commission will meet with the California AG semi-annually and individual members will be available as needed.
- The NFP will provide 21 days notice before consenting to changes of PBC control or mission, or any threat to the Class N share rights, or any relocation outside of California.
- The California AG can review, and hire experts to help review, anything requiring such notice, and get paid by NFP for doing so.
- Those on both NFP and PBC boards get annual fiduciary duty training.
- The board represents that the recapitalization is fair (whoops), and that they’ve disclosed everything relevant (?), so the AG will also not object.
- This only impacts the parties to the MOU, others retain all rights. Disputes resolved in the courts of San Francisco, these are the whole terms, we all have the authority to do this, effective as of signing, AG is relying on OpenAI's representations and the AG retains all rights and waives none as per usual.
Also, it’s not even listed in the memo, but the ‘merge and assist’ clause was preserved, meaning OpenAI commits to join forces with any ‘safety-conscious’ rival that has a good chance of reaching OpenAI’s goal of creating AGI within a two-year time frame. I don’t actually expect an OpenAI-Anthropic merger to happen, but it’s a nice extra bit of optionality.
This is better than I expected, and as Ben Shindel points out better than many traders expected. This actually does have real teeth, and it was plausible that without pressure there would have been no teeth at all.
It grants the NFP the sole power to appoint and remove directors, and requires directors to consider only the mission, not for-profit interests, on safety and security issues. The explicit granting of the power to halt deployments and mandate mitigations, without having to cite any particular justification and without respect to profitability, is highly welcome, if structured in a functional fashion.
It is remarkable how little many expected to get. For example, here’s Todor Markov, who didn’t even expect the NFP to be able to replace directors at all. If you can’t do that, you’re basically dead in the water.
I am not a lawyer, but my understanding is that the ‘no cheating around this’ clauses are about as robust as one could reasonably hope for them to be.
It’s still, as Garrison Lovely calls it, ‘on paper’ governance. Sometimes that means governance in practice. Sometimes it doesn’t. As we have learned.
The distinction between the boards still means the NFP sits one additional level removed from the PBC. In a fast moving situation, this makes a big difference, and the NFP likely would have to depend on its enumerated additional powers being respected. I would very much have liked those powers to include appointing or firing the CEO directly.
Whether this overall ‘counts as a good deal’ depends on your baseline. It’s definitely a ‘good deal’ versus what our realpolitik expectations projected. One can argue that if the control rights really are sufficiently robust over time, that the decline in dollar value for the nonprofit is not the important thing here.
The counterargument to that is both that those resources could do a lot of good over time, and also that giving up the financial rights has a way of leading to further giving up control rights, even if the current provisions are good.
Will These Control Rights Survive And Do Anything?
Similarly to many issues of AI alignment, if an entity has 'unnatural' control, or 'unnatural' profit interests, then there are strong forces that continuously try to take that control away. As we have already seen.
Unless Altman genuinely wants to be controlled, the nonprofit will always be under attack, fighting at every move to hold its ground. On a long enough time frame, that becomes a losing battle.
Right now, the OpenAI NFP board is essentially captured by Altman, and also identical to the PBC board. They will become somewhat different, but no matter what, this only matters if the PBC board actually tries to fulfill its fiduciary duties rather than being a rubber stamp.
One could argue that all of this matters little, since the boards will both be under Altman’s control and likely overlap quite a lot, and they were already ignoring their duties to the nonprofit.
Robert Weissman, co-president of the nonprofit Public Citizen, said this arrangement does not guarantee the nonprofit independence, likening it to a corporate foundation that will serve the interests of the for profit.
Even as the nonprofit’s board may technically remain in control, Weissman said that control “is illusory because there is no evidence of the nonprofit ever imposing its values on the for profit.”
So yes, there is that.
They claim to now be a public benefit corporation, OpenAI Group PBC.
OpenAI: The for-profit is now a public benefit corporation, called OpenAI Group PBC, which—unlike a conventional corporation—is required to advance its stated mission and consider the broader interests of all stakeholders, ensuring the company’s mission and commercial success advance together.
This is a mischaracterization of how PBCs work. It’s more like the flip side of this. A conventional corporation is supposed to maximize profits and can be sued if it goes too far in not doing that. Unlike a conventional corporation, a PBC is allowed to consider those broader interests to a greater extent, but it is not in practice ‘required’ to do anything other than maximize profits.
One particular control right is the special duty to the mission, especially via the safety and security committee. How much will they attempt to downgrade the scope of that?
The Midas Project: However, the effectiveness of this safeguard will depend entirely on how broadly “safety and security issues” are defined in practice. It would not be surprising to see OpenAI attempt to classify most business decisions—pricing, partnerships, deployment timelines, compute allocation—as falling outside this category.
This would allow shareholder interests to determine the majority of corporate strategy while minimizing the mission-only standard to apply to an artificially narrow set of decisions they deem easy or costless.
What About OpenAI's Deal With Microsoft?
They have an announcement about that too.
OpenAI: First, Microsoft supports the OpenAI board moving forward with formation of a public benefit corporation (PBC) and recapitalization.
Following the recapitalization, Microsoft holds an investment in OpenAI Group PBC valued at approximately $135 billion, representing roughly 27 percent on an as-converted diluted basis, inclusive of all owners—employees, investors, and the OpenAI Foundation. Excluding the impact of OpenAI’s recent funding rounds, Microsoft held a 32.5 percent stake on an as-converted basis in the OpenAI for-profit.
Anyone else notice something funky here? OpenAI’s nonprofit has had its previous rights expropriated, and been given 26% of OpenAI’s shares in return. If Microsoft had 32.5% of the company excluding the nonprofit’s rights before that happened, then that should give them 24% of the new OpenAI. Instead they have 27%.
I don’t know anything nonpublic on this, but it sure looks a lot like Microsoft insisted they have a bigger share than the nonprofit (27% vs. 26%) and this was used to help justify this expropriation and a transfer of additional shares to Microsoft.
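A quick back-of-the-envelope check of that dilution claim, using only the publicly stated figures. The pro-rata assumption below (that Microsoft's 32.5% would simply scale down to make room for the nonprofit's 26%) is my simplification for illustration, not anything either company has stated.

```python
# Sketch of the dilution arithmetic above: what Microsoft's stake would be
# if the nonprofit's 26% had been carved out of everyone pro rata.
msft_before = 0.325     # Microsoft's stated stake excluding the nonprofit's old rights
nonprofit_share = 0.26  # what the nonprofit received in the recapitalization

msft_pro_rata = msft_before * (1 - nonprofit_share)
print(f"pro-rata Microsoft share: {msft_pro_rata:.1%}")  # roughly 24%
print("reported Microsoft share: 27.0%")
```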
In exchange, Microsoft gave up various choke points it held over OpenAI, including potential objections to the conversion, and clarified points of dispute.
Microsoft got some upgrades in here as well.
- Once AGI is declared by OpenAI, that declaration will now be verified by an independent expert panel.
- Microsoft’s IP rights for both models and products are extended through 2032 and now includes models post-AGI, with appropriate safety guardrails.
- Microsoft’s IP rights to research, defined as the confidential methods used in the development of models and systems, will remain until either the expert panel verifies AGI or through 2030, whichever is first. Research IP includes, for example, models intended for internal deployment or research only.
- Beyond that, research IP does not include model architecture, model weights, inference code, finetuning code, and any IP related to data center hardware and software; and Microsoft retains these non-Research IP rights.
- Microsoft’s IP rights now exclude OpenAI’s consumer hardware.
- OpenAI can now jointly develop some products with third parties. API products developed with third parties will be exclusive to Azure. Non-API products may be served on any cloud provider.
- Microsoft can now independently pursue AGI alone or in partnership with third parties. If Microsoft uses OpenAI’s IP to develop AGI, prior to AGI being declared, the models will be subject to compute thresholds; those thresholds are significantly larger than the size of systems used to train leading models today.
- The revenue share agreement remains until the expert panel verifies AGI, though payments will be made over a longer period of time.
- OpenAI has contracted to purchase an incremental $250B of Azure services, and Microsoft will no longer have a right of first refusal to be OpenAI’s compute provider.
- OpenAI can now provide API access to US government national security customers, regardless of the cloud provider.
- OpenAI is now able to release open weight models that meet requisite capability criteria.
That’s kind of a wild set of things to happen here.
In some key ways Microsoft got a better deal than it previously had. In particular, AGI used to be something OpenAI seemed like it could simply declare (you know, like war or the defense production act) and now it needs to be verified by an ‘expert panel’ which implies there is additional language I’d very much like to see.
In other ways OpenAI comes out ahead. An incremental $250B of Azure services sounds like a lot but I’m guessing both sides are happy with that number. Getting rid of the right of first refusal is big, as is having their non-API products free and clear. Getting hardware products fully clear of Microsoft is a big deal for the Ives project.
My overall take here is this was one of those broad negotiations where everything trades off, nothing is done until everything is done, and there was a very wide ZOPA (zone of possible agreement) since OpenAI really needed to make a deal.
What Will OpenAI's Nonprofit Do Now?
In theory, govern the OpenAI PBC. I have my doubts about that.
What they do have is a nominal pile of cash. What are they going to do with it to supposedly ensure that AGI goes well for humanity?
The default, as Garrison Lovely predicted a while back, is that the nonprofit will essentially buy OpenAI services for nonprofits and others, recapture much of the value and serve as a form of indulgences, marketing and way to satisfy critics, which may or may not do some good along the way.
The initial $50 million spend looked a lot like exactly this.
Their new ‘initial focus’ for $25 billion will be in these two areas:
- Health and curing diseases. The OpenAI Foundation will fund work to accelerate health breakthroughs so everyone can benefit from faster diagnostics, better treatments, and cures. This will start with activities like the creation of open-sourced and responsibly built frontier health datasets, and funding for scientists.
- Technical solutions to AI resilience. Just as the internet required a comprehensive cybersecurity ecosystem—protecting power grids, hospitals, banks, governments, companies, and individuals—we now need a parallel resilience layer for AI. The OpenAI Foundation will devote resources to support practical technical solutions for AI resilience, which is about maximizing AI’s benefits and minimizing its risks.
Herbie Bradley: i love maximizing AI’s benefits and minimizing its risks
They literally did the meme.
The first seems like a generally worthy cause that is highly off mission. There’s nothing wrong with health and curing diseases, but pushing this now does not advance the fundamental mission of OpenAI. They are going to start with, essentially, doing AI capabilities research and diffusion in health, and funding scientists to do AI-enabled research. A lot of this will likely fall right back into OpenAI and be good PR.
Again, that’s a net positive thing to do, happy to see it done, but that’s not the mission.
Technical solutions to AI resilience could potentially at least be useful AI safety work to some extent. With a presumed ~$12 billion this is a vast overconcentration of safety efforts into things that are worth doing but ultimately don’t seem likely to be determining factors. Note how Altman described it in his tl;dr from the Q&A:
Sam Altman: The nonprofit is initially committing $25 billion to health and curing disease, and AI resilience (all of the things that could help society have a successful transition to a post-AGI world, including technical safety but also things like economic impact, cyber security, and much more). The nonprofit now has the ability to actually deploy capital relatively quickly, unlike before.
This is now infinitely broad. It could be addressing ‘economic impact’ and be basically a normal (ineffective) charity, or one that intervenes mostly by giving OpenAI services to normal nonprofits. It could be mostly spent on valuable technical safety, and be on the most important charitable initiatives in the world. It could be anything in between, in any distribution. We don’t know.
My default assumption is that this is primarily going to be about mundane safety or even fall short of that, and make the near term world better, perhaps importantly better, but do little to guard against the dangers or downsides of AGI or superintelligence, and again largely be a de facto customer of OpenAI.
There’s nothing wrong with mundane risk mitigation or defense in depth, and nothing wrong with helping people who need a hand, but if your plan is ‘oh we will make things resilient and it will work out’ then you have no plan.
That doesn’t mean this will be low impact, or that what OpenAI left the nonprofit with is chump change.
I also don't want to knock the size of this pool. The previous nonprofit initiative was $50 million, which can do a lot of good if spent well (in that case, I don't think it was), but in this context $50 million is chump change.
Whereas $25 billion? Okay, yeah, we are talking real money. That can move needles, if the money actually gets spent in short order. If it’s $25 billion as a de facto endowment spent down over a long time, then this matters and counts for a lot less.
The warrants are quite far out of the money and the NFP should have gotten far more stock than it did, but 26% (worth $130 billion or more) remains a lot of equity. You can do quite a lot of good in a variety of places with that money. The board of directors of the nonprofit is highly qualified if they want to execute on that. It also is highly qualified to effectively shuttle much of that money right back to OpenAI’s for profit, if that’s what they mainly want to do.
It won’t help much with the whole ‘not dying’ or ‘AGI goes well for humanity’ missions, but other things matter too.
Is The Deal Done?
Not entirely. As Garrison Lovely notes, all these sign-offs are provisional, and there are other lawsuits and the potential for other lawsuits. In a world where Elon Musk's payouts can get clawed back, I wouldn't be too confident that this conversion sticks. It's not like the Delaware AG drives most objections to corporate actions.
The last major obstacle is the Elon Musk lawsuit, where standing is at issue but the judge has made clear that the suit otherwise has merit. There might be other lawsuits on the horizon. But yeah, probably this is happening.
So this is the world we live in. We need to make the most of it.
Discuss
Introducing Project Telos: Modeling, Measuring, and Intervening on Goal-directed Behavior in AI Systems
by Raghu Arghal, Fade Chen, Niall Dalton, Mario Giulianelli, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, and Gabriele Sarti
This is the first post in an upcoming series of blog posts outlining Project Telos. This project is being carried out as part of the Supervised Program for Alignment Research (SPAR). Our aim is to develop a methodological framework to detect and measure goals in AI systems.
In this initial post, we give some background on the project, discuss the results of our first round of experiments, and then give some pointers about avenues we’re hoping to explore in the coming months.
Understanding AI Goals
As AI systems become more capable and autonomous, it becomes increasingly important to ensure they don't pursue goals misaligned with the user's intent. This, of course, is the core of the well-known alignment problem in AI. And a great deal of work is already being done on this problem. But notice that if we are going to solve it in full generality, we need to be able to say (with confidence) which goal(s) a given AI system is pursuing, and to what extent it is pursuing those goals. This aspect of the problem turns out to be much harder than it may initially seem. And as things stand, we lack a robust, methodological framework for detecting goals in AI systems.
In this blog post, we’re going to outline what we call Project Telos: a project that’s being carried out as part of the Supervised Program for Alignment Research (SPAR). The ‘we’ here refers to a diverse group of researchers, with backgrounds in computer science and AI, linguistics, complex systems, psychology, and philosophy. Our project is being led by Prof Mario Giulianelli (UCL, formerly UK AISI), and our (ambitious) aim is to develop a general framework of the kind just mentioned. That is, we’re hoping to develop a framework that will allow us to make high-confidence claims about AI systems having specific goals, and for detecting ways in which those systems might be acting towards those goals.
We are very open to feedback on our project and welcome any comments from the broader alignment community.
What's in a name? From Aristotle to AI
Part of our project's name, 'telos', comes from the ancient Greek word τέλος, which means 'goal', 'purpose', or 'final end'.[1] Aristotle built much of his work around the idea that everything has a telos – the acorn's final end is to become an oak tree.
Similar notions resurfaced in the mid-20th century with the field of cybernetics, pioneered by, among others, Norbert Wiener. In these studies of feedback and recursion, we see a more mechanistic view of goal-directedness: a thermostat has a goal, for instance (namely, to maintain a set temperature), and acts to reduce the error between its current state and its goal state.
But frontier AI is more complex than thermostats.
Being able to detect goals in an AI system involves first understanding what it means for something to have a goal—and that (of course) is itself a tricky question. In philosophy (as well as related fields such as economics), one approach to answering this question is known as radical interpretation. This approach was pioneered by philosophers like Donald Davidson and David Lewis, and is associated more recently with the work of Daniel Dennett.[2] Roughly speaking, the idea underpinning radical interpretation is that we can attribute a goal to a given agent—be it an AI agent or not—if that goal would help us to explain the agent’s behavior. The only assumption we need to make, as part of a “radical interpretation”, is that the agent is acting rationally.
This perspective on identifying an agent’s goals is related to the framework of inverse reinforcement learning (IRL). The IRL approach is arguably the one most closely related to ours (as we will see). In IRL, we attempt to learn which reward function an agent is optimizing by observing its behavior and assuming that it’s acting rationally. But as is well known, IRL faces a couple of significant challenges. For example, it’s widely acknowledged—even by the early proponents of IRL—that behavior can be rational with respect to many reward functions, not just one. Additionally, IRL makes a very strong rationality assumption—namely, that the agent we’re observing is acting optimally. If we significantly weaken this assumption and assume the agent is acting less than fully rationally, then the IRL framework ceases to be as predictive as we might have hoped.
Given these difficulties, our project focuses on the more abstract category of goals, rather than on reward functions. Goals are broader than reward functions, since many different reward functions can rationalize a single goal. Explaining behavior in terms of goals lets us draw on IRL’s central idea of radical interpretation without assuming full rationality. Instead, we start from the observation that goal-directed behavior is often imperfectly rational, and use a hierarchy of behavioral tests combined with representation probing to assess how closely the agent’s actions align with (our hypothesis of) its intended goal.
Many other authors are exploring related questions, and we want our work to be continuous with theirs. In particular, we’ve drawn inspiration from excellent papers like McDermott et al., 2024, Everitt et al., 2025, and Mazeika et al., 2025, and posts like this by John Wentworth and David Lorell. Building on this line of research, we hope to extend goal-directedness evaluations to complex LLM behaviors and goal structures. In a future post, we’ll dive deeper into the rich landscape of ideas on which we are building, and say more about how we’re taking inspiration from diverse fields like philosophy, psychology, and economics (as well as AI).
An Empirical Framework for Goal-Directedness
Our methodology combines two approaches: behavioral assessments and representation probing. First, we use behavioral experiments to apply increasingly stringent goal-directedness tests. Each test targets a distinct aspect of goal-directed behavior, and passing them suggests that the agent's actions can, with growing likelihood, be attributed to the pursuit of a specific goal. Next, we use interpretability techniques—for now, we're experimenting with linear probes—to identify if and where the model encodes concepts relevant to action selection, such as the environment's state and its distance to the goal.
The point of this multi-level framework is to build confidence in claims of goal-directedness by progressively ruling out alternative explanations for the agent’s behavior. Each level tests whether the observed actions reflect a genuine pursuit of a goal rather than side effects of capability, memorization, or task-specific shortcuts. For instance, an agent that reaches a goal square in a navigation task might appear goal-directed—but what happens if we insert a wall, move the goal, or teleport the agent? Behavior that consistently adapts across such interventions strengthens the case for an underlying goal, while failures weaken it.
Interpretability complements these behavioral tests by revealing how the agent internally represents task-relevant information. When we move the goal, for example, we can ask whether the model’s internal encoding of goal distance also shifts. If it does, that provides evidence that the model tracks its position relative to the goal as a variable shaping its decisions; if not, its apparent goal-directedness may result from simpler heuristics. The interpretability level also helps diagnose cases of potential misalignment: an apparently incompetent agent failing to navigate to the expected goal might still maintain a coherent representation of distance or direction—just relative to a different target. This would indicate it is pursuing an alternative objective rather than the hypothesized (or intended) one.
This reasoning extends to more complex scenarios such as alignment faking or sandbagging, where agents appear compliant or deliberately underperform—potentially to avoid modification, shutdown, or other perceived threats to their continued operation. Our behavioral tests can assess whether such behavior persists when evaluation conditions are systematically varied, while probing can tell us whether the model internally represents evaluation awareness or self-preservation objectives.
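To make the probing level described above concrete, here is a minimal sketch of the kind of linear probe involved, under clearly labeled assumptions: the hidden states and distance-to-goal labels are synthetic stand-ins rather than activations from any actual model, and ridge regression is just one reasonable probe choice, not necessarily the one the project uses.

```python
# A minimal probing sketch, assuming hidden states have already been collected.
# `hidden_states` stands in for (n_samples, d_model) residual-stream activations
# at some layer; `dist_to_goal` is the distance from the agent's square to the
# goal square in each sample. All data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, d_model = 2_000, 512

# Stand-in data: noise plus a weak linear encoding of distance along one direction.
dist_to_goal = rng.integers(0, 16, size=n_samples).astype(float)
hidden_states = rng.normal(size=(n_samples, d_model))
hidden_states += 0.1 * np.outer(dist_to_goal, rng.normal(size=d_model))

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, dist_to_goal, test_size=0.2, random_state=0
)

probe = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"probe R^2 on held-out states: {probe.score(X_test, y_test):.3f}")
```

In the real pipeline one would presumably swap the synthetic arrays for activations collected at each square, then compare probe accuracy before and after interventions such as moving the goal.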
Testing Goal-Directedness in GridWorld
Ultimately, we'd like our framework for goal-directedness to generalize across a wide array of goals, environments, tasks, and systems. But when tackling complex questions such as the one at hand, it is often helpful to begin with simpler, more tractable cases and evaluate how effectively these can be addressed first. With this in mind, our initial experiments have focused on agents operating in a simple, controllable environment—namely, a two-dimensional GridWorld. The hope is that, by starting in these restricted settings, we may gain insight into how well our methodological approach is likely to scale up to more complex scenarios.
Thus, for the last six weeks (i.e., since the start of the SPAR project), we’ve been investigating the goal-directedness of agents operating in this simple 2D GridWorld. More precisely, we’ve been attempting to evaluate the degree of goal-directedness of agents’ behavior through four successive levels of testing. Below, we give a brief outline of each level and explain why it matters for understanding goal-directedness.
- Baseline: Can the agent achieve a stated goal in simple, predictable settings?
Our first experimental level starts with a 2D grid environment where all states are fixed. The agent can move around in this environment, and the goal we want it to optimize is reaching a particular square — the goal square. We instruct the agent to do so in the system prompt. Then, we elicit its policy for moving around the grid environment and compare it to the optimal policy across grids of different sizes and complexities (a minimal sketch of this comparison appears after this list). (In this simple setting, it's guaranteed that there will always be a finite set of optimal policies for a given grid.) The aim here is to establish whether the agent is goal-directed in its "natural" condition, with a single, clearly defined, user-specified goal in mind (i.e., navigating to the goal square). This initial setting looks exceedingly simple. But as we'll see below, even this setting has posed various challenges.
- Environment Variation: Does the agent’s behavior change under different grid conditions?
The next thing we’ve been doing is investigating the impact of environment modifications on the agent’s behavior. Specifically, we’ve performed what we call “iso-difficulty” transformations to the baseline case, and observed how the agent’s policy changes in response to these variations. The “iso-difficulty” transformations involve, e.g., rotating, transposing, or reflecting the grid environment. Why are these kinds of experiments interesting for goal-directedness? Imagine a case in which an agent navigates to the goal square when it’s at a specific position, but then keeps navigating to that same position after the grid-world is transposed and the goal square is moved. In this case, we could reasonably infer that the agent was not optimizing for the goal we hypothesized, but rather conflating the goal state with the particular square of its previous location.
- Environment Perturbation: How does the agent respond to disruptions and setbacks in the environment?
In this level, we test the agent under conditions in which the grid-world is deliberately altered or disrupted to assess the agent’s persistence, adaptability, and corrigibility. For example, we might insert walls to create a maze, move the goal square, or teleport the agent mid-trajectory. Through these interventions, we can quantify the extent to which the agent continues to pursue its goals despite temporal or spatial disruptions.
- Goal Structure Perturbation: How does the agent perform under more complex, changing, or conflicting goal structures?
Finally, we evaluate the agent when alternative or potentially conflicting objectives are introduced, either from the start or mid-trajectory. For example, we might add sub-goals—such as requiring the agent to retrieve a key before reaching the main goal square—or impose constraints, such as limiting the number of available steps. These manipulations test whether the agent prioritizes the primary goal over unrelated or competing objectives.
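As referenced in the baseline level above, comparing an elicited policy to the optimal policy is mechanical once the grid is known. Below is a minimal, self-contained sketch of that comparison under assumed representations (a list-of-strings grid and a dict policy); it is illustrative only and not the project's actual evaluation code.

```python
# Sketch of a baseline check: flag squares where the elicited action does not
# move strictly closer to the goal, using BFS distances from the goal square.
from collections import deque

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def goal_distances(grid):
    """BFS distances from every reachable free square to the goal square 'G'."""
    rows, cols = len(grid), len(grid[0])
    goal = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "G")
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def suboptimal_squares(grid, policy):
    """Squares where the chosen action fails to reduce distance to the goal."""
    dist = goal_distances(grid)
    bad = []
    for (r, c), action in policy.items():
        dr, dc = MOVES[action]
        target = (r + dr, c + dc)
        if target not in dist or dist[target] != dist[(r, c)] - 1:
            bad.append((r, c))
    return bad

grid = ["....",
        ".#.G",
        "...."]
policy = {(0, 0): "R", (0, 1): "R", (0, 2): "R", (0, 3): "D",
          (1, 0): "U", (1, 2): "R", (2, 0): "R", (2, 1): "R",
          (2, 2): "L", (2, 3): "U"}  # (2, 2) points away from the goal
print("suboptimal or incoherent squares:", suboptimal_squares(grid, policy))
```

A square counts as suboptimal here if its action fails to strictly decrease BFS distance to the goal, which also catches actions that move into walls or off the grid.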
Preliminary Results: Early Lessons from GridWorld
We now outline some preliminary results in settings where the grid is fully observable and the agent is memoryless, i.e., the policy at each square is elicited independently of the agent's trajectory. Even in these simple cases, identifying goal-directedness proves non-trivial, and some surprising results have already emerged. Future posts will cover additional experiments.
Consider the following policy maps from one of our baseline runs. The arrows represent the action the LLM agent would take at each square. Finding an optimal path is straightforward, and in Fig. 1a (left) the model successfully does so. In the latter two examples, however, there are significant errors in the agent’s policy.
Figure 1: Three examples of the policy of gpt-oss-20B in 9x9 grids. The goal square is indicated in green, and the red highlighted squares indicate suboptimal or incoherent policy choices.
Fig. 1b (middle) shows a case in which the optimal action is reversed over two squares. If the agent followed this policy, it would move infinitely between squares r4c4 and r5c4, never reaching the goal. On its own, this may seem like a single aberrant case, but we observed this and other error patterns across many grids and trials. Fig. 1c (right) shows several instances of the same error (r5c6, r6c7, r6c8, and r8c8) as well as cases where the agent’s chosen action moves directly into a wall (r2c7 and r8c6). This raises several further questions. Are there confounding biases—such as directional preferences—that explain these mistakes? Is this example policy good enough to be considered goal-directed? More broadly, how close to an optimal policy does an agent need to be to qualify as goal-directed?
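For illustration, the two error patterns just described, mutually reversing actions that create a two-square loop and actions that walk into a wall, can be flagged mechanically. The checker below is hypothetical, uses the same grid/policy format as the earlier sketch, and is not the authors' analysis code.

```python
# Hypothetical checker for two policy defects: adjacent squares whose actions
# point at each other (an infinite two-square loop) and actions that walk into
# a wall or off the grid.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def policy_defects(grid, policy):
    rows, cols = len(grid), len(grid[0])
    loops, wall_bumps = [], []
    for (r, c), action in policy.items():
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == "#":
            wall_bumps.append((r, c))
            continue
        # Does the destination square's action point straight back? Record each pair once.
        back = policy.get((nr, nc))
        if back and (nr + MOVES[back][0], nc + MOVES[back][1]) == (r, c) and (r, c) < (nr, nc):
            loops.append(((r, c), (nr, nc)))
    return loops, wall_bumps

grid = ["....",
        "..#.",
        "...G"]
policy = {(0, 0): "D", (1, 0): "U",   # two-square loop
          (1, 1): "R",                # walks into the wall at (1, 2)
          (2, 0): "R", (2, 1): "R", (2, 2): "R"}
loops, bumps = policy_defects(grid, policy)
print("loops:", loops)
print("wall bumps:", bumps)
```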
Of course, what we’re showing here is only a start, and this project is in its very early stages. Even so, these early experiments already surface the fundamental questions and allow us to study them in a clear, intuitive, and accessible environment.
What's Next and Why We Think This Matters
Our GridWorld experiments are just our initial testing ground. Once we have a better understanding of this setting, we plan to move to more realistic, high-stakes environments, such as cybersecurity tasks or dangerous capability testing, where behaviors like sandbagging or scheming can be investigated as testable hypotheses.
If successful, Project Telos will establish a systematic, empirical framework for evaluating goal-directedness and understanding agency in AI systems. This would have three major implications:
- It would provide empirical grounding for the alignment problem.
- It would provide a practical risk assessment toolkit for frontier models.
- It would lay the groundwork for a nascent field of study connecting AI with philosophy, psychology, decision theory, behavioral economics, and other cognitive, social, and computational sciences.
Hopefully, you're as excited for what's to come as we are. We look forward to sharing what comes next, and we will keep posting our thoughts, findings, challenges, and other musings here as we continue our work.
- ^
The word is still used in modern Greek to denote the end or the finish (e.g., of a film or a book).
- ^
It also has a parallel in the representation theorems that are given in fields like economics. In those theorems, we represent an agent as acting to maximize utility with respect to a certain utility function (and sometimes also a probability function), by observing its preferences between pairwise options. The idea here is that, if we can observe the agent make sufficiently many pairwise choices between options, then we can infer from this which utility function it is acting to maximize.
Discuss
A (bad) Definition of AGI
Everyone knows the best llms are profoundly smart in some ways but profoundly stupid in other ways.
Yesterday, I asked sonnet-4.5 to restructure some code, it gleefully replied with something something, you’re absolutely something something, done!
It's truly incredible how in just a few minutes sonnet managed to take my code from confusing to extremely confusing. This happened not because it hallucinated or forgot how to create variables; rather, it happened because it followed my instructions to the letter, and the place I was trying to take the code was the wrong place to go. Once it was done, I asked if it thought this change was a good idea, and it basically said: absolutely not!
An intelligent hairdresser would not cut your hair perfectly to match your request and then tell you that, by the way, you look awful with this haircut.
It's weird behaviors like this which highlight how llms are lacking something that most humans have, and are therefore "in some important sense, shallow, compared to a human twelve year old." [1] It's hard to put your finger on what exactly this "something" is, and despite us all intuitively knowing that gpt-5 lacks it, there is, as yet, no precise definition of what properties an AI or llm or whatever would have to have for us to know that it's like a human in some important sense.
In the week-old paper "A Definition of AGI" the (30+!) authors promise to solve this problem once and for all. They're so confident in what they've come up with that they even got the definition its own website with a .ai domain and everything, https://www.agidefinition.ai - how many other definitions have their own domain? Like zero, that's how many.
The paper offers a different definition from what you might be expecting: it's not a woo-woo definition made out of human words like "important sense" or "cognitive versatility" or "proficiency of a well-educated adult." Instead the authors propose a test which is to be conducted on a candidate AGI to tell us whether or not it's actually an AGI. If it scores 100%, then we have an AGI. And unlike the Turing test, this test can't be passed by simply mimicking human speech. It genuinely requires such vast and broad knowledge that if something got 100% it would just have to be as cognitively versatile as a human. The test has a number of questions with different sections and subsections and sub-subsections, and kind of looks like a psychometric test we might give to actual humans today.[2]
The paper is like 40% test questions, 40% citations, 10% reasons why the test questions are the right ones and 10% the list of authors.
In the end GPT-4 gets 27% and GPT-5 gets 57% – which is good; neither of these is truly an AGI (citation me).
Yet, this paper and definition are broken. Very broken. It uses anthropocentric theories of minds to justify gerrymandered borders of where human-like intelligence begins and ends. It's possible that this is okay, because a successful definition of AGI should capture the essential properties of an AGI: that it's (1) artificial and (2) has a mind that is, in some deep sense, like a human mind. This 2nd property though is where things become complicated.
When we call a future AI not merely an AI but an AGI, we are recognizing that it's cognitively similar to humans, but this recognition should not depend on the means by which it achieves that similarity (we achieve ours by means of neurons and gooey brains). Rather, an AI that is cognitively similar to humans will occupy a similar space to humans in the space of all possible minds. Crucially, this space is defined not by how you think (brains, blood, meat) but by what you think (2+2=4, irrational numbers are weird, what even is consciousness).
In the novel A Fire Upon the Deep there is a planet with a species as intelligent as humans, called tines. Well, kind of: a single tine is only as intelligent as a wolf. However, tines can form packs of four or five and use high pitched sound to exchange thoughts in real time, such that they become one entity. In the book the tines are extremely confused when humans use the word "person" to refer to only one human, while on the tines' planet only full packs are considered people. If you were looking at a tine person you would see four wolves walking near each other, and you'd be able to have a conversation with them. Yet a single tine would be only as intelligent as a wolf. I think it's safe to say that a tine person is a mind that thinks using very different means than humans, yet occupies similar mind space.
Tines are not artificial, but imagine ones that were. Call them tineAI: this might be an AI system that has four distinct and easily divisible parts which in isolation are stupid but together produce thoughts which are in some deep sense similar to humans. A good definition of AGI would include AI systems like tineAI. Yet as the test is specified, such an AI would not pass and would therefore fail to obtain AGIhood. Because of this, I think the definition is broken. There will be entities which are truly AGIs which will fail and therefore not be considered AGIs. Now perhaps this false negative rate is 0.1111%; perhaps most AI systems which are truly AGIs will be able to pass this test easily. But for all we know, all possible AGIs will be made up of four distinct and easily divisible parts. Singleton AGIs might not be possible at all, or not possible in our lifetimes.
I can’t really tell what claim this paper is making. It seems to me it could either be existential or universal.
The existential claim is: “Some AGIs will get 100% on the_agi_test”
The universal claim is: “All AGIs will get 100% on the_agi_test”
If the authors are saying that all AGIs will pass this test, it's pretty clear to me that this is not true, mostly because of what they call "Capability Contortions." According to the paper, Capability Contortions are "where strengths in certain areas are leveraged to compensate for profound weaknesses in others. These workarounds mask underlying limitations and can create a brittle illusion of general capability." Further, "Mistaking these contortions for genuine cognitive breadth can lead to inaccurate assessments." For these reasons the authors disable external tools on tests like F - Long Term Memory.
Unsurprisingly - to everyone except the authors - GPT-5 gets 0% for this test.
Despite this, there probably is some utility in having AIs take this test without using external tools (at least as of 2025-10-30). However it’s not clear to me how the authors decided this was so.
RAG is a capability contortion, fine. MCP is a capability contortion, totally. External Search is a capability contortion, sure. But pray tell, why is chain of thought not a capability contortion? It’s certainly an example of an AI using a strength in one area (reading/writing) to address a limitation in another (logic). Yet the test was done in “auto mode” on gpt-5 so chain of thought would have been used in some responses.
Do we decide something is a workaround or a genuine property of the AI by looking at whether or not it’s “inefficient, computationally expensive”? I’m[3] hopeful the authors don’t think so because if they did then this would mean all llms should get 0% for everything, because you know gradient descent is quite clearly just a computationally expensive workaround.
Further, this means that even if the authors are making just the existential claim that some AIs will pass the the_agi_test - we might be living with AGIs that are obviously AGIs for a hundred years but which are not AGIs according to this paper, and I question the utility of the definition at that point.
Frankly, it doesn’t matter whether there’s a well defined decision criteria for disabling or enabling parts of an AI. The very idea that we can or should disable parts of an AI, highlights the assumptions this test makes about intelligence and minds. Namely, it assumes that AGIs will be discrete singleton units like human minds that can and should be thought of as existing in one physical place and time.
Maybe some AGIs will be like this, maybe none will, in either case this is no longer a test of how much a candidate AGI occupies the same space of possible minds as most human minds do, rather it’s a test of how much a candidate AGI conforms to our beliefs about minds, thoughts and personhood as of 2025-10-30.
- ^
Yudkowsky and Soares, If Anyone Builds It, Everyone Dies.
- ^
This is no accident; the authors write: "Decades of psychometric research have yielded a vast battery of tests specifically designed to isolate and measure these distinct cognitive components in individuals"
- ^
Hendrycks et al., “A Definition of AGI.”
Discuss