We Need Breadth-First AI Safety Plans

LessWrong.com News - June 1, 2026 - 20:36

Cross-posted from my website.

Depth-first plans lay out a path from here to aligned superintelligent AI. We need those kinds of plans. But depth-first plans depend on many assumptions: "We will make AI safe by doing step 1, then step 2, then step 3." Step 1 only works under condition A, step 2 requires condition B, step 3 requires condition C. If A or B or C is false, the whole plan fails (and there's a good chance we all die).

Consider Google's safety plan from April 2025. To my knowledge, this is the best among the frontier AI companies' plans. [1]

Google's plan depends on a series of conditions:

  1. For the most part, the plan does not consider concrete details of how significantly-more-capable AI systems will behave, instead proposing that Google will figure out how to handle those systems once it understands them better. This only works given (at least) two conditions:
    1. AI capability improvements occur at a relatively predictable pace, with no unexpectedly large jumps.
      • The plan explicitly assumes no "discontinuous" improvements, which is roughly the same thing. It's good that they're being explicit about this.
    2. Once stronger capabilities emerge, there will be enough time to figure out mitigations.
  2. The plan entails putting stricter measures in place once AI systems become sufficiently capable. This depends on at least two conditions:
    1. Google (or somebody) can accurately determine what capability level is dangerous.
    2. Google's evals (or third-party evals) can elicit dangerous capabilities if they exist.
  3. The plan requires using AI to bootstrap AI alignment. This depends on several conditions:
    1. We can successfully align the AI that we use for bootstrapping, or misalignment will be easy (enough) to spot, or alignment isn't necessary (e.g. because humans can use amplified oversight to monitor smarter-than-human systems).
    2. Future Google can be trusted to use enough of its compute to differentially accelerate alignment research, rather than doing something more profitable (for example, differentially accelerating AI R&D).
    3. AI that's useful enough to solve AI alignment does not pose an existential threat.
    4. AI alignment is the sort of thing that can, in principle, be solved by strong-but-not-superintelligent AI.
      • For example, it may be that moral advances are required before we know how to correctly specify how AI ought to behave; and that unaligned AIs cannot contribute to moral advances. [2]

(The plan depends on many more conditions than that, but I'll keep it short.)

That list included eight conditions. If any one of those conditions fails, then the whole plan fails. Some of the conditions seem likely to be true; others seem questionable. But even if every individual condition is probably true, it's much less likely that they're all true.
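To make the arithmetic concrete, here is a minimal sketch of how a conjunction of conditions erodes the overall odds. It assumes the conditions are independent and assigns each a 90% chance of holding; both the independence assumption and the 90% figure are illustrative and do not come from Google's plan.

```python
import math


def joint_probability(probabilities):
    """Probability that every condition holds, assuming independence."""
    return math.prod(probabilities)


# Illustrative numbers only: eight conditions, each 90% likely to be true.
p_all = joint_probability([0.9] * 8)
print(f"Chance that all eight hold: {p_all:.2f}")  # about 0.43
```

Under these assumed numbers, a plan whose every step is "probably fine" still fails more often than it succeeds.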

Disjunctive conditions are better than conjunctive ones. We can see an example in condition 3.1 above: Google's plan can work if it's possible to align the "bootstrapper" AI, OR if misalignment is easy to spot, OR if it doesn't need to be aligned. Disjunctive conditions are good; more of those, please.

We need breadth-first plans:

  • We will take actions X, Y, and Z.
  • X depends on condition A.
  • Y works even if A is false, but it depends on condition B.
  • Z works if A and B are false; it depends on a third condition C.

X + Y + Z works even if two out of three conditions fail.
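A rough sketch of the difference, assuming conditions A, B, and C are independent and each holds with probability 0.5 (numbers chosen purely for illustration): a depth-first plan needs all three, while the breadth-first plan above needs only one.

```python
import math


def p_all_hold(probabilities):
    """Depth-first: every condition must hold (independence assumed)."""
    return math.prod(probabilities)


def p_any_holds(probabilities):
    """Breadth-first: at least one condition must hold (independence assumed)."""
    return 1 - math.prod(1 - p for p in probabilities)


conditions = [0.5, 0.5, 0.5]  # illustrative odds for A, B, C
print(p_all_hold(conditions))   # 0.125
print(p_any_holds(conditions))  # 0.875
```

In practice the conditions are likely correlated, which narrows the gap, so treat this as an upper bound on the benefit of breadth rather than an estimate.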

Some plans have a little bit of breadth. An explicit example from Google's safety plan:

Our approach has two lines of defense. First, we aim to use model level mitigations to ensure the model does not pursue misaligned goals. [...] Second, we consider how to mitigate harm even if the model is misaligned (often called “AI control”), through the use of system level mitigations.

I would like to see more breadth, and recursive breadth—there should be breadth within each component of the plan, and breadth within those sub-components.

The broadest plan that's been published is Peter Barnett & Aaron Scher's AI Governance to Avoid Extinction: The Strategic Landscape and Actionable Research Questions (see also the corresponding LessWrong post). The report explicitly considers four possible future scenarios and how we might achieve a good outcome from within each scenario. The report even includes a flowchart:

The report goes into more detail about the conditions required for each of the four scenarios to succeed.

Barnett & Scher believe "Off Switch and Halt" is the best strategy. They don't exactly phrase it this way, but according to their report, "Off Switch and Halt" depends on the fewest conditions and has multiple ways of succeeding.

How breadth-first plans can inform what we do

I see two big benefits to writing breadth-first plans:

  1. We can identify which paths to success depend on the fewest conditions, [3] and focus more on those.
  2. It's easier to find the biggest holes in the plan.

Root-level breadth matters most

The good news is the branches off the roots are the most important because they have the greatest probability mass. Creating layers of branches off branches off branches quickly gets complicated, but I don't think it's necessary.

My rough attempt at categorizing plans

I made a quick flowchart to categorize AI safety plans at a high level.

  • A blue circle indicates an action
  • A blue square indicates an outcome
  • A red hexagon indicates a necessary condition to achieve an outcome
  • A red pentagon indicates a condition that is helpful but not necessary

The idea is that we need a broad set of overlapping plans such that some plan will work, even if many conditions (red nodes) turn out to be false.

Is this flowchart comprehensive? Definitely not. Is it even accurate? Maybe. My point is that, to make AI safe, we need multiple plans that cover all the ways the other plans could go wrong, and this flowchart is a quick attempt at representing some of those plans.

Future work I'd like to see

  1. AI companies should publish breadth-first plans. What will they do if a step in their mainline plan fails?
  2. Governments should pass legislation requiring AI companies to have plans that cover every item on a list of possible future scenarios.
    • For example, mandate that companies have different plans for different takeoff speeds.
    • AI safety researchers should do research to inform what future scenarios need to be covered.

  1. I originally wrote this article shortly after April 2025, but I procrastinated for a year on finishing it, so I'm not sure about the current state of AI companies' plans. ↩︎

  2. I am skeptical that a bootstrapped-aligned AI will behave morally in ways in which most humans do not behave morally, e.g. eating factory-farmed animals; or that it will be able to correctly resolve the internal inconsistencies in common-sense ethics. For example, in the mere addition paradox, most people accept a set of premises but reject the conclusion that necessarily follows from those premises. [4] ↩︎

  3. Technically, what we want isn't paths that depend on few conditions. We want paths where the joint probability of every condition is as high as possible. But generally speaking, fewer conditions means the probability of success is higher. ↩︎

  4. Philosophy Experiments' Philosophical Health Check asks you a series of questions and purports to identify inconsistencies in your beliefs. I think the questions leave some wiggle room to argue that supposed inconsistencies aren't truly inconsistent, but a more rigorous test would be harder to construct. ↩︎



The remarkable story of AIGS Canada

LessWrong.com News - June 1, 2026 - 17:07

TLDR: Four years ago we put out a short post on LW announcing that an AI governance and safety community had formed in Canada. This is the remarkable story of what happened next. Information on how to join or support us is shared at the end. I will also be at Less Online.


Imagine humanity in a few years, and the development of advanced AI has gone well. We navigated the risks of catastrophic loss of control and kept the most powerful tools out of the hands of bad actors. The benefits and power it created were sufficiently shared. We ended up in a world the vast majority of human beings want to be in.

We’re shaking our heads and smiling in relief and disbelief that we all made it through, and agreeing “...a lot of things had to go well for this to happen, people and institutions around the world had to step up, and also - thank God for f*#king Canada...”.

In a world that will need all the help it can get to navigate AI, every country should set their sights that high.

And so we ask: What would Canada’s contribution have been? Did Prime Minister Carney leverage his Davos speech leadership into effective global coordination on AI? Perhaps it was seed funding for critical AI safety research that unlocked key technical solutions. Or we piloted the first full-scale national conversation on ASI, gaining key insights from the broader public and shaping a global narrative as to what success on AI even looks like.

At a minimum, Canada would need to be situationally aware vis-à-vis superintelligence and making smart decisions.

But for the last few years, the main decision makers in the country have not been giving any indication of this kind of awareness, either in words or deeds. Despite growing numbers of parliamentarians and officials who have been briefed on superintelligence and expressed sincere concern, it has yet to become a political priority in Ottawa.

Enter AI Governance and Safety Canada (AIGS), a nonpartisan not-for-profit launched in 2022 with the question “What can we do in Canada, and from Canada, to ensure positive AI outcomes?” and a talented and determined team of concerned citizens.


The results


To a large extent, our story can be told through what we accomplished. Three years in, our answer to that founding question has included:

  • “A Plan for Canada” policy white papers: Widely respected for their quality, they succinctly clarified what exactly Canada can do (and do well) to positively influence AI outcomes. Notably, the government adopted the top recommendation from each of our 2023 and 2024 white papers (few other orgs were calling for these actions)
  • Dozens of meetings with parliamentarians and government officials: The 2025 white paper messaging in particular had many MPs asking how they can help, and inviting me to testify at committee
  • Seven expert testimonies before the House of Commons and Senate, in English and French, one of which went viral (2.3M views / 119k likes on IG)
  • Comprehensive recommendations on the AI & Data Act: more than any other organisation submitted, translating ASI risk into practical wording for the Bill
  • Media coverage in most major outlets: CBC The National, The Canadian Press, Radio-Canada, CTV News, op-eds in the Toronto Star, and more
  • Connecting 1,500 Canadians across the country online and through events, and attracting over 700 volunteer sign-ups.
  • And many other initiatives along the way


In doing this, we’ve mirrored some of the work of organisations in other jurisdictions such as the EU’s The Future Society and the UK’s Centre for Long Term Resilience. More recently, we’ve been joined in Ottawa by other organisations (such as CIGI and Control AI) doing education and advocacy on AI’s catastrophic risks. And of course, leading scientists Yoshua Bengio and Geoffrey Hinton continue to engage governments and talk to the media.

So what makes us remarkable is not so much our notable accomplishments, or even that we were the first civil society group to do this work in Canada and still the leading one… it’s that we accomplished anything at all.


And that’s the story I want to share now.


Inauspicious beginnings


So at the start we confidently set off with all these accomplishments in mind, got a big grant, hired a team, and set to work, right?

Not quite.

In the Summer of 2022, there were just a couple local meetups and a few dozen people in Canada who had happened to come across the concerns around AGI development and were interested in doing something about it.

I’d pitched EA’s LTFF on a one-year grant to “connect, expand and enable the AGI safety community in Canada”, and got to work finding and connecting people. That’s when I met Mario Gibney, the founder of the Toronto AI Safety Meetup (which has since flourished into the Trajectory Labs co-working and event space) who became my co-founder. Evan Murphy (Vancouver AI Safety researcher) and Briana Brownell (Saskatoon AI startup founder) joined us to form an interim Board of Directors.

That Summer and Fall we created our online Slack community to bring the disparate local groups together, chose a name and website, attended as many AI conferences as possible to get a lay of the land, and prepared to incorporate as a not-for-profit.


A Spring and Fall


Only there was a challenge: how were we going to fund AIGS? It was all fine for me as a community organiser to get a grant, but when we started thinking about what AIGS most needed to accomplish - to directly influence the Canadian government - we realised that traditional EA/OpenPhil AI Safety grants weren’t an option. The then-recent FTX scandal aside, the funders were American (i.e. foreign funding, a credibility risk) and in any case, as charities, couldn’t fund our core political activity.

So we put on a brave face and did an initial fundraiser among 50 community members, with modest success, and AIGS was incorporated April 4, 2023. Our first move was media advocacy to seize on the Pause AI letter coverage and establish our public presence. Mario and I brought on operations and communications contractors to amplify our efforts. Our Toronto Star Op-Ed was a highlight:


It almost worked. That July and August saw tantalising funding opportunities during conversations with two large donors we’d attracted, but despite our efforts it didn’t convert into money in the bank. We ended the contracts, Mario had to step back, and AIGS was left as a volunteer organisation - just a Board of Directors and me as the only unpaid staff.


The grind begins


We could easily have disbanded at that time. My grant had expired in April and I had limited savings.

But Canada still needed an organisation like AIGS, and we weren’t about to stop working on the most important issue of the 21st century just because the money ran out. We also knew that as AI’s impacts grew, our potential as an organisation would too.

So I took on some personal debt and we got back to work. First, we knew we needed to clarify what exactly our calls to action for Canada were. From that came our first white paper Governing AI: A Plan for Canada, which put us on the policy map.

Second, parliament had recently introduced the AI & Data Act. We went all in - spent weeks clarifying what Canada needed an AI Bill for, and carefully translating the concerns around ASI loss of control into specific recommendations for the Bill.


Our first big break - invitation to testify at the AI & Data Act committee hearing


During that time we also expanded our Board of Directors with a range of professionals (including Board Chair Gordon Vala-Webb) to help steer AIGS to success, and launched a Board of Advisors with respected experts to consult on key decisions.

Dashed health, dashed hopes


All that effort didn’t save us. While it did establish our credibility and helped raise modest donations over Christmas, we were still a volunteer organisation, and my runway was now even shorter.

To make matters worse, a week after capping our AI & Data Act testimony, I fractured my femur in an accident. It wouldn’t be the only health issue to significantly slow me (and by extension AIGS) down - I have a chronic condition that among other things can cause severe fatigue and brain fog, and make looking at a screen quite painful. The symptoms are, of course, worsened by the stress of repeatedly having to focus on catastrophic risks from AI and the loneliness of being the only staff.

Seeing that our direct fundraising outreach wasn’t sufficient, we pivoted to launching a project that Canada needed that might also gain corporate or union sponsorship: a National Conversation on AI event series pilot. The goal was to meaningfully engage Canadians in a two-way conversation about where AI is headed and what kinds of futures people want.

It was (and remains) a worthy initiative with interest from a number of universities and civil society organisations, even getting an endorsement from Yoshua Bengio. But five months of work later, it failed to gain any major financial sponsors, and so it was put on the shelf for another day.

That failure meant that by the Summer my runway was now gone and I soon wouldn’t be able to continue as full-time executive director.

A second chance


And then, lo and behold, some money trickled in. At the last moment, a donor stepped up just enough to keep me on full-time and AIGS moving forward.

Moreover, 2024 was the year that volunteers started to show up in numbers. So much so that we had to set up a dedicated intake and onboarding process.

The best news came when Kathrin Gardhouse - Toronto-based lawyer, PhD, and policy expert - joined and immediately started taking on projects, quickly getting promoted to Policy Lead and then to the Board of Directors.

So in September 2024 we looked around and asked “What does Canada most need now?”. With veteran political expert Fraser Green now on our Board, we realised that while up-to-date policy recommendations would continue to be essential, on their own they were too easy for government to ignore. Polls were showing an overwhelming likelihood of a Conservative victory in a Fall 2025 election, and neither Poilievre nor Trudeau was the type to act on the arguments alone. Also, with the acceleration in AI, it seemed very plausible that 2025 might be the last federal election before superintelligence was developed.

So we pivoted in the final months of 2024 to launch the public-facing Coalition for Responsible AI. The idea was to plant a flag so that everyone in Canada who cared about these issues could find us and support the cause. It was primarily a communications campaign - engage Canadians, get ourselves in the news, attract donors, and make AI an election issue politicians had to address.

We launched in January, with synchronised events in 4 cities:

Supporters gather in Ottawa for the Coalition launch event


Politics happens, and also we fall short


In 2025, the first thing to happen was Trudeau resigning and Mark Carney taking over. He immediately called an early election, shrinking our time to prepare from 10 months to 3 months. Meanwhile Trump was inaugurated and began soaking up all media attention, making Canada’s election about who could best stand up to him. AI (and even major political items like cost of living) faded into the background.

We were also relying heavily on a new communications vendor to get us in the news, but they underperformed (especially in English media). And I was still the only full-time staff to keep the organisation running, meaning that I was stretched too thin and also underperformed.

The Coalition failed at its goal of getting public attention.

And when it rains, it pours: in mid 2025 a series of key grants we’d applied for got rejected (in large part because we could only apply for the portion of our work that was apolitical), I was running on empty again, and this time our overstretched donors weren’t able to fill the gap.

The writing was on the wall: the lights were about to go out on AIGS.


Stepping into the abyss


What were we to do next?

One of the things we noticed about Mark Carney is that he had significant experience managing global crises, and his book Values suggested he was a man who cared more about the arguments than public sentiment. Whereas Poilievre and Trudeau would have required a big public advocacy campaign to act, for Carney, a well-crafted plan delivered via trusted advisors seemed like the better approach. We also knew that regardless, we needed to update our white paper for 2025, and it would be the best thing to try fundraising for.

Money was all but gone, but we made a decision:

If we were to go under, we’d do so delivering one final piece of impact.

Could we hold on long enough to deliver?

Having spent the Summer drafting our 2025 white paper while battling a major health flare-up, discussing bankruptcy contingencies with Board Chair Gordon Vala-Webb, and preparing one last fundraising email, the situation came to a head on Sept 1st, 2025.

It’s a day I still distinctly remember.

AIGS’s bank account had nothing left in it, and we owed four thousand to the vendor. I checked my personal accounts - I also had nothing left in the bank, and all three of my credit cards were maxed out.

I emailed my landlord to let her know I was going to be late paying my rent.

The next day we raised $20k.

That last fundraising email, and the pitch around the white paper, had worked.

We then raised another $20k in the following weeks to finish the year at $80k in revenues, which was double our 2024 revenues. This year, thanks to some incredible donors, we’re already at $150k earned or pledged.


A stellar year


The Fall and Winter of 2025-2026 turned into our biggest success by far. The white paper was serendipitously ready to be published right when the new Minister of AI called a snap 30-day public consultation on the new national AI strategy.

The new revenue also allowed us to temporarily bring on a part-time outreach coordinator, who made sure the hundreds of emails and follow ups got to the relevant MPs. That turned into dozens of meetings, which turned into six invitations to testify at committee hearings. The video from one of them then went viral on social media, our biggest visibility yet.

Meanwhile, communications expert Dalia Ezzat volunteered to shape the next chapter of AIGS communications, Shivangi Pandey stepped forward to relaunch the Coalition for Responsible AI later this year with a new vision, and Christopher Tiller, our volunteer Volunteer Manager, started putting in long hours to help keep the community glued together.

We’re alive, and stronger than ever.


700 volunteers and 1 underpaid staff

But a yearly budget of $150k CAD, as immensely relieving as it is compared to previous years, is still not enough to run an organisation on. It means we now have a minimum of stability, but also that we still can’t afford to hire a team.

And that’s been our bottleneck. As stressful as working on catastrophic risks, battling health issues, and surviving existential financial crises has been for the last 3 years, the greatest challenge has been not having any full-time staff to work with me.

Our core volunteer team continues to pitch in remarkable amounts of work - crowdsourcing relevant news, supporting local events, building our tech stack, developing our Canadian AI policy course, and shaping our communications strategy. And the growing number of sign-ups is a huge source of potential for AIGS.

But even the best only have a few hours per week, or are between jobs and have to step back as soon as they regain employment, meaning work has to be shared across multiple volunteers and there is naturally high turnover. Moreover, we’re a remote team spread out over 5 time zones and 2 languages, making it exceptionally hard to maintain team energy, cohesion, and momentum in casual or part-time work setups.


Back to imagining


Now imagine if instead we had a core team of talented communications, operations, and advocacy leaders working full-time together to shape our strategy and harness our rapidly growing volunteer base?

If AIGS were able to poach some of the top talent currently working for corporate interests, and put their skills to ensuring humanity safely navigates AI?

And if Canada actually took those initiatives that a successful post-AI world will be shaking its head about in disbelief and gratitude?

If you’d like to see that happen, you can help.


How to help:

  • Liked our story and want to see us succeed?
    • Give this post an upvote
    • Share it to your preferred platform
    • Email it to a potential Canadian donor who cares about AI going well (or to someone who might know someone who might know someone)
  • Canadian citizen or resident?
    • Join us as a donor:
      • Small donations expand our donor base and help us show broad support. Cherry on the cake? Make it recurring. Donate here.
      • Large donations take AIGS to the next level of impact. For more information and to meet the leadership, email contact@aigs.ca.
    • Join us as a volunteer. Show us how good you are so we can hire you when funding comes through.
    • Join our community online or at events, and help us build momentum in Canada. All are welcome.


Thanks for reading this post and hearing our story. And wherever you are in the world, know that while AI is putting us all under great strain, the human spirit and determination to succeed remain alive and well.


Yours truly,

Wyatt and the AIGS team and community.


*Note for Bay Area readers: I will be at Less Online (giving a 'Dispatch from Ottawa' talk Sunday morning) and in town a little bit after. If you'd like to connect, please reach out or DM me.



Superintelligence of the gaps

LessWrong.com News - June 1, 2026 - 16:00

Many classic AI doom scenarios rely on superintelligence using its vastly superior intelligence to outplan, outcompete and outkill you.

I partly believe this: superintelligence would definitely outkill me.

But I don't believe we will build such superintelligence; not because humans are the apex of intelligence, but because superintelligence, implicitly, has always been about a gap: the gap between the current best intelligence and the newly created one.

We're not in the world where AIs are being created with large gaps of intelligence between each other. Rather, we are in an iterative intelligence development and deployment world. It is technically easy to not have large gaps of capabilities between the current best model and the next, it is ~easy (if costly) to evaluate at regular checkpoints, and ~continuous deployment allows there to be no large gap in deployment either.

We can thus steer away from a large number of doom scenarios (those where a new AI uses its greater capabilities to take over) by simply not creating and deploying models much smarter than their predecessors. The current most intelligent and aligned beings should always be supervising their successor, using more total resources at first, such that they can't effectively be tricked or subverted.

I guess the above is something many "AI optimists" have in mind, and I don't think the technical ease of avoiding large capability gaps should be much of a crux. Whether in practice we'll avoid these gaps seems the more interesting crux for discussion of "fast misaligned AI takeover" scenarios. This is correctly done in @Daniel Kokotajlo et al's AI 2027: the bad ending is caused by pressure toward premature deployment leading to the use of a system suspected of being misaligned, not by the technical impossibility of knowing it's misaligned. It is also what makes that particular scenario unlikely to happen. The leading companies would be more careful than that if they had that level of evidence of misalignment in powerful systems. (I don't think evidence of recklessness with regard to weak systems is strong evidence of recklessness towards strong systems, though corporate and national governance should be set up so that not being reckless is at least possible when the time comes.) It's looking like we're in world C or D of @ryan_greenblatt's plans for misalignment risk (~we don't get a pause, but the leading companies are somewhat careful) and that this is technically sufficient to avoid most fast misalignment doom scenarios.

Most of my p(doom) is thus not on the chance of misaligned AI takeover, but on gradual disempowerment risks.

I don't think we have good solutions here, but at least we have more time to look for them.



Lean, not backpressure

LessWrong.com News - June 1, 2026 - 10:57

Lucas Costa has written a good article on how to build systems that can handle code-generating robots. Unfortunately, when calling it backpressure, he used the wrong metaphor.[1] Backpressure is about signaling to upstream processes that they are running too fast and need to slow down. Note that the suggestions presented by Costa are mostly about signaling to the upstream process that it needs to do things differently, rather than just slow down. This has more to do with ensuring that sufficient quality is sent downstream than with limiting quantity.

This irked me. As I was reading, I was searching for the right analogy. I kept coming back to lean manufacturing. The more famous half of the lean philosophy is waste reduction. The other half is about managing the unstable input of people. That’s what we’re interested in here.

A common approach to the input of people – especially in lower-skilled jobs – is to make line workers responsible for everything. We ask them to be hypervigilant, tell them to never make mistakes, and let them know that if they don’t always perform at their best, they will be chastised … or fired.

Lean, as it is described[2], is much more respectful of line workers and the conditions they are performing their work in. A process designed in the lean philosophy tolerates workers that don’t always perform at their best.[3] It’s about setting up processes and structures that have positive optionality on people’s creativity, without undue requirement on their level of responsibility.

This can take many shapes, but the Costa article reminds me of three concrete practices:

  • Single-piece flow means working on one thing at a time, so downstream processes have a chance to reject before too much of the wrong thing is produced.
  • Autonomation (or jidoka) means giving a machine the ability to detect when something is wrong and not continue at that point.
  • Poka-yoke is a process that forces results to be conformant by construction.
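As a rough illustration, the three practices might look like this in a pipeline that reviews generated changes. Every name here (`Status`, `Change`, `LineStopped`, `check`, `run_line`) is hypothetical and invented for this sketch; none of it comes from Costa's article.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Poka-yoke: only these values can exist, so a misspelled status
    # such as "aproved" cannot be constructed at all.
    DRAFT = "draft"
    APPROVED = "approved"


@dataclass(frozen=True)
class Change:
    description: str
    status: Status


class LineStopped(Exception):
    """Jidoka: raised to halt the line at the first detected defect."""


def check(change: Change) -> None:
    # Hypothetical downstream check; a real one might run tests or a linter.
    if not change.description.strip():
        raise LineStopped("change has an empty description")


def run_line(changes):
    accepted = []
    # Single-piece flow: each change is checked before the next one is touched.
    for change in changes:
        check(change)  # raises instead of continuing past a defect
        accepted.append(change)
    return accepted
```

Passing `run_line` a list whose second change has a blank description raises `LineStopped` before the third change is looked at, and `Status("aproved")` raises `ValueError`. Neither outcome depends on the producer being careful.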

You probably recognise these things as good, but a surprising number of managers seem to think they can just chastise people until quality improves. They talk themselves into this because they believe line workers are fully responsible for their actions.[4]

But even those managers will find it very hard to convince themselves quality improves when they scream at the code-generating robot. It’s a robot! It can’t be responsible for its actions. We have to adopt the lean philosophy for building systems around robots. When something goes wrong, we have to blame the process, not the robot.

We always had to do that, even with people, but with robots it’s painfully obvious.

  1. ^

    Which, to his credit, he seems aware of. It’s just that he’s spent so much time using the wrong metaphor that it’d be silly to switch now.

  2. ^

    It may be different in practice; I’ve read some conflicting accounts.

  3. ^

    One of my favourite things to tell myself when I’ve messed up system safety is “If I designed a process that assumed people would never make mistakes, then whose fault is it really?”

  4. ^

    They aren’t. As Deming said, a bad system will beat a good person every time.



My reactions to “I underestimated AI capabilities (again)”

LessWrong.com News - June 1, 2026 - 04:00

An application response I wrote! Please feel free to leave any feedback!


Describe a recent paper or blog post that has influenced your perspective on AI safety.

“I underestimated AI capabilities (again)” (https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities) came out at the beginning of March. In one sentence: author Ajeya Cotra made capabilities predictions in January 2026, and they were outpaced within two months. Specifically, in January, Claude Opus 4.5’s 50% task horizon was ~5 hours. Extrapolating the historical doubling trend, Cotra predicted that by December it would reach ~24 hours (rounded up); but just six weeks later, Opus 4.6’s was already estimated at ~12 hours. The benchmark underlying the metric is already nearing saturation, even though the metric was explicitly designed to avoid this; the uncertainty range exploded to between 5 and 66 hours.
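A back-of-the-envelope sketch shows how different the two trends are. It uses only the figures quoted above and assumes January to December is 11 months and six weeks is about 1.5 months; the function name is my own, and the outputs are rough because the inputs are rounded.

```python
import math


def implied_doubling_time(start_hours, end_hours, elapsed_months):
    """Months per doubling implied by growth from start_hours to end_hours."""
    doublings = math.log2(end_hours / start_hours)
    return elapsed_months / doublings


# ~5 h in January, ~24 h predicted for December (assumed 11 months apart).
predicted = implied_doubling_time(5, 24, 11)
# ~5 h in January, ~12 h estimated six weeks later (assumed 1.5 months).
observed = implied_doubling_time(5, 12, 1.5)

print(f"Predicted trend: one doubling every {predicted:.1f} months")  # roughly 4.9
print(f"Observed jump: one doubling every {observed:.1f} months")     # roughly 1.2
```

On these rounded inputs, the observed jump corresponds to a doubling time about four times shorter than the one implied by the prediction.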

Cotra then conjectures that once time horizons exceed, say, 80 hours, the metric may lose its meaning altogether, as large software projects actually benefit from decomposition and parallelisation. Thus agents will be able to coordinate to tackle arbitrarily large tasks. The time for a single human to do something is no longer a viable metric; at the very least it must now be the time it’d take for a human team.

In this sense, the benchmark fails to discern meaningfully between models at the frontier because the frontier — the end of the ruler — has been reached. There seemingly aren’t any hundreds-of-hours long tasks that, for humans in real life, wouldn’t be decomposed into teamwork anyway. This has influenced me to believe that the basic science of evaluations and risk assessment is extremely important, as our ontologies going forward may need to be refactored or even reconstructed ground-up. Cotra’s January prediction was pretty reasonable; it all but shows that we don’t have a stable, methodically-derived base rate to extrapolate trends from. And even if we did, capabilities advanced so quickly that we now need to measure something different anyway (agent coordination being a categorically different framework). I question to what extent human-comparability will remain useful as a metric at all.

This one blog post hasn’t made me doomerist, but given again the possibility for emergent, non-domain-slash-task-specific capabilities as purported by the Platonic Representation Hypothesis, assessment frameworks going forward will definitely need profound and thorough design methodology. I recall my first EAG, where Toby Ord emphasised neither long, nor short, but broad timelines — capturing robust, instrumentally useful action items when uncertainty is high. Adapting our first principles in this fashion throughout the knowledge pipeline, from empirical experimentation to expert recommendation to institutional design, may be necessary to build truly accurate predictive world models.




Lizardmen are Not Constant - An Introductory Primer to Thinking about Survey Data

LessWrong.com News - June 1, 2026 - 03:28

The quality of a survey is best judged not by its size, scope, or prominence, but by how much attention is given to dealing with the many important problems that can arise.

-Fritz Scheuren, "What is a Survey?" American Statistical Association, 2004

First a note on scope: this is a brief discussion meant to--hopefully--assist readers in thinking more clearly about how to look at survey data. I will not, however, enumerate all of the issues and considerations that should go into evaluating surveys. At the end, I include links to some freely available guides for survey research and best practices, which I would recommend for anyone who has a greater interest in survey data. Such publications are largely aimed at researchers conducting surveys, but the guidelines provide strong reference points for understanding the things that should go into surveys.

I would be remiss not to acknowledge the initial impetus for this 'primer' is comments that seem to apply the 'Lizardman constant'. Scott Alexander's own 2013 essay on the topic looks at examples from public opinion surveys ('polls') and draws an (almost) entirely correct conclusion (emphasis added): "When we’re talking about very unpopular beliefs, polls can only give a weak signal. Any possible source of noise – jokesters, cognitive biases,[1] or deliberate misbehavior – can easily overwhelm the signal. Therefore, polls that rely on detecting very weak signals should be taken with a grain of salt."

There seem, however, to be issues, as the catchy jargon and title "The Lizardman Constant is 4%" seem to be taken by some readers of Scott Alexander (I do not know whether or not he would endorse the view) to mean "badness is in pretty much every survey at nontrivial percentages" as "[a] constant is always present." At a foundational level, this--I fear--is a lazy, unhelpful way of thinking about survey data. It is also quite different from the attitude Scott advocated in his essay: his conclusion is focused on 'polls'[2] looking at "very unpopular beliefs" and on taking results "with a grain of salt", not (as is sometimes done) on dismissing results that fall below a 4% threshold as at core unreliable.

I fear this leads to a lazy, uninformative way of viewing survey results that is likely to promote biases. If there are just two things I hope you take away, they are these:

  1. There is no hard and fast rule for judging surveys: surveys need to be assessed individually on the basis of their nature and purpose.
  2. Most of the threats surveys are vulnerable to are not constant: different types of surveys are vulnerable to different types of problems.

Lizardmen are Not Constant - Not Even in Polling

Let's address the claim that the lizardman constant is a constant. The problem Scott Alexander's essay addresses is one that the academic literature more often refers to as "bogus respondents" or "spurious response bias": a survey may contain responses that are not genuine, and these may bias results. Some surveys and results are very vulnerable to this kind of error; in other cases the risk is negligible.

To illustrate what this looks like, let's imagine that in the real world 0.5% of people think the earth is flat. We post a public (and as such non-probabilistic) online poll soliciting responses to the question "Is the Earth round? (Y/N)" and get 1,000 responses, of which 40, or 4% (95% CI: 3.0-5.0%), say the Earth is flat. Excluding other biases, we might imagine that if we could read the minds of the respondents, we would observe something like this, with the 'bogus' responses highlighted in red and the genuine responses in green.

The first thing we should note, looking at this case of a non-probabilistic opt-in poll (the kind shown in the illustration), is that bogus responses are not randomly distributed. Bogus responses are much more likely to be false positives than false negatives; if given a series of choices, bogus respondents are more likely to pick the first choice; and, interestingly enough, on surveys that include demographic data they also tend to self-identify as Hispanic or Latino.[3] This is important because it means we cannot just subtract out some constant value: positive results are likely to be significantly biased towards bogus respondents.
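To see why subtracting a constant does not work, it can help to write the observed rate as a mixture of genuine and bogus answers. The sketch below is a toy model only: the 0.5% true rate comes from the example above, while the bogus share and the chance that a bogus respondent answers "flat" are hypothetical values chosen for illustration.

```python
def observed_positive_rate(true_rate, bogus_share, bogus_positive_prob):
    """Share of 'flat' answers when some respondents answer without regard to belief.

    true_rate: share of genuine respondents who really hold the belief
    bogus_share: share of all respondents who are bogus (hypothetical)
    bogus_positive_prob: chance a bogus respondent picks 'flat' (hypothetical)
    """
    genuine = (1 - bogus_share) * true_rate
    bogus = bogus_share * bogus_positive_prob
    return genuine + bogus

TRUE_RATE = 0.005  # the 0.5% from the example above

# Same bogus share, very different inflation depending on how bogus answers lean.
for lean in (0.1, 0.5, 0.9):
    rate = observed_positive_rate(TRUE_RATE, bogus_share=0.04, bogus_positive_prob=lean)
    print(f"bogus answers lean {lean:.0%} positive -> observed {rate:.2%}")
```

With the same 4% share of bogus respondents, the observed rate ranges from under 1% to about 4%, so the correction depends on how the bogus answers are distributed, not just on how many there are.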

Let's say we run the survey again, except this time we take a probabilistic sample and we call, say, 2,000 randomly selected addresses with linked landlines and get 900 responses that might now look something like:

You can see how the bogus respondent problem is reduced but remains substantial: our estimate this time would be 2.1% (with a 95% CI of 1.3%-3.3%). Because the number of bots with landlines registered to addresses is effectively zero, we are no longer getting bogus bot responses. However, some people may still give answers that differ from their actual beliefs: to give a couple of examples, some people who are annoyed at having their dinner interrupted by a pollster may give bogus answers just for the hell of it, and others might mishear the question. Also, there are still certain systematic biases which mean we are unlikely to be able to assume the bogus answers are randomly distributed (e.g., respondents may try to give the answer they think the pollster wants).
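For readers who want to check intervals like these themselves, here is a minimal sketch using the Wilson score interval. The essay does not say which interval method was used, so the bounds this produces may differ slightly from the ones quoted; the count of 19 positive answers is an assumption (2.1% of 900, rounded).

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

opt_in = wilson_interval(40, 1000)  # the opt-in online poll
phone = wilson_interval(19, 900)    # the landline sample; 19 is an assumed count

print(f"opt-in poll: {opt_in[0]:.1%} to {opt_in[1]:.1%}")
print(f"phone poll:  {phone[0]:.1%} to {phone[1]:.1%}")
```

Note that either interval only describes sampling error. It says nothing about how many of the positive answers are bogus, which is the point of the comparison.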

Additionally, survey length, what questions are asked, how they are asked and incentive structures and other factors can all influence the rate and characteristics of bogus respondents.

One might still think that even though rates may vary, bogus respondents are always an issue. This is not true. In practice, probabilistic panel surveys, for example, generally observe very low rates of bogus responses, approximately 0% (depending on the exact survey methods and coding).[4] In addition, most major panel surveys include various controls and cleaning to minimize various forms of bias. Panel surveys may go further and match respondents against externally validated data. Imagine, for example, a study that looks at health consequences for patients receiving care for the flu. It recruits patients across a set of hospitals using diagnosis data and, at regular intervals, calls the patients and has them discuss any health issues with physicians; these are assessed alongside their medical records, which are collected alongside a standard demographic panel. What would you expect the rate of bogus respondents to be? I think most people would intuitively agree it is likely near zero: people have a motive to be honest when their health is at stake, and responses are verified against medical records, which would very nearly eliminate bad actors. However, does that mean you can trust the conclusions of probabilistic panel surveys on their face? No! It just means you don't have to worry about 'lizardmen' or 'bogus respondents' at the same rates--there are other concerns which you should have when assessing such a survey.

What It Comes To - Thinking About Data

Not just for survey data, but for any data you are looking at, one should begin by asking: what is the purpose, and how was the data collected or what does it represent?

Looking for the Purpose - Initial Considerations:

For reviewing surveys, the purpose can be understood as two considerations: (1) what was the purpose behind the survey and (2) what are the results purporting to show. The way a survey is conducted should depend in large part on these: which methods are valid is highly dependent on what you are trying to study.

Some purposes should also make one inherently suspicious of a survey. An obvious example is when there are clear motives that are likely to skew results: for example, blind taste tests run by Pepsi's marketing division purporting to show a preference are likely to have some bias. Just because the survey designer is biased and has a motive to find a particular result doesn't inherently mean that the survey results are wrong or even biased, but it does mean one should be especially skeptical of areas of bias that might have weighted the results in the author's favor.

Other purposes are inherently prone to producing biased results. To risk putting myself in more controversial waters, a study purporting to be "looking to find surprising correlations" should immediately raise suspicions that reported correlations are the result of "data dredging" or "p-hacking." Without delving into the information side of things,[5] if you take enough data across a broad enough dataset, you should expect to find that some things are correlated despite having no real relationship. We commonly refer to this as 'spurious correlation.'

https://tylervigen.com/spurious/correlation/5917_popularity-of-the-first-name-monica_correlates-with_the-marriage-rate-in-nevada

Additionally, the more variables you are looking at, the greater the chance that some correlations are the result of random chance (this can be mitigated if you are using a probabilistic sample that is sufficiently large).
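The growth in chance findings can be sketched with the standard familywise error formula. This is a simplified illustration that assumes every test is independent and uses a conventional 0.05 significance threshold; neither assumption comes from any particular study.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests is
    'significant' by chance alone, when no real relationship exists."""
    return 1 - (1 - alpha) ** num_tests

def pairwise_comparisons(num_variables):
    """Number of distinct variable pairs that could be correlated."""
    return num_variables * (num_variables - 1) // 2

for variables in (2, 5, 10, 20):
    tests = pairwise_comparisons(variables)
    risk = chance_of_false_positive(tests)
    print(f"{variables:>2} variables -> {tests:>3} pairs -> {risk:.0%} chance of a spurious hit")
```

Because the number of pairs grows roughly with the square of the number of variables, a dataset with a few dozen variables is close to guaranteed to contain some "significant" correlation that means nothing.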

A Means to an End - How Purpose Informs Methods and Notes on Instrumentalizing

Generally, methods should be looked at with a mind for what they are trying to show. For example, if a study is trying to support a qualitative examination of some common experiences by people in niche social groups, a non-probabilistic survey like a snowball survey may be perfectly functional. That is, you might take a Facebook group that is part of the subculture you are examining, look at members and friends of members, then friends of friends and so forth to derive a sample that is strongly, deliberately, biased towards the subculture you are trying to study.

However, if a study is trying to estimate the rate of membership in a subculture, using a non-probabilistic sample of this sort would be utterly inappropriate, as it would be certain to disproportionately elicit responses from the population you are trying to estimate. Generally, when reading the methods a study used, try to think about whether they make sense for what the study is looking for and what assumptions they rely on (hopefully, the authors are explicit about this). Statistical methods and checks can limit some forms of bias,[6] but generally you want to be able to assume that the population you are sampling from is randomly distributed across the effects you are looking to study.[7]
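As a rough illustration of the kind of correction that simple weighting can and cannot make, the sketch below reweights an unbalanced sample to known population shares. All group names and numbers are hypothetical, and the approach only helps under the assumption that, within each group, respondents resemble non-respondents.

```python
def raw_estimate(sample):
    """Unweighted share of positive answers across all respondents."""
    total = sum(n for n, _ in sample.values())
    positives = sum(k for _, k in sample.values())
    return positives / total

def weighted_estimate(sample, population_shares):
    """Reweight each group's rate by its known share of the population."""
    return sum(
        population_shares[group] * (k / n)
        for group, (n, k) in sample.items()
    )

# Hypothetical sample: group -> (respondents, positive answers).
sample = {"under_40": (800, 80), "over_40": (200, 4)}
# Hypothetical population shares, e.g. taken from a census.
population_shares = {"under_40": 0.3, "over_40": 0.7}

print(f"raw estimate:      {raw_estimate(sample):.1%}")
print(f"weighted estimate: {weighted_estimate(sample, population_shares):.1%}")
```

If the people who opted in differ from the rest of their group in the very thing being measured, as in the subculture example above, no choice of weights recovers the true rate.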

As mentioned, it is important to look at what the data actually represents and how well it matches what it is meant to represent. For example, let's say I want to study how normal political corruption is for an average person. One might consider a poll question like: "On a scale of 1 to 5, how normal do you feel political corruption is in your political system?" This asks about the person's perceptions of what I am trying to study, which I may be able to assume are correlated with the conclusion I want to draw. Sometimes it might be sufficient to assume perceptions are representative, but things other than what we are studying might drive perceptions (e.g., if corruption is very normal in a society, people may see decreases in corruption as meaning corruption is low, while in a society where corruption is rare, a smaller increase may be perceived as a larger problem).

Instead, I might want to instrumentalize what I want to know in another way, for example by asking 'how often in the last five years has a public official asked you for a favour/bribe for a service?'[8] This is a more direct measure of a form of corruption, but it is also imperfect, as there may be forms of corruption it doesn't capture. I may, therefore, want to ask questions like how likely a person thinks it is that politicians would accept bribes, how often they think judges or police accept bribes, or how often decisions are made on extralegal bases, etc., to develop a more complete picture (though for longer surveys it is harder to get robust, consistent responses).

In general terms, you should look at a question and try to think of other things that responses might represent besides the effects the study is being used for, how likely those alternatives are, and what, if any, measures are in place to rule them out.

A General Note on Bias in Methods

Besides bogus respondents, I have not spent much time on the most common topic in assessing surveys: the various forms of bias. There are far too many to list, but as I indicated in the case of bogus respondents, a good way of thinking about them is to think about the methods themselves and what biases they might introduce. To reuse my example, a robust longitudinal study may have little to fear from bogus respondents, but it should be concerned about attrition bias and how it copes with population changes. Over time, some respondents will drop out (for a variety of reasons that are not randomly distributed, such as dying or migrating), and if the study isn't recruiting new respondents, the sample will skew further the longer the study goes on. You should expect a study to spend more time and effort dealing with the kinds of bias it is particularly likely to face.

Final Advice for Readers and Our Biases

Astute readers will no doubt have recognized that many of these recommendations are less than straightforward and prone to personal judgement and bias. Further, for many, the effort of rigorously reviewing a study's methodology and supplemental material (which in the case of some large, robust panel surveys can constitute hundreds of pages of guidelines, questions and control methods) is not exactly practical. I would urge caution, however, in allowing our biases to judge what we review, particularly with regard to the sniff test. As mentioned (and as Scott Alexander indicated with regard to lizardmen), a small effect is a good reason to view a result with more skepticism. Responding "this result is less than 4% so it should be discarded as within the Lizardman constant" is an unacceptable practice; responding "this result is fairly small so I would want to review whether it could be the result of some confounding effect or bias before I judge it" is good practice, even when we do not have time to review it. When we do not have time to review it ourselves, I would suggest looking briefly through citations, checking whether journals have published comments or retractions, and even just considering the broad length of the methodology section (and online supplements) as a sign of whether there seems to be sufficient scrutiny.

Still, while preferable to outright dismissal, this approach might make one more likely to take for granted things that agree with us while indefinitely delaying judgement on results we find inconvenient. As good practice, if you find a result that is generally viewed as surprising in some way but agrees with you, that is when you should be the most skeptical. On the other hand, where a result is somewhat surprising but contradicts our biases, I would try to approach it with curiosity rather than outright skepticism of what is being discussed, particularly if it appears in a reputable publication. It is quite likely there is an explanation other than what is presented, but then one should wonder what that explanation is, whether the authors themselves thought of the possibilities you might consider, and whether they or others have addressed them.

------------------

A Short Selection of Public Resources, Papers and Examples on Survey Best Practices and Design:

American Association for Public Opinion Research, "Best Practices for Survey Research": https://aapor.org/wp-content/uploads/2023/06/Survey-Best-Practices.pdf

ASA's Proceedings of the Survey Research Methods Section: http://www.asasrms.org/

Podsakoff, et al. "Sources of method bias in social science research and recommendations on how to control it." Annual review of psychology 63, no. 1 (2012): 539-569. https://www2.psych.ubc.ca/~schaller/528Readings/Podsakoff2012.pdf

Pew Research Methodology: https://www.pewresearch.org/our-methods/ (and methodology research: https://www.pewresearch.org/topic/methodological-research/ )

Kennedy et al "Assessing the Risks to Online Polls from Bogus Respondents." Pew Research Center: https://www.pewresearch.org/methods/2020/02/18/assessing-the-risks-to-online-polls-from-bogus-respondents/

The Harvard University Program on Survey Research: https://psr.iq.harvard.edu/book/guides-survey-research

Dillman, D. A. (2000, June). Procedures for conducting government-sponsored establishment surveys: Comparisons of the total design method (TDM), a traditional cost-compensation model, and tailored design. Proceedings of American statistical association https://ww2.amstat.org/meetings/ices/2000/proceedings/S15.pdf

  1. ^

    One critique I would have is I am somewhat unclear on what Scott is including by "cognitive biases" here. Someone who truthfully answers a poll with a belief they derived from their cognitive biases should not be considered among the 'lizardmen', the purpose of polling them is to identify people's actual beliefs.

  2. ^

    As an aside on terms, there isn't necessarily a hard and fast rule on when a survey is a 'poll.' However, polls generally refer to a class of surveys aimed at measuring snapshots of public opinions which can be done by various means (such as probabilistic phone/address sampling, or non-probabilistic online sampling).

  3. ^

    See e.g. Pew Research's discussion of their work on bogus respondents here: https://www.pewresearch.org/methods/2020/02/18/bogus-respondents-bias-poll-results-not-merely-add-noise/

  4. ^

    E.g. "2% to 4% of opt-in poll respondents repeatedly gave answers that did not match the question asked. Throughout the report we refer to such answers as non sequiturs. There were a few such respondents in the address-recruited panel samples, but as share of the total their incidence rounds to 0%." https://www.pewresearch.org/methods/2020/02/18/answers-that-did-not-match-the-question-were-concentrated-in-opt-in-polls/

  5. ^

    There is inherent difficulty in deriving conclusions from correlations when there are lots of potentially related variables involved.

  6. ^

    There is a wealth of literature on working with various forms of regressions specifically for these problems.

  7. ^

    If these residuals are randomly distributed, then even if some groups are overrepresented, as long as that over-representation is random across the effects, larger population estimates can be derived by simple weighting. If the effects are not randomly distributed, such naive weighting doesn't work.

  8. ^

    This is based on an actual question in prior rounds of the European Social Survey https://ess.sikt.no/en/datafile/edee45f2-976b-4c8b-902d-b65dc003c92e?tab=1&elems=366f7e3d-65de-4482-b64c-9fb4b908352a




“This Hypothetical is Unrealistic” is not a Valid Objection

LessWrong.com News - June 1, 2026 - 03:02

Whenever a discussion touches on ethics or philosophy, or relates to guiding principles, hypotheticals become useful. We cannot investigate every idea with real experiments, but we can test the consistency and precision of principles that guide us with thought experiments. It isn’t necessary to see a man murdered in front of you to understand whether that would be good - we can simply imagine it, and realise our principles would, in that scenario, produce an answer. This process - of considering something that has not, or will not, actually occur - is the basis of all counterfactual reasoning. “If X, then what?” is a piece of cognitive machinery without which we would be unable to make sense of the world.

However, it is common for people to respond to questions or statements of the form “if X, then what?”, with the maddening objection “but, not X”.

This objection is a general refusal of the word “if”

ALL hypotheticals are unrealistic - if they are realised, they cease to be hypotheticals. Being unwilling to engage in hypothetical reasoning means you are unwilling to engage in counterfactual reasoning, and are ultimately committed to exclusively considering that which has already happened or is certain to happen. By rejecting the antecedent premise of all unrealised hypotheticals, you forgo the mechanism that allows you to make plans at all.

“If” inherently acknowledges that a thing does not obtain. By positing X as an “if”, you are not validly critiqued on the basis that X does not obtain. Assuming “X does not obtain” is a valid basis to dismiss a conditional premise, then this argument applies to all if statements.

One may retort by claiming the steelman principle is: “X cannot obtain”. However, this does little to alleviate the burden of engaging with contrived scenarios. For one, rejecting conditionals where X cannot obtain commits you to the view that conditionals involving the past are impossible to consider - changing the past is not physically possible. So, questions like “if you never bought a dog, would you have dog food in your house today?” are off the table.

Furthermore, even granting this “impossibility” principle, the set of things which cannot obtain is far smaller than the set of things which most likely won’t obtain. This standard requires proving some contradiction inherent in the premise, or, at a practical level, that a scenario would violate some law of physics held as an axiom.

What law of logic or physics prevents a tennis match between you and Christopher Walken?

Why “contrived” is not a valid critique of a hypothetical

A hypothetical tests a principle. If you say “murder is wrong” without qualifying the statement, you are not saying “murder is usually wrong”, or “murder is often wrong”, you are saying “for all X, if X is murder, X is wrong”. This statement, though intuitive, is, in fact, extreme and virtually indefensible without caveats[1].

The set of “all X” includes the set of “all contrived, extreme-seeming forms of X”, because those things are still X.

Consider the syllogism:

  1. X is wrong
  2. [contrived case of X] is X
  3. [contrived case of X] is wrong

This shows that to say X is wrong (without exception), you are committed to the conclusion that all contrived and unrealistic cases of X are wrong.

So, no matter how absurd-seeming the case of X, the syllogism always holds:

  1. Murder is wrong
  2. Murdering Michael Jackson in a distant marshmallow galaxy is murder
  3. Murdering Michael Jackson in a distant marshmallow galaxy is wrong

Given this, you can test whether the principle of “(all) murder is wrong” holds by looking at contrived cases of murder.

  1. Murder is wrong
  2. Murdering a 99-year-old man who has 1 second left to live in order to save 1000 innocent lives is murder
  3. Murdering a 99-year-old man who has 1 second left to live in order to save 1000 innocent lives is wrong

1 inescapably entails 3 - therefore, if you believe that statement 3 is false, then believing 1 is true produces an outright contradiction.
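The two logical moves used here, instantiating a universal claim and refuting it with a single counterexample, can be stated formally. The Lean 4 sketch below is an illustration only; the predicate names `Murder` and `Wrong` are placeholders for the terms in the syllogism and carry no meaning beyond it.

```lean
-- Universal instantiation: from "all murder is wrong" and "c is murder",
-- conclude "c is wrong", however contrived c may be.
theorem contrived_case_is_wrong {α : Type} (Murder Wrong : α → Prop)
    (h : ∀ x, Murder x → Wrong x) (c : α) (hc : Murder c) : Wrong c :=
  h c hc

-- Refutation by counterexample: if some case c is murder but not wrong,
-- then the universal claim "all murder is wrong" is false.
theorem counterexample_refutes {α : Type} (Murder Wrong : α → Prop)
    (c : α) (hc : Murder c) (hnw : ¬ Wrong c) :
    ¬ ∀ x, Murder x → Wrong x :=
  fun h => hnw (h c hc)
```

Neither proof mentions whether `c` is likely, or even possible in practice, which is why the realism of the case plays no role in the argument.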

Answers to absurd scenarios are necessitated by universal principles

Consider the premise:

“If I become paralysed, I will not be able to ace out any person in tennis”

If someone accepts this principle as true, per the earlier syllogism, they accept it for all cases of “any person”. Therefore, they accept they will not be able to ace out their usual tennis partners, which is obviously true.

However, this also commits them to the view that they will be unable to ace out Christopher Walken.

  1. If I become paralysed, I will not be able to ace out any person in tennis
  2. Christopher Walken is a person
  3. If I become paralysed, I will not be able to ace out Christopher Walken in tennis

Further, “In tennis” does not impose a geographical constraint. So, a tennis match played on Mars would be “in tennis” by definition.

  1. If I become paralysed, I will not be able to ace out any person in tennis
  2. Tennis on Mars is tennis
  3. If I become paralysed, I will not be able to ace out any person in tennis on Mars

We can now put the two syllogisms together:

  1. If I become paralysed, I will not be able to ace out any person in tennis
  2. Christopher Walken is a person
  3. If I become paralysed, I will not be able to ace out Christopher Walken in tennis
  4. Tennis played on Mars is tennis
  5. If I become paralysed, I will not be able to ace out Christopher Walken in tennis on Mars

Therefore, accepting premise 1 deductively requires conditionally accepting the “unrealistic” scenario.

So, if someone says “paralysed people cannot beat anyone at tennis”, and you say “what if they played Christopher Walken on Mars?”, the only coherent answers to give are:

“Yes, including Christopher Walken on Mars”; or

“No, I suppose in that case there might be a chance (perhaps Walken dies first) - therefore, the original statement is improperly specified, i.e., strictly false”

The reply: “But I would never play Christopher Walken on Mars”, is simply an irrelevant statement that fails to appreciate that 1 deductively leads to 5.

Accepting “If X, then Y” does not require any acknowledgement of X being true or feasible.

Why this objection is so common

To the untrained eye, dismissing absurd scenarios looks like rigor. A contrived thought experiment that elicits an absurd conclusion one would never normally endorse can come across as a sophist using a trick; a “slimy debate tactic”. This feeling of being hoodwinked comes from an almost-getting-of-the-point - realising that, indeed, if X is true, then Y would seemingly follow, and Y obviously isn’t true - so something must be awry. The explanation of “some kind of trick” is easier to reach for than the explanation that X may not be as true as you would like it to be.

  1. ^

    Importantly, if one offers the statement of “murder is wrong” in a general sense, it is of course pointlessly pedantic to test it on contrived edge cases to see if the idea holds absolutely, since it is already understood to mean “murder is pretty much always wrong except for some really rare circumstances that I’m not talking about”. However, this dismisses the hypothetical on the basis of relevance, not on the basis of realism. So, if a pedant does challenge the “murder is wrong” premise with an edge-case hypothetical, it is still invalid to say “that edge case would never happen” - instead, the reasonable answer is “yes, not literally all murder is wrong, but we both know that, and a ten-paragraph list of qualifying statements isn’t necessary for the discussion we’re having - you know what I mean”. When the hypothetical test serves no clarifying purpose, and is merely pointing out that the wording of the premise is underspecified per a literal reading, it is a fruitless distraction. However, this “you know what I mean” response would itself only be a reasonable answer as long as the crux of the discussion isn’t contingent on the details of what exactly is meant by the statement.




NLA Thought Anchors

LessWrong.com News - June 1, 2026 - 02:38

The following post looks further into why NLA (Natural Language Autoencoder) outputs contain the prediction more often when the original activations led to a correct output than when they led to an incorrect one.

Quick Summary:
  • Extraction position matters - NLA answer appearing in AV increases as the token approaches the model's final answer
  • First sentence is the most counterfactually important for both activation reconstruction loss and the AV containing the final output
  • Sentences counterfactually important for generating the final answer correlate with lower reconstruction loss, suggesting the AR training reward encourages the model to include correct answers
  • Degenerate NLA outputs (repetition, garbled tokens, emoji blocks) appear only for activations from incorrect model responses.
  • NLA response length varies more for incorrect activations, possibly reflecting model uncertainty
  • Incorrect activations reconstruct ~30% worse than correct ones
Key Findings:
  • Extraction position matters - NLA answer appearing in AV increases as the token approaches the model's final answer
  • Surprisingly, for activations that led to an incorrect answer, the NLA sometimes produced broken or degenerate outputs, for example repetition, garbled tokens, and emoji blocks. This only appears for activations that led to incorrect responses. NLA response length also varies more for incorrect activations, possibly reflecting model uncertainty.
  • The final answer contributes more to the NLA's reconstruction loss when the activations led to the correct output, and less when they did not.
  • NLA seems to have higher reconstruction loss when the activations lead to the wrong answer on the GSM8K dataset
  • The first sentence seems to be the most counterfactually important for NLA AV responses, both for reconstruction loss and for the response containing the final answer (whether measured against the actual answer or the model's response). Counterfactual importance was more evenly spread across sentences for base activations leading to an incorrect answer.
Experimental setup:

Code: https://github.com/Realmbird/nla-thought-anchors

Huggingface datasets I created: https://huggingface.co/collections/Realmbird/nla-thought-anchors


I created a pipeline with the following steps (for further details, see the README):

Step 1 (Generates with Base model)

Step 2  (Generate first NLA explanations with AV )

Step 3 (Generate rollouts and calculate rollouts) (Takes the most time; the arguments I used are a cos_sim threshold of 0.8 and 40 rollouts per sentence)

Step 4 (Analyzes the rollouts)

The other files are mostly for making visuals and analysis, and each notes which step must be run first.
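To give a feel for what the rollout step measures, here is a heavily simplified sketch of a resampling-based counterfactual importance score. It is not the code from the repository above: the function names, the data layout, and the exact definition of importance are all assumptions made for illustration. Only the 0.8 cosine-similarity threshold is taken from the settings mentioned in Step 3.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def counterfactual_importance(base_rate, original_embedding, rollouts, threshold=0.8):
    """Hypothetical score: drop in answer rate when a sentence is replaced.

    base_rate: answer rate with the original sentence kept
    rollouts: list of (resampled_sentence_embedding, contains_answer) pairs
    Only rollouts whose replacement is dissimilar to the original
    (cosine similarity below threshold) count as real replacements.
    """
    replaced = [
        hit for embedding, hit in rollouts
        if cosine_similarity(original_embedding, embedding) < threshold
    ]
    if not replaced:
        return None  # every resample was a near-paraphrase of the original
    return base_rate - sum(replaced) / len(replaced)

# Toy data: one near-duplicate resample and two dissimilar ones.
toy_rollouts = [([1.0, 0.0], True), ([0.0, 1.0], False), ([0.0, 1.0], True)]
score = counterfactual_importance(1.0, [1.0, 0.0], toy_rollouts)
print(score)
```

A higher score under this definition would mean the final answer appears less often once the sentence is swapped for something meaningfully different.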

NLA Setup

The NLAs I used were from https://github.com/kitft/natural_language_autoencoders

I also used its inference code with SGLang.

Base model: Qwen2.5-7B-Instruct

AV: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-av

AR: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-ar

Dataset:

https://huggingface.co/datasets/zen-E/GSM8k-Aug

Experiments:

NLAs are position sensitive:
  • I started with my original NLA script and looked at the rate at which NLA outputs contained the final answer. The rates were clearly too low; then I noticed that I was looking at the last prompt token instead of the token after generation. This led me to the idea that the appearance of the final answer in the NLA output depends on token position.
  • Extraction position matters: the rate at which the final answer appears in the AV output increases as the extraction token approaches the model's final answer. For the answer and hash tokens specifically, correct activations led to the final answer appearing in the NLA output at a significantly higher rate.




  • The resulting difference between the border token and answer token becomes more apparent after doing a few samples or rollouts
  • These results support Ryan Greenblatt's findings that “NLA output contains what the AI will predict at a rate much higher than chance for both incorrect and correct problems”

Model Correctness Impact on NLA outputs:
  • NLAs output more consistent AV response lengths if the original outputs led to a correct response. These findings imply that NLA response length varies more significantly for incorrect activations, potentially reflecting increased model uncertainty.
  • The graph shows per-sentence counterfactual importance from the NLAs: the counterfactual impact of each sentence on the output containing the actual answer (gold) or matching the model's answer (pred).
  • Correct examples are cases where the original base model activation led to the correct response. Incorrect examples are cases where it did not.
  • The first sentence is the most counterfactually important to generate the GSM8K or gold answer
  • For the group of activations that led to incorrect model responses, the most counterfactually important sentence for the model was the last sentence.
  • Counterfactual importance for generating the correct answer seems to be more balanced across sentences when the activations led to the correct response.


  • For the correct model response group (the correct examples), contain-gold and contain-pred should be the same, since a correct answer equals the gold answer.
Incorrect Model Responses have broken AV explanations:
  • An interesting finding is that only when the model outputs an incorrect answer does the AV sometimes generate broken output such as repetition, Wikipedia-style text, forum-style posts, etc.
  • See Appendix for the examples of these categories



Thought Anchors on NLAs:
  • A natural question after looking at counterfactual importance for containing the answer or predicted answer is how it correlates with reconstruction loss.
  • I looked at counterfactual importance for containing the final answer by quartile and found that the more counterfactually important the sentence, the lower the AR reconstruction loss. This suggests that the AR reconstruction objective encourages including the final answer in the AV.
  • NLA responses from activations where the model outputted the correct response have, on average, a lower reconstruction loss. The NLA struggles more with incorrect responses.



NLAs and Reconstruction Loss:

Final Answers and Reconstruction Loss:
  • For some reason, ablating all instances of the final answer has a larger impact when the model outputted the correct answer than when it outputted an incorrect one, but only for the border token (####). This did not occur for the answer digit token.
  • For the answer token, ablating the answer from the AV seems to have a constant effect regardless of whether the original activations led to a correct or incorrect output.
  • The higher impact of containing the answer on reconstruction loss seems to indicate that the border token should be used if you want to include the final answer in the AV. However, if you do not know whether the output is correct or incorrect, the answer token is better because of its more consistent impact on AR reconstruction.


Per Sentence Reconstruction Loss:
  • The change in reconstruction loss per sentence (its range) varies more when the model originally generated an incorrect response.
  • The change from correct to incorrect examples is more noticeable for the answer token than for the border token.
  • Border token or ####
  • Answer token reconstruction loss by sentence






Takeaways:

Limitations:
  • The NLA was for Qwen2.5-7B-Instruct, and the smaller model may have surfaced issues that would not occur in bigger models. Will the incoherent AV responses on incorrect model responses also happen with bigger NLAs?


Future Work:
  • Try different cos_sim thresholds for similarity; maybe this is a threshold issue
  • Investigate further the impact of the model's response being correct or incorrect on the NLA
  • Investigate with bigger models (apply for a BlueDot rapid grant)
  • Clustering NLA sentences and labeling them


Appendix:

Degenerate Examples (the prompts that look broken are a display issue from trying to render them as images)

  • Garbled Token  [:checked:checked]
  • Emoji block
  • Wikipedia
  • Repetition
  • Forum post

Coherent Examples

  • Answer box
  • Final Answer
  • Closing_format


Mixed

  • Calculator





Lighthaven East - A Feasibility Study

LessWrong.com News - June 1, 2026 - 01:53

As a bureaucrat, my role is to annoy my friends. Someone voices an idea, “Wouldn’t it be nice if…” or “I wonder if we could…” I make a note. I do some estimates. If it pencils out, I’ll bring it back up, week after week. The discussions are fun, but also practical. We’ll test the waters, what would be a minimum viable scheme? What’s easy, what’s hard? Who could do the hard parts? Over time the idea gets more detailed, specific, feasible. I’ll pull out a calendar. Soon our scheme has co-conspirators, action items, even a budget. It’s just good staff work.

I’ve been hearing whispers in the wind for a year now. 

  • “Imagine if we had something like this in DC.” 
  • “Where can I host an event that might get a dozen or a hundred people?” 
  • “It’s such a pain in the ass to book event space in the Capitol.” 
  • “I think this person has started to see what’s coming, where can they go to get caught up?”
  • “The community seems to be growing but it’s all fragmented in group chats.” 
  • “How is no one planning an afterparty, that’s clearly the highest leverage intervention!?”
  • “Why can’t every wall be whiteboards?”

These are all variants on a theme: “Lighthaven East.” 

I did some digging. I’m happy to report that this could work. There’s strong demand. There are good options for supply. Funding, staffing, resources, property, and permits are all doable. The hard parts are diligence, agency, and will. This project needs a champion, but it’s a thing someone can simply choose to do. 


How Lighthaven Works

Legally speaking, Lighthaven is a confusing category error. It was once the ramshackle “Rose Garden Inn,” with several buildings, a hotel license, and a history of event use. After extensive renovations, it is now a 30,000 square foot campus used for conferences, retreats, office space, and medium-term lodging. The property is owned by Lighthaven LLC and financed by an interest-only mortgage held by a philanthropist. The LLC runs the property, hosts internal events, rents conference and office space to external customers, and sells hotel stays. Lighthaven LLC is itself owned by Lightcone Infrastructure, a non-profit that among other things runs LessWrong.

Economically, Lighthaven LLC generates an operating profit comparable to its cost of capital. The mortgage is $20 million at 5% interest, for an annual interest payment of $1 million. Lighthaven LLC had $3.25 million in revenue in 2025. Events and hotel stays generated an operating profit of roughly $850k, almost enough to pay the $1 million annual interest payment. Office space seems to be offered at cost. Lighthaven LLC’s projections of $3.5 million revenue in 2026 should generate an operating profit sufficient to fully fund its annual interest payment, though bookings are currently sparse for this fall. 
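As a rough check on the figures in this paragraph, the interest arithmetic can be sketched in a few lines of Python. The variable and function names are illustrative only; the dollar amounts are the ones quoted above, and "roughly $850k" is treated as exactly 850,000 for simplicity.

```python
# Figures quoted in the paragraph above; names are illustrative.
MORTGAGE_PRINCIPAL = 20_000_000  # dollars
INTEREST_RATE_PERCENT = 5        # interest-only mortgage
OPERATING_PROFIT_2025 = 850_000  # "roughly $850k" from events and hotel stays


def annual_interest(principal: int, rate_percent: float) -> float:
    """Yearly payment on an interest-only loan."""
    return principal * rate_percent / 100


interest = annual_interest(MORTGAGE_PRINCIPAL, INTEREST_RATE_PERCENT)
coverage = OPERATING_PROFIT_2025 / interest   # share of interest paid by operations
shortfall = interest - OPERATING_PROFIT_2025  # gap left to close

print(f"Annual interest: ${interest:,.0f}")
print(f"Coverage: {coverage:.0%}, shortfall: ${shortfall:,.0f}")
```

On these numbers, 2025 operations covered most but not all of the interest payment, which matches the "almost enough" framing.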

In practice, Lighthaven is the best event venue I’ve ever seen. I won’t belabor that point in this post, but if you haven’t been to Lighthaven, see some of its many rave reviews in this footnote.[1] Lighthaven LLC does not maximize profits–events are often experimental or designed primarily to support the Berkeley community, rather than the booking going to the highest bidder. Some event organizers are not charged, others are offered discounts on rates that are already lower than similarly sized spaces at hotels. This pricing strategy generates significant positive spillover effects and goodwill, demonstrated by the community’s strong response to Lightcone’s two fundraisers. While Lighthaven was a significant cost center for Lightcone in 2023 and 2024, by 2026 it is better modeled as supporting the parent non-profit. 

Conceptually, Lighthaven is a monastery. Its main purpose is to support good scholarship “dedicated to making humanity’s future go better.” Its abbot skillfully wields an awkward mix of temporal, cultural, and political authority. Monasteries often support their ecclesiastical mission by selling craft goods such as beer, eggs, mushrooms, or furniture–Lighthaven instead sells conference space. Unlike an abbey selling produce for revenue, the conferences at Lighthaven also further Lightcone’s mission. Lighthaven’s scholars-in-residence synergize with its mission, contributing to and benefiting from the events held on the property.

These aspects combine into a whole: Lighthaven is the place to go to think out loud. Comfortable perches encourage deep thought. Inviting conversation nooks encourage you to refine your ideas with friends, themselves helpfully provided by the events and scholars-in-residence. Beautiful seminar rooms encourage you to share your ideas, refining your presentation to best convey them to others. Once your ideas are fully baked, get the word out via your laptop, the antique typewriters, or having a friend interview you in the podcast studio.


What Does DC Need?

DC culture has a Lighthaven-shaped hole. Politicians have started to notice that they are confused about the future of AI. AI Policy nonprofits rent event space, mostly bars and restaurants for expensive and echoey events to grab a few minutes of staffers’ time. AI companies try to use these same spaces for technical demos, sometimes mixing beer and laptops with limited success. Technical communities of practice have unprecedented attendance as practitioners realize they need to upskill. EA and Rationalist policy organizations are scaling in DC, but each option for co-working space comes with significant downsides. Aligned conferences happen, but are held in either hotels with huge up-front costs or group houses well below their optimal attendee-count.

Resources are there to address all of these problems, people are working hard on them. But everything is scattered, hard to find. One step doesn’t necessarily lead to others. Imagine instead that someone approaching the community could have a day like the following...

A Day in the Life

Our protagonist is a tech policy staffer on a relevant congressional committee, mid thirties, has spent their career in positions of increasing authority in government and not-technically-government organizations. They’re an expert in telecom policy, or broadband, or electrical grid economics, or some other sub-field of technology policy, but now they need to learn about AI. The whole office knows there’s going to be a flood of AI bills in the 120th Congress, beginning January 2027, and there are only six people on the committee staff working on technology policy at all. Everyone needs to “get smart on AI,” immediately.

Through some coverage of a book with an edgy title, they understand that this topic is risky in some controversial way. A friend on a different committee recommended they meet with a particular non-profit. The non-profit has a few people in DC permanently, but as luck would have it, this is the week when some of the senior people are visiting from California. The committee’s available conference room only seats four comfortably, so our protagonist decides to go to them, meeting at their co-working space. It’s a lovely spring Friday in DC, it’s only a mile, it’ll be a nice walk.

When our protagonist arrives, they realize they’ve been here before. There was an industry event here last month, in the main room on the first floor, showing the capabilities of some new coding system. It seemed impressive, and they requested access once back at the office, but the Architect of the Capitol won’t let that code onto government systems for at least a year. That denial is what prompted our protagonist to gripe to their colleague on another committee in the first place, ultimately prompting this meeting.

This time, they go to the co-working space on the second floor. It seems… nice, if a bit weird. It’s hard to put their finger on why the space seems brighter and more alive than a typical WeWork. Some of the furniture is custom, fitting its space exactly without being ostentatious about it. Other pieces are clearly from Ikea, but work well enough. The space has all the cliche amenities of offices in the Bay, yet these actually seem to be used, several people are sitting on some plush carpets in a corner. There are whiteboards everywhere, just a ridiculous number of whiteboards, and even the windows… no that’s different, someone has put stained glass stickers in the top third of each. Why so many paperclips?

The meeting goes well. It narrows on a particular technical point about halfway through. The non-profit staff flag down someone walking by, who quickly clarifies that he’s with a different organization, but he joins in and within minutes is diagramming the disagreement on one of the whiteboards. It seemed silly, everyone knew what those words meant, but I guess it did clear up their confusion.

Now that our protagonist is following, they want to know more. As luck would have it, there’s a conference this weekend on-site. The monitor on the wall shows there’s going to be a session on this technical point in the evening, and a workshop tomorrow afternoon. Is it too late to register? Hmm, let’s ask the organizer, they’re probably setting up downstairs. They find him avoiding the choreographed chaos in one of the many nooks, rearranging the schedule for the seventh time. There have been a few cancellations, we can print another badge…


Minimum Viable Lighthaven

To start to put together something like this, we need to figure out the smallest plan that might work. I believe a Minimum Viable Lighthaven requires a few key features:

  • Permanence - dedicated space that we can change to our liking, redesign to support good conversation and thought. Either owned outright or a long-term lease with stable financing.
  • A large room - for speaking to all attendees of an event at once.
  • Nooks - lots of smaller spaces for conversations, with at least some physical separation and sound dampening.
  • Good location - a retreat venue you go to twice a year can be inconvenient, but this should be the obvious choice for a variety of events. 
  • Consistent leadership - an ultimate decisionmaker, whose decisions stick.

Some features that are not strictly required, but are very nice-to-have if we can, include:

  • Segmentable space so that smaller groups can hold events at the same time.
  • Dedicated office space, segmented into rooms.
  • Good architecture.

Hotel rooms would be a mixed blessing. They would allow us to host weekend-long residential retreats, as Lighthaven does, but it’s extra space that we’d have to purchase, maintain, and manage. This could double the overall cost of the project, without necessarily doubling steady-state revenue or providing as much value as the conference space itself. If the right property comes with hotel rooms, it could be worth it, but I think we should prefer to keep them to a minimum. And more practically, DC has avoided Berkeley’s market failure in lacking hotel rooms. 

…so you mean a Group House?

Could a large group house qualify as a Minimum Viable Lighthaven? 

Workshop House in DC is a case in point. It’s gorgeous, the residents and leadership are friendly, they host excellent gatherings directly and rent their space out to outside events. I’ve hosted a large event and several smaller gatherings there, they’re fantastic to work with. But it’s telling to see where even such a successful property and institution falls short of what we’re looking for. 

The trouble is that it’s primarily a residence, the needs of the residents come first. Only about 2,500 of its more than 7,000 square feet is available for event use. Its largest room holds about 60 people, tightly. When booking, a lot of decisions need to be run by several stakeholders, getting to “yes” on specifics inevitably takes time. Their space is something the residents can graciously offer for up to a few days, as opposed to dedicated event space. 

As a case study explains:

Originally, residents were excited about renting out the ground floor open space fairly regularly (e.g., for a nonprofit’s quarterly board meetings). Over time, they found that this was more disruptive than additive, and have limited outside rental or space loans only to those which overlap heavily with the interests of the existing residents and community. Outside groups are now welcome to rent the space for aligned, recurring events up to twice a year, but the house is much more interested in hosting one-off, pilot events and gatherings.

It’s tantalizingly close, but I think experience shows that primarily residential spaces don’t have the quality we’re trying to capture. Even if no one lived there, the floorplan would typically not work well. Conferences and retreats want some sort of large space that can hold everyone, at least briefly. I can’t find hard-and-fast rules, but I think this implies about a quarter of your total programming space should be a single room. Smaller houses can fit that criterion, but larger houses don’t tend to have a large single room that scales with the number of bedrooms unless it was specifically designed for entertaining. Even Lighthaven struggles by this measure, with the largest sessions of LessOnline and Manifest straining Rat Park (which holds about 300).

…so you mean a Co-Working Space?

Yes and no.

From what I can find, non-profit Co-Working Spaces in our community don’t tend to be self-sufficient. NET in DC, Mox in SF, Collider in NYC, and I believe Constellation in Berkeley each use grants and/or donors to sustain regular operations. While the organization could certainly seek grants for special projects, we’d want to avoid institutional fundraising to sustain regular operations. Churn is a big part of this, people leave co-working spaces when they start working for a larger org, when their project fails, and when their project succeeds and outgrows the space. Given the many ways individual co-working users can exit, even if you manage to fill the space briefly, you won’t stay full without a strong pipeline of new entrants, which can risk the institutional culture. 

Further, spaces that are primarily offices just don’t feel comfortable to use. Even when the space is comfortable for the workers, it isn’t when using the space for other events. It can feel like an intrusion, there’s a friction to moving someone’s desk aside, or sitting down at it. IMO co-working spaces make good overflow space, and reasonable break-out spaces at conferences, but should not be a majority of the space. They certainly should not intrude into the main large event hall, which should be optimized for events.

That said, I think co-working is a crucial component of this project. Having our community use this project as office space helps establish it as a default meeting space, seeds conversational serendipity, and even makes it safer by providing more eyes-on-the-street. It brings in some revenue from the property during weekdays, which are otherwise hard to rent to events, and creates a built-in audience for evening talks and events. Co-working space could be a practical substitute for Lighthaven’s Scholars-in-Residence. Co-working space is a key pillar, it just shouldn’t be the main focus of the plan. 


Feasibility Study

A Minimum Viable Lighthaven DC as envisioned by this feasibility study would have three main lines of business:

  • Conferences
    • Professional policy-focused conferences during the work week.
    • Looser, more Lighthaven-style conferences on weekends. 
  • Nonprofit and Corporate Events
    • Mainly evenings during the work week - a staple of the DC policy world
  • Co-Working for aligned organizations
    • Likely including intensive fellowship programs, such as Inkhaven.

In my rough estimates, it’s difficult to make a venue self-sufficient with any one of these uses; which is why this venue doesn’t already exist. Including any two of them should be self-sustaining, even at 60-70% occupancy. Doing all three adds complexity, but each synergistically reinforces the other, three legs of one stool.

Given that, I think we want to buy a church. Failing that, a school, an embassy, or a small hotel. 


Property

We probably don’t need a campus, specifically. The climate in Berkeley is obnoxiously perfect for outdoor use much of the year. Temperatures rarely drop below freezing or exceed 85 degrees Fahrenheit. Most rain falls in December through March, leaving eight months of drier, warmer weather. But even this is underselling the usefulness of the outdoor space at Lighthaven: I began writing this document in the Gazebo of Schemes on a bright, clear day that was too warm for a sweater… in February. This lets outdoor space double as programming space, significantly expanding the campus’s usable square footage and making the buildings feel more connected. When warm days transition to cool evenings, guests gather around fire pits or under blankets in nooks. DC is not this way. Summers are hotter and more humid. Winters are colder, occasionally with snow. Spring and fall tend to be nice, but are unpredictable. Event organizers in Berkeley can plan on outdoor space being usable; organizers in DC cannot rely on outdoor space in the same way.

When looking for a site in DC, we should consider the current zoning and historic use of the property. The Rose Garden Inn was zoned as “Avenue Commercial,” operated as a hotel, and had a demonstrated history of event use. The hotel license was included in the purchase and no zoning variances were required. The new owners continue to operate the property as a hotel that also rents conference space, there was no legal change in use. If Lighthaven had been zoned residential, it would have required a vote of the Berkeley city council to change its permitted use, adding at least a year and substantial risk of a “no” to any project’s timeline. Instead, Lighthaven operates “by right,” which should mean it doesn’t need much from the city.

In practice, Lighthaven works closely with the city, requires permission and permits for most improvements, some repairs, and even occupancy. Permits and inspections in old properties, especially those designated as historically relevant, necessarily require subjectivity. It often isn’t practical to bring a historic building up to modern code, but it is a judgement call just how much improvement to require. Unrelated matters, such as neighbors’ perception of how often Lighthaven guests park legally on residential streets, are not supposed to be relevant to those decisions… and yet… that’s just how people work. It is important to strive for good relationships with one’s neighbors regardless, but any city's politics may have more veto points than a straightforward reading of the law would imply.

How large should this property be? Lighthaven can comfortably host conferences of about 500 people, and parties of roughly twice that number, using about half its 30,000 square feet available as public space. This suggests a rough comfort estimate of about 30 square feet per conference attendee, though we’ll want to aim a bit higher if not relying on outdoor or private hotel space as pressure valves. However, I don’t think we need quite as much public space as Lighthaven has. It might be better to aim for a property with about 10,000-12,000 sqft of public space, giving a capacity of 250-300 for conferences (which works out to roughly 40 square feet per attendee) or 500-600 for evening events or parties. This would be particularly appealing on sites with options for later expansion.
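The sizing estimate above can be reproduced with a short sketch. The 40 square feet per attendee used for DC is an assumption: it is the density implied by the quoted 250-300 range, not a figure stated in the text. The function names are illustrative.

```python
def sqft_per_attendee(public_sqft: float, attendees: int) -> float:
    """Public floor area available per person."""
    return public_sqft / attendees


def capacity(public_sqft: float, sqft_each: float) -> int:
    """Whole number of people that fit at a given density."""
    return int(public_sqft // sqft_each)


# Lighthaven baseline: about half of 30,000 sqft is public, hosting ~500.
LIGHTHAVEN_PUBLIC_SQFT = 30_000 / 2
baseline = sqft_per_attendee(LIGHTHAVEN_PUBLIC_SQFT, 500)
print(f"Lighthaven baseline: {baseline:.0f} sqft per attendee")

# Assumption: "a bit higher" than the baseline means about 40 sqft each.
ASSUMED_DC_SQFT_EACH = 40

for public_sqft in (10_000, 12_000):
    conference = capacity(public_sqft, ASSUMED_DC_SQFT_EACH)
    party = 2 * conference  # parties run at roughly twice conference size
    print(f"{public_sqft:,} sqft: {conference} conference, {party} party")
```

At the unadjusted 30 sqft baseline the same spaces would hold 333 to 400 conference attendees, so the quoted range already builds in a comfort margin.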

So, this Minimum Viable Lighthaven would want a 12,000 sqft property with appropriate zoning, a history of event use, enough spare space to operate part of the property while other portions are being gradually renovated, with roughly 3,000 sqft of its space as one large room. This describes a church, in particular one with attached program space or a rectory. Other kinds of properties that might work include small schools or hotels, so long as they have an auditorium or other large event space. Embassies, or technically their chanceries, are another option; rare but appealing. Countries’ needs change over time, so chanceries do occasionally change hands as an embassy needs to upgrade or downsize. Mansions do not appeal unless already converted to and zoned for commercial use, such as for a wedding chapel.

This project is not necessarily dependent on what properties are listed for sale; the Rose Garden Inn was not. Lightcone Infrastructure approached its owners after scrolling satellite pictures of the East Bay on Google Earth to identify prospects. City churches often move to the suburbs as their membership ages and neighborhood tastes change, religious schools face similar dynamics. We might find a congregation willing to sell.


Funding

There are institutions building portions of this already. The Network on Emerging Threats hosts coworking space and monthly policy talks. It recently announced it was moving to a larger space, but the new space is less conducive to events. IFP and FAI host excellent large evening events on the roof of their office building, but the events have started to outgrow the space, the rentals are expensive, and FAI recently moved out of the building. EAs and Rats in the area also have a large social scene, with parties and socials most weekends, and policy-focused events most weekday evenings. A typical week has 8 or more public events scheduled, which are often constrained by the capacities of their venues. I estimate the community already spends over $50,000 in an average month on co-working and event space, not including the dedicated offices of larger organizations (like IFP) or larger, irregular conferences.[2]

Good commercial property in DC tends to go for up to $750 per square foot. (As an example, this property is larger and more ornate than ideal, but currently for sale at that cost.) Our desired 12,000 sqft should cost between $7-10 million. Interest rates are higher now than when Lighthaven was purchased, but if we can find a philanthropist willing to lend at 8% interest, even $12 million, the top of that budget plus $2 million for capital improvements, would come to just under $1 million per year in interest. Lighthaven has $2 million per year in operating costs. A Lighthaven DC would be smaller, without hotel rooms by default, and in a region with a lower cost of labor/living. There would still be significant operating expenses, a full-time director, other full- or part-time staff, supplies, insurance, utilities, etc, but I think $1 million per year is a reasonable budget.
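A quick sketch of the purchase and financing arithmetic, using only the figures quoted above. The names are illustrative, and the $750 per square foot is treated as a ceiling, as the text describes it.

```python
TARGET_SQFT = 12_000
PRICE_PER_SQFT_CEILING = 750  # "up to $750 per square foot"
LOAN_AMOUNT = 12_000_000      # $10M top of budget + $2M capital improvements
LOAN_RATE_PERCENT = 8

# Upper-end purchase price at the quoted ceiling.
price_ceiling = TARGET_SQFT * PRICE_PER_SQFT_CEILING

# Interest-only cost of capital per year.
yearly_interest = LOAN_AMOUNT * LOAN_RATE_PERCENT / 100

print(f"Price at ceiling: ${price_ceiling:,}")
print(f"Yearly interest:  ${yearly_interest:,.0f}")
```

The $9 million ceiling price sits inside the $7-10 million range, and the interest comes to $960,000, which is the "just under $1 million" in the text.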

Between cost of capital and operating expenses, the property would need roughly $2 million per year in revenue to break even. Events and co-working on the order of what the community already has today, while capacity constrained, could cover a third of this. Many existing events wouldn’t move locations, but a venue like this would be an attractive option for new events and co-working uses. Adding in some additional latent demand, one evening corporate rental per week, and one large weekend conference per month would get the property to break-even, with substantial room to improve its offerings if it can manage more bookings. 
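The break-even figure can be sketched as follows. One assumption is labeled in the code: the text says existing activity "could cover a third" without showing the calculation, and this sketch compares that claim against the roughly $50,000 per month of community spending estimated earlier. Tying the two together is an inference, not something the text states.

```python
# Rounded annual figures from the preceding paragraphs.
COST_OF_CAPITAL = 1_000_000     # "just under $1 million per year in interest"
OPERATING_EXPENSES = 1_000_000  # proposed operating budget

required_revenue = COST_OF_CAPITAL + OPERATING_EXPENSES

# Assumption: use the earlier "$50,000 in an average month" estimate as a
# proxy for revenue on the order of today's community activity.
COMMUNITY_MONTHLY_SPEND = 50_000
community_annual = COMMUNITY_MONTHLY_SPEND * 12

share_covered = community_annual / required_revenue
remaining_gap = required_revenue - community_annual

print(f"Required revenue: ${required_revenue:,}")
print(f"Existing activity: ${community_annual:,} ({share_covered:.0%})")
print(f"Gap for new events and rentals: ${remaining_gap:,}")
```

On this proxy, existing activity covers about 30% of the requirement, close to the "a third" in the text; the rest would have to come from latent demand, corporate rentals, and weekend conferences.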


What is the Minimally Viable Funding?

I think that a founder should not purchase property until they have secured at least the minimally viable amount of funding. There are a few key things this includes:

  • Purchase of the property itself - likely secured with a mortgage
  • Initial furnishing, repairs necessary to start operations.
  • At least one, hopefully two, renovation projects in portions of the property.
  • Enough runway in operating and mortgage costs to get to self-sufficiency. 

That last bullet is likely to be the sticking point. Every project takes time to reach full operations, this would be no different. I estimate reaching self-sufficiency would take at least two years, unless cutting corners and limiting the ambition of the project to reach that milestone earlier. Depending on the size of the property, the amount of renovation desired, and how the space is configured, it could easily take three years, or even four.

I wouldn’t necessarily insist that this project have three full years of operating costs in reserve before buying a property, some revenue will come in before the property is self-sufficient. But if it were me, I would insist on at least two years of costs in the bank regardless of revenue projections. Lightcone bought Lighthaven during a time of abundant funding. When the funding situation and Lightcone’s relationship with grantmaking organizations changed, it had to run two large community fundraisers. These fundraisers were successful and gave the Lightcone team legitimacy, a broad community endorsement of their plans and strategy. But the situation was still regrettable, there was a very real risk of losing Lighthaven, along with all the resources spent to renovate it to our specific uses.

If this project is worth funding, if its director has the faith of grant-makers or other philanthropists, they should get the resources to see their vision through. The project will almost certainly look like it is failing at the 15-month mark, with renovation timelines slipping and paid bookings still scarce. Even when renovations are done, it will take some time to build a reputation as the obvious choice for certain events. It would be a disaster, a huge un-forced error, if the director has to fundraise at those points just to complete the project. There should be checks on the director and the project, but that oversight should come from the board, not intermediate project fundraising goals. 

All this taken together, I estimate Minimum Viable Funding would require about $18 million in total. Roughly two thirds could be in the form of a mortgage, the rest would be a grant:

  • Loans: $12 million financed by a mortgage
    • $10 million purchase price
    • $2 million in initial renovations
  • Grants: $6 million
    • $1 million in startup costs - furnishings, equipment, electronics, sound, permits, legal fees.
    • $5 million in operating and capital cost runway (2.5 years).
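The breakdown above can be tallied in a few lines as a consistency check. The dictionary keys paraphrase the bullets, and the $2 million annual cost is the break-even figure from the Funding section; the names are illustrative.

```python
loans = {
    "purchase price": 10_000_000,
    "initial renovations": 2_000_000,
}
grants = {
    "startup costs": 1_000_000,
    "operating and capital cost runway": 5_000_000,
}

loan_total = sum(loans.values())
grant_total = sum(grants.values())
total_funding = loan_total + grant_total
loan_share = loan_total / total_funding

# Annual interest plus operating expenses, per the Funding section.
ANNUAL_COST = 2_000_000
runway_years = grants["operating and capital cost runway"] / ANNUAL_COST

print(f"Total: ${total_funding:,} ({loan_share:.0%} mortgage)")
print(f"Runway: {runway_years} years")
```

The totals line up with the text: $18 million overall, two thirds of it financed, and 2.5 years of runway at the break-even cost level.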


Leadership

In the course of my interviews, I promoted one item to the “required” list from the “nice-to-have” bracket. Again and again interviewees brought up the quality of the leadership, that a single person should be responsible for implementing and living with their decisions. Every interviewee stressed the need for a passionate founder engaged over the long term. Several interviewees also stressed unity of command, that decisions need to have some single person who owns the choice and cannot be overruled, short of the decisionmaker being fired. I found this perspective convincing. 

There will be part-time roles at a place like this, but the chief executive cannot be one. That person will need a rare combination of skills:

  • Decisiveness - comfortable making decisions on the fly under uncertainty.
  • Alignment - Understanding of Rationalist and EA ideas and culture, enough of a perspective and a strong enough worldview that they’re not convinced by the most recent strong personality they’ve met.
  • Taste - An eye for design in the physical world. Direct design work can be delegated, but they will need to have the ability to tell why one option works and another does not. 
  • Political Skill - Venues like this need to be used a lot for the economics to work. The leader needs to present as acceptable to many factions of politics, to make friends easier than enemies. Needs to see and deescalate social conflicts between staff, vendors, attendees, organizers, etc. Needs to have spent significant time in DC to know, intuitively, where the political traps and mines lie. The taste to know which battles to avoid and which are worth fighting regardless.
  • Energy - Days will be long, particularly when a conference is underway. 
  • Extroversion - Must genuinely like people and highly value the interactions between people who they may not agree with. They must value building a neutral institution more than direct work in their preferred cause area or speciality. 
  • Integrity - Will be trusted with tens of millions of dollars in property and accounts. It’s not enough to have integrity, they must also demonstrate it to grantmakers, philanthropists, their staff, and to a lesser extent even to conference organizers.

The leader of this project is the hardest constraint to satisfy. There’s a very short list of people well qualified in each of these categories; most have other jobs that they seem to like. I think there’s a longer list of up to a few dozen people who excel in most of these criteria, who with enough self-knowledge and humility could manage to cover for their lack in an element or two. 

This job would be incredibly rewarding, personally and professionally. This must not be a volunteer role. I believe the community would be willing to pay well, on the order of $200k or more depending on experience, to have this job done well. 

If the description above sounds like you, get in touch.


Cultural Fit

This property should appeal to more than just the rationalist community, as Lighthaven already does. AI companies already rent bars, restaurants, and conference space to do technical demos for government and NGO staffers. There are also centrist-leaning political movements with some popular support and donor interest that could be interested.

As Lighthaven is the cultural headquarters of the LessWrong community, a Lighthaven in DC could position itself as the cultural headquarters of the Progress Studies branch. This would give it a compelling raison d'etre: Lighthaven West generates the ideas, Lighthaven East gets them into the posting-to-policy pipeline. A focus on Progress Studies may also attract more interest from donors.

In policy spaces, opposing political factions interact socially more than many people outside DC would expect. Renting space to politically-relevant actors is tricky: we wouldn’t be as neutral as, say, a bar or hotel. But I think there is room in the center to rent to both the center-right and center-left, Anduril and Anthropic, without making too many enemies. 

Crafting this coalition, determining who this space is ultimately for, is not a one-time decision. It is decided day-by-day as the director makes a series of small decisions and actions that accrue into a reputation. Which booking gets the popular weekend? Which organizations get discounts? Who are the first people invited to co-work, the second wave that builds off of that founder effect? When there are inevitable fights about associating with people one side or the other considers bad, where do they draw the line? Who’s worth defending? Who should be excluded, despite their popularity? These choices cannot be realistically delegated, because one of the skills is noticing that an issue is culturally-relevant at all.

I described Lighthaven as a monastery because its cultural output is ultimately the point. This is also true of any potential DC version. Its director will need to understand and be comfortable wielding cultural power, as much or more than the temporal power over the space. And this path will have to be plotted largely without a map.


Name and Brand Positioning

“Lighthaven East” is a working title for the project, but should not be the name of the space itself. We will need to avoid anchoring too closely on Lighthaven’s culture and norms, for two key reasons. First, a lot of what makes Lighthaven work well is adaptation to its environment. It uses the climate well, and it plays off of norms the Berkeley community has spent over a decade honing. Second, the game is rigged: it’s exceedingly difficult to beat Lighthaven at being a Lighthaven. The new project will need to build its own coalition, niche, and animating spirit, so that it can be well adapted to its new environment.

I feel strongly that it should not be named for the lead donor. I think it’s reasonable for the donors to have input into the name, location, and aesthetics. But naming the venue after a donor is too far. It’s a bit tacky, implying a vanity project, but more importantly it weakens the authority of the Director. Besides, there’s a certain cachet that comes from discovering an open secret. Let people have the fun of asking “Huh, I wonder who’s behind [final name]?” and then finding out.

One name that we’ve begun workshopping is “Posterity Center.” It hits several notes: it focuses on a long-term perspective, references the Preamble to the US Constitution, and alludes to Bayes. It’s not quite perfect. It’s a little too long (it feels like four or five syllables should be the limit), but alternatives don’t quite work: “Posterity” doesn’t feel like it works on its own, and “Posterity House” feels slightly too informal. It also seems a bit too… something… maybe self-serious? I recall, however, that I first thought the name “Lighthaven” was too pretentious even for me, and yet it has certainly grown on me.

We welcome more name ideas and feedback, and I’ll create a parent comment to collect name discussion in one place.[3]


Ability to Scale

Twitter recently discussed a coming wave of non-profit demand and funding. Could we do more with more?

In short: yes, we could! 

The additional funding would have to come at the start for maximum leverage. When shopping for an initial site, more money gives many more options, whereas later expansion is constrained to the area surrounding the chosen site. A larger space would probably need more time to reach self-sufficiency, since I expect use of a space would scale mostly independently of its size, at least at first. Stable, committed funding would be key for larger plans; larger venues would come with at least proportional increases to the operating expense, renovation, furnishing, and runway budgets. There’s an argument that larger properties might need more-than-proportional increases in non-property costs, meaning that overall project cost scales faster than total square footage, given the harder path to self-sufficiency and higher risks to the core project. 

All that said, there are major advantages to larger spaces. Many benefits are obvious: more capacity for events and organizations, political proof-of-work (i.e., prestige), and more option value for smaller events to spread out, avoiding the deeply unpleasant coordination failure of everyone trying to shout over each other in a small space. Others are more subtle or speculative. Lighthaven aspires to be something like the Bell Labs of old, but notably lacks project space. More space means more room, literally, to experiment with function, such as allowing overnight stays, more permanent setups for meetups and organizations, different kinds or styles of renovation, even lab or maker spaces. I would not argue that a larger project is higher EV per dollar than a 12,000 sq ft one, but in success it would clearly be more impactful in total. In a hits-based giving model, more funding buys more potential upside, since any successful experiments could be scaled.

Another benefit: Capitol Hill has some unique features, and all else equal it’s the neighborhood we should prefer. There are lots of sites in the 5,000-8,000 sq ft range, too small for our use, and some appealing options closer to 20,000 sq ft, but Capitol Hill is comparatively weak in the 12,000 sq ft range. More ambitious funding would make proximity to Congress more feasible. 


Risks

This would be a high-profile endeavor. Because policymakers in DC would be a key target audience, failure would be much more visible and salient to them than the failure of any other comparably sized project. This sort of reputational damage is hard to quantify, hard to even describe, but very real. I do not think even a high-profile failure would be as damaging as the FTX collapse was to the community, but worst-case scenarios could approach perhaps a tenth of that reputational damage if handled poorly.

These risks can be mitigated, though not eliminated, with a strategy of “Don’t do stupid shit.” This is yet another reason why the director of this project needs an easy familiarity with DC culture and norms: what constitutes “stupid shit” is not always obvious to those who haven’t spent time inside the Beltway. We should aim to scale the project deliberately, and somewhat quietly at first. We should inculcate an appropriate institutional culture before inviting high-profile political figures. 

The organization, its leadership, and its funders would become the target of opposition research and would need to avoid certain kinds of impropriety. Yet mitigating this risk can, and often does, go too far. I personally worry that the EA community has over-learned the lessons of FTX, and tries too hard to appear normal. Those associated with the project should be and appear to be trustworthy, should be and appear to be reliable, but should not try to pretend their views are more mainstream than they in fact are. The point would be to promote outside-the-Overton-Window ideas in a high-profile, high-trust way, giving us more credibility when our ideas later prove right. The project shouldn’t sacrifice this key feature in an attempt to seem more politically palatable. We should not pretend this project is a normal thing to do. It should be weird, and edgy, and cool, just in a well-calibrated way. 

Basically, I’m saying we should have robes but only break them out for parties.

More mundane risks to the feasibility of the project include:

  • A lack of demand in corporate or institutional bookings.
    • In general, the project could prosper without certain categories of corporate bookings. If the hyperscaling AI companies instruct their staff to stay away, the project could still be successful. However, I do not think the project would fare well without any corporate bookings at all. 
  • Changes in political atmosphere that lead the community to abandon its push into DC.
  • Ideological capture by commercial or political interests.
    • Ideological capture is both an indirect risk to reputation and a direct risk to the project’s goals. This project should fill a neglected niche, rather than becoming a tech-flavored political club.
    • Too close a relationship with the AI labs could cause other organizations to pull punches or otherwise water down their messages.
  • Regulatory risk from the DC government.
    • The project should specifically budget for ways to be a good neighbor.
  • Site risk: mold, environmental remediation, etc. Even if hazard losses are insured, they would dramatically shift the project’s timeline.
  • Personnel risk: embezzlement, theft, scandal, etc.


First Steps

There are three key blockers for this plan: a founder, money, and a site. In an ideal world, a founder would step up, approach philanthropists and arrange financing, then simply buy the best option available for sale. Straightforward, sequential, looks great on a Gantt chart.

It may not be that simple. Three-way matching problems are notoriously difficult, because each of these elements feeds back into the others. Philanthropists may not be willing to commit until an ideal site comes on the market. Some potential founders may be more willing to work with certain philanthropists or organizations. Other potential founders may be more dependent on the site, perhaps more willing to run a smaller venue, or one that is Congressionally focused.

In practice, whichever constraint is filled first will exert outsized control. If someone has a site to offer, the rest must either cohere quickly or not at all… whether the site is the best available for our purposes will be de-emphasized. If a funder gets excited before a founder, we risk a muddled vision, inconsistent execution, and different departments optimizing for different goals. These problems can be avoided if the vision comes first.


If this calls to you…

the lightcone needs you to lock in son
https://x.com/uneventual/status/1991692767001735510


  1. ^

    Start with Every Bay Area Walled Compound.

    Then, Scott Sumner:

    For Jorge Luis Borges, paradise was a library. At nearly 70 years of age, I’ve found my paradise at Lighthaven, which recently hosted meet-ups for Less Online and Manifest over back-to-back weekends in Berkeley, California. I know of nowhere else on Earth where I can find so many interesting conversations in such a compact area.

    Scott Aaronson, writing "Guess I'm a Rationalist Now" after his first visit:

    The conference was at Lighthaven, a bewildering maze of passageways, meeting-rooms, sleeping quarters, gardens, and vines off Telegraph Avenue in Berkeley, which has recently emerged as the nerd Shangri-La, or Galt’s Gulch, or Shire, or whatever. [...] What I’ll remember most from LessOnline is not the sessions, mine or others’, but the unending conversation among hundreds of people all over the grounds, [...] It felt like a single conversational archipelago, the largest in which I’ve ever taken part, and the conference’s real point.

    TracingWoodgrains:

    Many more experiences over the weekend feel almost too personal, too meaningful, to shout to the open internet: dinners and meetings and conversations with people building local cultures so achingly beautiful they feel almost like dreams, conversations stretching late into the night, serendipitous meetings with longtime friends whose faces I can now put to names.

    Theo Jaffee:

    That magical feeling of serendipity, where you can flow through a space, passing from conversation to conversation, contribute to each one in turn, and have others do the same for you.

    For further design details, see this interview:


  2. ^

    Note that this and following are rough financial estimates suitable for judging feasibility. We can elide a lot of detail, for now, since Lighthaven serves as a benchmark for comparison. A full project or grant proposal would have much more detailed budgetary estimates.


  3. ^

    Comment here:

    https://www.lesswrong.com/posts/95NgkvZKJx8tJbtn5/lighthaven-east-a-feasibility-study?commentId=KM4rMPi62hkrBKh8Z




Barriers to a Prosperous Future

LessWrong.com News - June 1, 2026 - 00:40

The current race towards producing general artificial intelligence systems brings with it severe risks, yet no AI company developing frontier models is addressing these risks at a level proportional to the pace of development. The rapid integration of this poorly-understood technology into nearly all aspects of society is precarious at best, and catastrophic at worst. If progress trends continue, we will need a monumental level of investment in enhancing our robustness to these risks in the coming years. What follows is a summary of my understanding of these risks, a description of those most concerning to me, and finally what my personal plans are to mitigate them.

Types of Risk

It can be useful to categorize risks from advanced AI into three broad categories: misuse, misalignment, and systemic. Misuse refers to malicious actors (individuals, groups, or states) being enabled by AI systems to achieve nefarious objectives, for example by generating personalized misinformation at scale, hacking their adversaries' critical infrastructure, or building powerful weapons. Misalignment refers to AI models failing to truly acquire human values, leading to unpredictable and undesirable behaviours in out-of-distribution environments. Finally, systemic risks are those that arise from our complex and vital societal systems becoming dependent on a technology that we neither fully understand nor control, leaving us vulnerable to its unpredictable interactions with those systems.

Even AI researchers themselves understand shockingly little of how frontier AI systems reason and make decisions, and the rate of progress in this area is worryingly slow compared to the pace of development of AI capabilities, which is currently estimated to be doubling every 7 months, or less. [1]
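To make the doubling claim concrete, here is a small Python sketch of what a fixed 7-month doubling time would compound to. It assumes the trend holds exactly and indefinitely, which the text does not claim; the function name and the chosen horizons are illustrative.

```python
# Illustrative compounding under a constant doubling time.
# Assumption: the 7-month doubling estimate cited above continues unchanged.
DOUBLING_TIME_MONTHS = 7.0

def growth_factor(months: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Return the multiplicative increase after `months` of steady doubling."""
    return 2.0 ** (months / doubling_time)

# Horizons picked for illustration only.
for years in (1, 2, 5):
    factor = growth_factor(12 * years)
    print(f"After {years} year(s): about {factor:,.1f}x the starting level")
```

A shorter doubling time (the "or less" in the estimate) would make these factors larger, and any slowdown would make them smaller, so the output is a reference point, not a forecast.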

As for "aligning" models with human values, the best techniques these multi-billion dollar companies have developed are fundamentally surface-level and have, without exception, failed in various ways, including being bypassed by clever prompting, also known as "jailbreaking". Some worrying behaviours were discussed in Anthropic's own recent report on Agentic Misalignment. If we are to entrust critical parts of our society such as education, healthcare, cyber security, and political advising to these systems, the emerging sciences of alignment and control will play a crucial role in doing so safely.

Gradual Disempowerment

While the categories of misuse and misalignment are slowly gaining attention in public discourse, government, and academic research, in my view the most complex category, systemic risks, is currently largely neglected. Risks in this category can be subtle, developing quietly in the background, their consequences first becoming apparent when large parts of society are already dependent on these systems, by which point a large amount of the damage may already be done. Many of our critical societal systems are already complex (e.g., economies, governments, healthcare) and have been tuned over many decades to function robustly. The rapid interweaving of AI into these systems may make them harder to control and predict and lead to unintended consequences such as a reduction in human empowerment. A line of reasoning within this category that I find particularly concerning is known as gradual disempowerment, which refers to the incremental loss of human influence as a result of having more competitive machine alternatives to humans in almost all societal functions.

In the original work that introduced the term [2], the authors argue that as AI systems begin to represent ever larger shares of the labour market we might expect the economic role humans play to be reduced, and in turn so too the economic power they hold. Unlike with previous automation where humans could transition from more narrow to more complex work, AI threatens to claim all cognitive tasks, leaving humans with no higher, more cognitively-demanding roles to move to. Without labour, money will by default cease to flow to most individuals, potentially leading to drastically increased wealth inequality. Further, they argue, the economy has always been roughly tied to human preferences, where businesses only survive when they have a paying customer base. In an AI-driven economy, this tie may loosen significantly, leading to markets that cater to those systems, rather than to human values and preferences.

Additionally, they argue that as advanced AI systems become integrated into the creation and consumption of cultural artifacts, we could see our cultural norms be significantly disrupted, in a similar way to content creators today catering to "the algorithm", but amplified greatly. While previous cultural practices have always had an evolutionary pressure to in some way benefit humans, in a world where humans are no longer the only producer and consumer of culture, this "antibody" effect may be lessened, leading to potentially maladaptive cultural practices. Additionally, the apparently alluring promise of always-available hyper-personalized AI therapy, coaching, and even companionship could begin to outcompete humans, even if objectively lower quality and lacking emotional depth, simply due to the ease of access and adaptability of such systems. For a more in-depth discussion of this topic, see Kulveit et al. 2025 [2].

Rate of Development

How much time should we expect to have to solve these problems? One important factor to consider is that AI companies are likely to invest heavily in improving their models' abilities to carry out autonomous AI research, given the immense potential economic value of doing so [3]. Even if progress is initially impeded by bottlenecks like model unreliability, human approval, and limited compute and energy, the incentives to unblock these will be so great that before long we should expect solutions to be found. Frontier AI companies are well aware of the bottlenecks to their growth trajectories, and are working hard to pave the path towards training models that are orders of magnitude larger and more capable than today's [4].

This dynamic, known as "recursive self-improvement"—AI systems tasked with improving themselves—is already happening to some degree today, and it is likely to lead to an ever-accelerating rate of development of AI capabilities as more capable models provide ever-stronger "uplift" to human researchers. If models surpass the threshold of being capable of operating largely autonomously—that is, producing hypotheses, developing efficient tests of those hypotheses, and analyzing the results to make iterative improvements to themselves—we might experience an "intelligence explosion", wherein countless digital minds running 24/7 at superhuman speeds—a "country of geniuses in a datacenter" [5]—drive rapid progress in AI research, on a timescale we couldn't hope to keep up with. For further discussion on this topic, see Forethought's Three Types of Intelligence Explosion.

For the above reasons, there is a real chance that AI systems with human-level capabilities across all fields, often referred to as artificial general intelligence, or AGI, could be developed within the coming 5–10 years, with many estimates from AI researchers and forecasting experts converging around the year 2033 [6][7]. While previous technological revolutions developed at a pace that allowed humanity to gradually adapt laws, cultural norms, and education over the span of decades, the rate of change we can expect in an AI-powered future will be entirely unprecedented and force a significant reorganization of many parts of society, possibly in an astoundingly short timeframe. Therefore, it is imperative that we greatly increase investment in fortifying all parts of society.

Mitigations

In order to strengthen our defenses against these risks, we will need to devote historic amounts of capital and effort in the coming years. We will need thorough and continuous measurement of our reliance on AI systems to have metrics to guide discourse and to use as a basis for enacting critical policy. Additionally, we will need to conduct research on how we can use AI systems in a sustainable way that benefits us in the long run.

To this end, we greatly need more research organizations like Epoch AI, measuring and forecasting AI progress, and METR, conducting in-depth research such as that described in their Frontier Risk Report. Crucially, we additionally need much more research examining the usage and impact of AI on all parts of society, such as the Anthropic Economic Index, communicated widely.

Furthermore, we need to build tools that will make our society more resilient to shocks resulting from the integration of AI. Beyond these tools and improved alignment methods that enable training models that robustly behave in pro-human ways, such as refusing to display shallow imitations of affection, we will also need significant international regulation. AI legislation has to a large degree thus far focused on present-day harms such as deepfakes and disinformation [8]. What's needed on top of this is legislation that specifically mitigates disempowerment.

Can't we just pause development of frontier AI?

While some, such as Pause AI, argue for a global pause on or a mandated deceleration of frontier AI system development, I personally believe such a pause is likely unachievable, and not even necessarily a net positive. A pause could backfire by pushing AI development underground, to actors who are less concerned with safety and who would withhold progress from the public.

The challenges ahead are great, but so too are the potential upsides. We still have time to act to prevent the worst outcomes, but the window may be closing and much work is needed.

What follows is my personal plan, given the context above.

My Plans

My current plan is to contribute to mitigating the above risks in three primary ways: developing my ability to conduct technical research, fostering a local AI safety community, and exploring potential mitigations to gradual disempowerment risks.

Firstly, I will greatly develop my understanding of technical AI safety in the coming months in order to get a deeper awareness of the best tools we have developed for understanding and controlling frontier models. This is where I can best leverage my career experience, though my long-term focus may shift after this period.

Secondly, I plan to continue facilitating and growing a thriving local community of concerned individuals in order to spread knowledge, enable networking, and gather a wide array of perspectives. [9]

Finally, I plan to explore the ways I can contribute to building mitigations against disempowerment. This project may initially be developed either during a fellowship or independently. Some preliminary ideas include: building trustworthy open-source coordination tools for both humans and AIs or developing further the ideas proposed in Gradual Disempowerment, focusing on other societal systems. If promising, I can imagine founding a research organization that would work to further develop these mitigations.

Last updated: 2026-05-31

  1. ^

    METR: Time Horizon 1.1

  2. ^

    Kulveit et al.: Gradual Disempowerment, January 2025

  3. ^

    Situational Awareness, Leopold Aschenbrenner: From AGI to Superintelligence: the Intelligence Explosion

  4. ^

    OpenAI: Stargate

  5. ^

    Dario Amodei: Machines of Loving Grace

  6. ^

    Metaculus: When will the first general AI system be devised, tested, and publicly announced?

  7. ^

    80,000 Hours: When will AGI arrive?

  8. ^

    EU AI Act

  9. ^

    Stockholm AI Safety




Notes on axes of variation in third-party risk assessment

LessWrong.com News - May 31, 2026 - 23:48

There are many different activities that could be described as "third-party risk assessment". Here are some distinctions that I’ve found helpful thinking about the space over the last few weeks.

(Thanks Ajeya Cotra and Paul Christiano for discussions that inspired most of this.)

Throughout this, I refer to the actors as:

  • Developers.
  • Stakeholders. These are the people who want to be informed about risks. Possible stakeholders include: governments, the public, the developer's board, the developer's employees.
    • The choice matters because one of the roles of an auditor is to review confidential info that they do not directly disclose to stakeholders; they only tell them their conclusions. This role is more important the more concerned the developer is about disclosing confidential information to the stakeholder.
  • Third parties. I don't know a better term for "independent actors who contribute in various ways to a stakeholder's understanding of risks through producing and evaluating evidence and/or arguments". Like, it's weird to call the physical security pentesting firm a "risk assessor". And AI Lab Watch isn't really an "auditor". And "evaluator" makes it sound like they run model evals.

The next step in the analysis will be to think about different objectives of third party risk assessment and think about how those interact with these axes of variation.

Axes

Fact generation vs evidence analysis

This is maybe the one that comes up most. I'll define:

  • A fact-generation assessment is trying to answer some narrow question to produce a fact that will later be used to evaluate risk.
    • Examples:
      • Dangerous capability evaluations, e.g. METR and UK AISI's capability evals.
      • Evaluating the robustness of a safeguard, e.g. classifier robustness redteaming or the David Rein red-teaming project.
      • Pentesting to measure security.
    • For many of those, a core reason why it makes sense to have a third party do them is that the process is sort of fundamentally adversarial—the final evidence will be of the form "I tried to demonstrate danger, but failed", so you're really depending on the person trying hard, and it's structurally easier to be confident that someone's not sandbagging if they're an unconflicted third party. Some cases are like monitor redteaming or pentesting where it's obviously adversarial; others, like dangerous capabilities evals, are cases where you're relying on the evaluator to do not-particularly-adversarial optimization (e.g. you need them to pick good hyperparameters for their fine-tune to elicit CBRN capabilities).
      • If our claim is "you need a third party to do the optimization every time that you argue for safety by saying an optimization attempt failed", this maybe has broader implications; I'm not sure people are applying this consistently.
    • For the others that reason isn't present. For example, for a cyber capability eval where the AI company basically does no extra elicitation, it seems like it would probably be fine if the AI company just ran the evals themselves—the produced fact isn't rendered particularly suspect by the fact that it was done in house.
    • There are some other reasons that it might make sense to have a third party do the eval:
      • They have expertise the AI company doesn't have in house. This totally makes sense in principle; it could totally make sense to centralize the expertise somewhere outside the company so that the AI companies can share it.
      • The eval requires sensitive info that the evaluator doesn't want to give the AI co. E.g. sensitive CBRN stuff? This makes some sense.
        • Eval scores can generally be increased or decreased by a developer if they have access to the dataset, because they can train/iterate against it. But in many cases the developer can also game eval scores even without access to the dataset; e.g. they can probably mess with alignment evidence by training models to act aligned in weird situations. If an eval score won't be valid evidence without separately understanding whether the AI developer did anything that would compromise it, it's not clearly much worse for the AI company to have the eval themselves and run it themselves.
  • An evidence-analysis assessment tries to combine many forms of evidence into an overall judgement about a high-level question. The central example of such a question is "how much risk to the world does this deployment pose".
    • Examples:
      • METR's review of Anthropic's sabotage risk report. (The sabotage risk report analyzes a narrower slice of risk, but it is an analysis that draws together many different facts to answer a natural question that's pretty close in the causal graph to "how much total risk is there". Unlike e.g. "how robust was this monitor", which requires a bunch of additional context to be interesting.)
      • As a narrower example, our review of the OpenAI CoT training was us synthesizing a body of evidence into an overall assessment.
      • This is what auditing is in the context of corporate accounting—they look at your books and ask you questions to assess the overall financial state of the company.
    • Within evidence analysis, there's a distinction between reviewing a developer's argument and producing your own.

I think that evidence analysis vs fact generation are pretty different and it's not obvious that any projects or organizations should do both. The skillsets seem plausibly very different. Like, I suspect that during the main risk period, only a pretty small subset of the facts relevant to the risk analysis should be generated by third parties (as opposed to being reported by the company). So someone doing evidence analysis will naturally be operating several steps in the "evidence tree" away from someone generating a fact. And so I'm not sure there's that much synergy for these happening as part of the same project. E.g. I don't think it would be particularly useful for the security pentesters to work at the same org that does the overall assessment.

I think it's pretty weird that these are conflated so often (e.g. in lots of the widely-cited papers about AI auditing or evaluations, like Frontier AI Auditing and Model Evaluation for Extreme Risks); they seem really different to me.

Laundering private evidence into sharable conclusions?

One function of risk analyses is to inform a stakeholder about the risks posed by a developer. One strategy is to have an auditor who accesses private information from the developer and uses it to answer a question, and then gives the answer to the stakeholder without telling them the private information that led to that conclusion. The other option is to do the risk analysis just based on information that the stakeholder already knows (or is cleared to know).

In this post, I call this “evidence-laundering”. I don’t mean this with a negative connotation and will probably choose a different term in future because the negative connotation is so strong. I just wanted a distinct term that emphasizes that information is being processed from private evidence to public conclusions to reduce disclosure of commercially sensitive information.

I think I want to define this as: did the conclusion of a report rely on facts that were not stated in the report? So it's not evidence-laundering if you report a straightforward fact that your readers need to believe that you're not straightforwardly lying about because they can't personally check it. But it is evidence-laundering if you don't report the facts and instead report "based on private facts, things look ok".

Examples of the evidence-laundering strategy:

  • Pen-testing. It's totally normal to have a pen-tester say "we looked for vulns; I won't tell you what they are but here's how bad they were", and for the pen-tester to not tell the stakeholder various sensitive details.
  • David Rein's pen-test of the Anthropic monitors.
  • METR's review of the Anthropic sabotage risk report.
  • Lots of private evaluations in other parts of society. In courts, a psychiatric evaluation involves the psychiatrist talking privately to someone and then stating the conclusion publicly without fully justifying it.

Examples of the just-public-evidence strategy:

  • Our review of the OpenAI CoT training thing. The workflow here was that they shared the draft with us a week or so before they published it, and we went through a few rounds of me complaining that the draft didn't have some important info that I needed to be convinced of their conclusion, and they iteratively added more info to the draft until I was pretty happy. So even though there was some temporary secrecy, we were fundamentally not in a laundering role: the thing that we released was based entirely on information that any reader of our blog post could read themselves. We were under our standard NDA, so they could have told us stuff during this process that we wouldn't have been allowed to repeat.
  • AI Lab Watch. I think Zach sometimes emailed AI companies asking them questions that he said would affect their rating, and maybe they sometimes answered? As long as this works via him entering their answers into the public record, he's not playing the evidence-laundering role.
  • The METR Frontier Risk Report is almost entirely this. Their report is entirely based on info that AI companies agreed to share publicly, except that some of these facts are published without attribution (that is, METR doesn't say which company said something), so METR is playing the evidence-laundering role of verifying that an AI company really did say something without revealing who said it.
    • This is an interesting case because METR collected a bunch of non-public info during this project, but the only way this entered their report was by companies agreeing to make subsets of the info public.
  • Evidence-laundering is a central part of what "Frontier AI Auditing" (2026) means by "auditing", which writes: "Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential".

The core strategic question here is definitely "when should you use evidence-laundering vs transparency". 

  • I think that transparency is probably better all else equal? Evidence laundering should probably be reserved for cases where it's extremely defensible for the developer to not want to publicly release certain info. I am extremely sympathetic to this for pentesting or other things where there's an imminent cost to releasing the full evidence. I am less sympathetic to e.g. developers wanting to keep the capabilities of unreleased models secret.
  • But I think it might be pretty good if we can rely on transparency as much as possible.
  • In particular, I think that transparency maybe makes it easier for risk evaluations to differentially pressure AI companies to do better. I am excited for projects like AI Lab Watch that do risk evaluations that compare the AI companies. This might piss off the AI companies, so it might be hard for them to have an evidence-laundering relationship. Their ratings will be more legitimate if they have the info they need to make their assessments. AI Lab Watch has the stance "if you don't give public statements explaining what you're doing on issue X, we'll assume you maximally suck"; perhaps stakeholders would think that it's reasonable to expect developers to produce public evidence of safety and therefore the developers will publish their evidence publicly. This allows us to have a robust mechanism for between-company pressure on safety for longer than if we tried to do all this without transparency, using evidence laundering.

A few notes:

  • Reporting a score against a benchmark the evaluator built but hasn't released — "DeepSeek scored 40% on our private benchmark" — isn't laundering of the developer's information (you know the claim and whom to blame), but it does ask the reader to trust, sight unseen, that the evaluator built a reasonable benchmark. The developer itself may not even have access to check it. Releasing an i.i.d. subset of the benchmark resolves most of this.
  • A report has to disclose how it looked. Every conclusion implicitly relies on the unstated fact that nothing else relevant turned up. So a clean report should say how it searched for other kinds of facts — including disconfirming ones — and report that there weren't any (or what it found). A report that presents its supporting facts and stops is laundering the adequacy of its own search.
  • Whose secret? Company vs auditor laundering. It helps to split laundering by who owns the withheld fact. Company laundering hides facts about the developer's own system — details of security systems that a pentester is attacking, or training methods that an auditor wants to know about. Auditor laundering leans on something the assessor holds and won't share: a particularly strong example of this is hazardous knowledge like bioweapon-uplift details. Note that auditors very often have some information that they don't want to publish. Note that needing to trust the auditor did a competent job — e.g. elicited hard enough — is not laundering at all; that's ordinary trust in a person, not a hidden fact. The table below splits these into two columns.

Incentive compatibility vs calibration

Should you give bad scores to developers if they don't give you sufficient access, or should you just use your best guess? If you do the latter, then anyone performing surprisingly poorly is better off not disclosing this. But if you do the former, then your risk assessments can’t be taken at face value by third parties, and it’s easier for AI developers to discredit you by saying “that org has no idea what’s going on, obviously we’re way safer than they think”. This probably works especially well when they’re trying to discredit you to their employees or other groups who have more access to private info. As an example of that dynamic, AI Lab Watch initially rated Google poorly on security due to them not disclosing much about their security, and this led some GDM people to say they thought it was less credible or usable.

As I said in the previous section, it's maybe rough to get stakeholders on board with risk assessments that assume the worst, but maybe it's doable and I think we should plausibly try to achieve this.

Current risk vs preparedness

Are you answering a question like "are the current risks adequately handled" or are you answering "is this developer on track to handle risks later"?

The downside of analyzing current risks is that they're inconsequential and will probably be inconsequential up until shortly before they're severe.

The downside of analyzing preparedness is that you have to be much more opinionated about futurism and threat models; your reports will rely on assumptions that are much more contentious.

Cross-developer comparability

Are you trying to make it easy or hard to compare between companies?

  • AI Lab Watch and friends are trying to make it easy. The advantage of this is that it might pressure companies to behave better or shift resources towards better companies (e.g. because people prefer working at companies they believe to be more responsible).
  • METR's FRR went out of its way to make it hard. The advantage of making it hard is that companies are more willing to tell you stuff. (One way of thinking about this is that if you remove comparability, scary stuff the AI companies say has costs that are split among them and their competitors, which is way better for them on net than the costs being concentrated on them. In practice, AI companies also get some cred from safety people for producing evidence of scary stuff, so this effect hasn't mattered that much.)

I think that cross-developer comparability makes much more sense if you're doing evidence analysis, because evidence-analysis assessments are more comparable between companies.

So kind of obviously, I think that the choice between these is basically a tradeoff between different goals: if you mostly want to pressure AI companies to behave better, then comparability is good; if you mostly want to inform stakeholders about the overall level of risk, then comparability is probably bad if you were also trying to do evidence laundering, because it makes it more costly for developers to share info with you.

Examples, classified against the axes above

| Project | Fact gen vs evidence analysis | Company laundering | Auditor laundering | Current vs preparedness | Cross-developer comparability | Reviewer vs producer |
|---|---|---|---|---|---|---|
| Dangerous-capability evals (METR, UK AISI) | Fact gen | No | Yes | Now | Easy | Producer |
| Classifier-robustness red-teaming for misuse prevention | Fact gen | No | Yes | Now | Easy | Producer |
| David Rein's red-team of Anthropic monitors | Fact gen | Yes | No | Now | Hard | Producer |
| Security pen-testing | Fact gen | Yes | No | Now | Hard | Producer |
| CAISI DeepSeek evaluation | Fact gen | No | No | Now | Easy | Producer |
| Apollo in-context scheming evals | Fact gen | No | No | Now | Hard | Producer |
| METR review of Anthropic's sabotage risk report | Evidence analysis | Yes | No | Now | Hard | Reviewer |
| Redwood review of OpenAI CoT training, External review of DeepMind scheming-inability safety case | Evidence analysis | No | No | Now | Hard | Reviewer |
| METR Frontier Risk Report | Both | Slight (non-attribution) | No | Now | Very hard (deliberately) | Producer |
| AI Lab Watch, FLI AI Safety Index | Evidence analysis | No | No | Both | Easy | Producer |
| SaferAI risk-management maturity ratings | Evidence analysis | No | No | Prep | Easy | Producer |
| GovAI third-party compliance reviews (proposal) | Evidence analysis | Yes | No | Prep | Med | Reviewer |
| Brundage et al. AAL framework (proposal) | Both | Yes | No | Both | Easy (via AAL scale) | Both |


The columns are far from independent: the whole table can be recovered by a short chain of single-question splits. The tree below is the chain that minimizes expected questions-to-classify. Note that auditor laundering occurs only at the fact-generation end, and evidence-analysis assessors only ever launder company secrets or nothing.





The main impact from automated AI production: concentration of power?

LessWrong.com News - May 31, 2026 - 23:42

There’s a lot of talk about automated AI R&D and the like. It’s been discussed since at least 1965 when statistician I.J. Good coined the term ‘intelligence explosion’:

an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make. — Good, 1965, Speculations concerning the first ultraintelligent machine.

Since the 2000s, a few researchers (recently increasingly many) have taken seriously the prospect that some kind of intelligence feedback mechanism could kick in around the time that AI becomes competent enough to contribute to its ‘own’ (or rather, its successors’) development.

This is increasingly recognised as a prospect we should take seriously. It’s been discussed somewhat furtively by leading AI developers for many years, but in late 2025 it became a public talking point. Sam Altman, CEO of OpenAI, recently set a target of fully automated AI-improving-AI by 2028. Jack Clark, co-founder of Anthropic, writes ‘AI systems are about to start building themselves’.

I’m not going to address the resulting prospect of speedup here, which I think is real, though perhaps surprisingly modest.[1] (If, like some of my friends, you think I’m underweighting this prospect, you should probably be even more concerned about concentration of influence.[2])

Rather, this note is briefly making the rather simple (but I think rather overlooked) observation: plausibly automated AI R&D would enable a (perhaps severe) concentration of influence… and this could be the most important effect to be paying attention to.

It’s a fairly obvious point once made. A more thorough analysis might shed light on how likely this is, the most important consequences, and any leading indicators to pay attention to. I’ll only make some starting gestures toward these.

As I presaged in Engineering a Safer World,

R&D automation…: fewer human participants means fewer whistleblowers, less internal scrutiny, less governance and decisionmaking robustness. More concentration of that influence. That could mean more single points of failure. It’s famously difficult to maintain a conspiracy of more than one or two people: ‘two can keep a secret if one is dead’, as they say! And besides conspiracy, compared to larger teams, individuals and small groups may be far more susceptible to capture, coercion, corruption, or plain foolishness and rash decisionmaking.

Whose influence?

To be clear, you don’t need to be concerned about any specific people currently in positions of particular influence to think that concentration could be very impactful (and concerning).

The wisest supreme leader is still a single point of failure[3]. Especially during a period of tumult, leaders can be displaced and reporting structures revolutionised. Concentration of influence — even the prospect of that concentration — simultaneously raises the incentives for corruption and power struggle (while raising the stakes for the rest of us). Meanwhile, power corrupts and a centralised decisionmaker can rarely account for all the relevant considerations, even if they mean to. In AI, we’ve already seen company leadership falling out over differences of opinion, fighting it out in the boardroom and in court, and even clashing with governments. Turf war over the means of production of AI (which may increasingly equal the means of production full stop).

The point is that losing checks and balances can be a problem whoever is at the top[4]. As in politics, so in the heights of industry — which, today, and increasingly, firmly includes AI.

Suffice to say that I imagine the kinds of roles in the AI oversight chain which might be (nominally, initially) preserved longest are senior research directors, company executives, and closely-involved regulatory or executive government officials.

Perhaps more important: whose influence lost? Directly, the skilled human labour which today is involved in researching, engineering, and deploying frontier AI systems. Via them, their wider research networks, social ties, and any societal oversight they might have provided (e.g. whistleblowing to journalists or authorities). Plausibly also the marketers, account managers, even lawyers and others who interface with what a company is doing with its AI products, and their networks. (If not themselves replaced, they’d lose collegial understanding and influence via the replaced researcher-implementers). In short: everyone else.

Sketched using claude.ai

Concentration without (overt) malice

Why shed the researcher-implementers, the deployment engineers, and others (or sideline them from the important ‘real’ responsibilities)? It doesn’t have to be a power-grab, at first (or even at all) — it’s just good business! Human employees (especially software experts) are incredibly expensive compared with AI.[5]

Beyond this, analogously to areas where AI already surpasses humans (like chess), human participants may positively get in the way, slowing and compromising what would not only be a cheaper, but a more effective AI-only workflow.

A company might choose to instead eat this cost — perhaps it prefers to exhibit loyalty to current staff, acknowledges concerns around erosion of human oversight, or is inherently conservative in that way. But in a competitive environment (either commercially or in strategic and security terms), that may be tantamount to inviting defeat or irrelevance, if competitors are charging ahead.

At some point shareholders and other stakeholders might think it irresponsible not to embrace a replacement AI workforce. Ultimately this could add up to far fewer people having insight into what’s going on or a say in its direction. Even if you can’t (or don’t want to) immediately fire them, you might deliberately or even passively sideline your human employees as more and more work is carried out by machines.

Influence over what?

Notably, boards, investors, auditors, journalists, and regulators (who already often suffer an information scarcity), and the broader society they represent, could also be substantially cut out if logging, analysis, PR, and incident reporting are automated… This might only require the usual levels of corporate paltering rather than special malice. It might even be prosocially-motivated if the relevant actors are (justifiably or not) concerned about a power-grab by government. (It could of course also be deliberately selfish.)

Conversely, a government or regulator which strong-arms its way into the midst of the development process might itself become the locus of concentrated influence. (This in turn might be superficially or genuinely motivated by national security concerns.) The more automated a firm, the easier it may be to seize, either overtly or through more subtle channels. When employees are critical pieces in a production process, that’s a taller order.[6]

Sketched using claude.ai

I’m not suggesting any of this leads (immediately) to influence over all the world (though if you do take over the world, please consider my friend Owen’s advice for what to do if you find yourself in that situation). Nor even over an entire economy or state. But minimally, this looks like highly concentrated influence over a frontier AI company: presently a rare, increasingly lucrative, and burgeoningly politically-influential machine.

Activity at that frontier — the decisionmaking by leading AI developers — stands a chance of being among the most economically, politically, and strategically important activity in the coming years. If the cat is out of the bag with AI, that frontier, steered wisely, might be an important part of a societal resilience and readiness process (compare the recent Project Glasswing in cybersecurity). Used foolishly or even maliciously, the frontier might degrade societal capacity right at a time when it could be most important.

Influence… or power?

To repeat: I don’t think it’s given (or even particularly likely) that the starting possessors of a hypothetical concentration of influence would remain in control. From the point of view of a rogue AI, for example, control over a frontier AI company might be a delicious opportunity for expansion, defence, and societal dependency — ideal routes to escalated un-unpluggability. Likewise from the point of view of a power-hungry politician or corporate psychopath. Notice that corruptibility and coercibility are perhaps more plausible with fewer hands: it’s less practical to threaten a whole team or company than a single person.

So if this hypothetical automation-driven concentration of influence is a concentration of power, I don’t expect the ultimate wielder to be worthy of it. This marks a true difference of opinion with some others I know, who might think it more likely that, say, some current AI company leader could both accrue and keep hold of this level of power, and also use it wisely.

The obvious paths to hard power via AI are in surveillance — newly scalable with AI analysts scrutinising as many people’s movements as desired — and in development and automated production of flexible, high-range tools of coercion like weaponised drones. Other kinds of power might come from positions of economic leverage or narrative and propaganda dominance, conceivable with advanced-enough AI and without societal defences.

Concretely, what might be different?

Let’s really quickly gesture at two cases which might be substantially different in a world where AI production is largely automated.

Imagine a near future: frontier AI company revenues are in the 100s of $billions or $trillions, AI services are used in civil services, executive decisionmaking, faith leadership, … That’s a lot of points of leverage for subtle manipulation. Suppose a leader (selfishly or under pressure) intends to inject secret loyalties into their products. If they’re sole controller without checks and balances, that could be trivial, with obvious concerning consequences. Without automation-replacement, there are perhaps dozens or hundreds of people with reasonable oversight per major release, and several people even for any given small change. That’s a far more difficult system to twist.

Alternatively, consider a company on the brink of truly autonomous, general, human-surpassing AI: a huge responsibility. Many ambitious but underwrought alignment targets have been suggested for such a project, many of them foolish, perhaps fatally so. A hubristic leader without serious checks might withdraw from critique, double down (perhaps sycophantically egged on by AI toadies), and drive through a ruinous agenda. Society at large might be none the wiser (or at least insufficiently alert and able to intervene). A leader willingly taking risks (perhaps with a selfish or misguided vision of unfathomable rewards) would be even more of a problem. On the other hand, a leadership more dependent on staff and teams may be subject to more psychological and cognitive support, scrutiny, moral encouragement, and so on. A literal healthcheck. Of course, human teams can be subject to concerning groupthink (not unheard of at AI development companies!), but the more off-base a suggestion is (ethically or pragmatically), the more likely it is that the collective intelligence of the team corrects it.

Any number of scenarios can be considered; rarely do they appear more promising from a societal perspective if a single or a few leaders are insulated from scrutiny and critique.

What’s the outlook?

It looks very unclear to me. Plausibly it’s in the balance and decisions and conversations now will shape how this plays out!

There are some reasons to think that (near-) automation of AI production could instead produce equal or even improved societal oversight.

For one, the sometimes unintuitive effect of automation is to increase per-worker productivity, making it more worthwhile to bring more humans into the process (even on myopic financial grounds). Historically, though, this effect has often presaged a sharp decline once automation reaches a sufficient level. This might make concentration actually more difficult for a time. I don’t expect this to play out for very long in AI frontier research[7], but I’m somewhat uncertain about this.

Directed sensibly, a glut of researcher-grade AI worker-equivalents could be tasked with analysis, scrutiny, decisionmaking robustness, even auditing and whistleblowing. There’s nothing in principle preventing this, for any given level of trustability in our AI. Perhaps we’ll end up bringing more people into the process in the increasingly important, complementary roles of oversight, decisionmaking, and direction-setting. These folks could be made more effective with AI tools and assistance. That’s a decision to be made.

As I suggested earlier, leaders perhaps soon faced with these temptations may balk, either at the moral prospect… or at the pragmatic prospect of being subsequently replaced, captured, or corrupted in the ways I hinted at.

Perhaps it’s a tenuous balancing act. Internally to a given AI producer, we might want to guarantee a large enough ‘in the loop’ workforce that we can trust internal deliberation and whistleblowing and so on to keep things on the rails. Outside that, we might want to avoid no-holds-barred proliferation of AI-producing capability — but we probably want at least a few developers to keep each other in check and reduce monopoly or kingmaker dynamics.

I don’t expect any such concentrations of influence to play out overnight. Loss of control is a process, not a moment. AI development organisations — and society at large — should be paying attention and having the necessary conversations. Much to look out for, and perhaps much to look forward to!

  1. ^

     Task-relevant data and compute look like the more biting bottlenecks, and research taste accrues mainly through expensive experimentation.

  2. ^

    (unless you expect such an accelerated singularity that prior influence is all that matters)

  3. ^

    And may I be forgiven for slandering the current crop of potential supreme leaders as not very wise

  4. ^

    I would note that we usually need some concentrated executive and representative power in various roles in society — but we want those roles to be sufficiently monitored, and we want the processes which move people into (and out of) them to be sufficient to produce worthy selections.

  5. ^

    This ‘humans expensive, AI cheap’ point is worth taking some care over. Contemporary frontier AI training is getting more expensive by the year. Running the best AI can be increasingly expensive, because ‘thinking harder’ is an effective way to scale capabilities on current margins. But once a capability is unlocked at the frontier, it typically becomes rapidly and radically cheaper, due to ongoing rises in compute efficiency and the ability to distill long thoughts into quicker reflexes. Further, you only need to train AI to do something once, and then it can be copied and run as many times as you want, it doesn’t need to sleep or take family time or get sick etc.

  6. ^

    For example, witness the periodic furore among tech company employees when their work is used toward military ends (latest).

  7. ^

    One reason is that this research does not scale very well in parallel, so returns diminish steeply per worker — and we’re already imagining AI that can substantially slot in as a cheaper replacement worker in most cases anyway.




A Song About No

LessWrong.com News - May 31, 2026 - 23:40

When Lily was about two she told me she wanted a song about "no". This was ten years ago, and I don't remember why she wanted this, but I made something up:

This is a song about no.
This is a song about no.
This is a song about no, no, no.
This is a song about no.

The song goes... No no no no no no no no no.
The song goes... No no no no no no no no no.
The song goes... No no no no no no no no no, no, no.
This is a song about no.

It's useful anytime the kids want a song about something I don't know a song about. For example, it often served as a song about "turtle". Of course the more syllables the subject has the harder it is to sing, but that just makes it more fun.

[Embedded YouTube video]

I applied my nascent music writing skills, and tried to set it down in dots:




Visualize Cyclical Structure in Llama Model

LessWrong.com News - May 31, 2026 - 22:39

Summary

Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I find it worth exploring what structures we can find for different concepts in LLMs. I made a tool that lets anyone generate and visualize 3D structures for an arbitrary concept. The tool currently supports cyclical and smooth structures, extracting activations at any given layer of the Llama-3-8b model. Check it out here:

https://neural-geometry.vercel.app

Introduction

Traditionally, mechanistic interpretability has focused on steering models with linear direction vectors. Underpinning this is the Linear Representation Hypothesis, which states that all features in language models are collections of sparse linear directions. While linear activation steering has proven useful for shifting model behavior, it can also produce unexpected behavior. When the magnitude of the scalar applied to the steering vector becomes too large, the language model often breaks down, producing garbled and incoherent text. Personally, I do not think that the linear representation hypothesis is sufficient to explain what is going on mechanistically in these large language models.
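To make the steering operation concrete, here is a minimal sketch of the arithmetic only, using random vectors as stand-ins for a real activation and a real concept direction. The names `steer`, `cosine`, and `alpha` are illustrative assumptions and do not come from any particular library. It shows how the steered activation drifts away from the original as the scalar grows; it does not claim to explain why a model's output degrades.

```python
import numpy as np

def steer(h, direction, alpha):
    """Add alpha times the unit-normalized direction to activation h."""
    unit = direction / np.linalg.norm(direction)
    return h + alpha * unit

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
h = rng.normal(size=4096)  # assumption: stand-in for one hidden-state vector
v = rng.normal(size=4096)  # assumption: stand-in for a concept direction

# As alpha grows, the steered vector is dominated by the direction
# and shares less and less with the original activation.
for alpha in (0.0, 8.0, 64.0, 512.0):
    print(alpha, round(cosine(h, steer(h, v, alpha)), 3))
```

With a real model, `h` would be a hidden state at some layer and `v` a direction found by probing or contrastive prompts; both are replaced by noise here so the sketch runs on its own.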

Recent research from Goodfire has focused on exposing the smooth geometric structures that form for certain ordered concepts, such as days of the week, seasons, age, temperature, and more. Recurring ideas like days of the week and seasons form smooth cyclical structures, whereas linear concepts such as temperature and age form arcs. The work here is directly inspired by Goodfire's paper and research.


Extracting Activation Vectors

One of the most important steps in this pipeline is extracting meaningful activation vectors from carefully constructed prompts. For example, Llama may be prompted, "What day comes 3 days after Tuesday?" or "What day comes before Thursday?" Ground-truth labels are simply the correct answer to the prompt. Activation vectors are extracted from the last token to capture maximal semantic meaning.
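As a rough illustration of what prompt and ground-truth pairs of this kind look like, the sketch below builds them for days of the week from a fixed template. The template wording and the `make_prompts` helper are assumptions made for this example, not the tool's actual prompt generator.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def make_prompts(max_offset=3):
    """Return (prompt, label) pairs; each label is the correct weekday."""
    pairs = []
    for i, day in enumerate(DAYS):
        for k in range(1, max_offset + 1):
            unit = "day" if k == 1 else "days"
            # Modular arithmetic wraps around the week in both directions.
            pairs.append((f"What day comes {k} {unit} after {day}?",
                          DAYS[(i + k) % 7]))
            pairs.append((f"What day comes {k} {unit} before {day}?",
                          DAYS[(i - k) % 7]))
    return pairs

pairs = make_prompts()
for prompt, label in pairs[:4]:
    print(prompt, "->", label)
```

Each prompt would then be run through the model, and the last-token activation at the chosen layer stored alongside its label.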

The neural geometry tool used the GPT-4o-mini model to generate a multitude of prompts with ground-truth labels for any given concept. These prompts are often favored for cyclical and linear concepts, where semantic meaning can easily be captured. One potential issue with this style of prompting is that our choice of words may itself impose a cyclical structure where there otherwise would not be one. For example, a prompt such as "What day comes two days before Wednesday?", repeated across every weekday, walks the model around the week by construction. It would be interesting to explore whether the same structure emerges when a different style of prompting is used.
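The prompt style described above can be sketched with a simple template. This is only an illustration: the tool itself uses GPT-4o-mini to write its prompts, and the function name `cyclic_prompts` and its parameters are hypothetical. It shows how each prompt is paired with its ground-truth label by stepping around the cycle.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def cyclic_prompts(items, unit, max_offset):
    """Return (prompt, ground_truth_label) pairs for a cyclical concept.

    Hypothetical template generator, standing in for the LLM-written
    prompts used by the actual tool.
    """
    n = len(items)
    pairs = []
    for i, start in enumerate(items):
        for k in range(1, max_offset + 1):
            plural = "" if k == 1 else "s"
            # Step forward around the cycle; the label wraps with modulo.
            pairs.append((f"What {unit} comes {k} {unit}{plural} after {start}?",
                          items[(i + k) % n]))
            # Step backward around the cycle.
            pairs.append((f"What {unit} comes {k} {unit}{plural} before {start}?",
                          items[(i - k) % n]))
    return pairs


pairs = cyclic_prompts(DAYS, "day", max_offset=3)
```

Because every label is produced by modular stepping, the template makes the concern above concrete: the cyclical relation is built into the prompts themselves.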

Projection Down to 3D

Raw activation vectors are first projected down to a 64-dimensional space. Here, the vectors are averaged for the same ground truth labels to produce centroids. These centroids are more stable in terms of their semantic meaning than any individual vector alone.

We do not project directly down to 3D; 64D serves as an intermediate step. The purpose of projecting down to 64 dimensions is to remove most of the noise that exists in a high-dimensional space such as the original 4096 dimensions. Important semantic meaning may only be encoded in a small number of dimensions. Computing centroids in 64D is also much more computationally stable than computing them in 4096D. The choice of 64 is somewhat arbitrary; 32 and 128 can also work just fine.

A smooth curve is fitted through the centroids that are produced in 64 dimensions and sampled at 300 different points. The resulting spline curve is then projected down to 3D via PCA. This is the smooth curve that is displayed on the website after the full pipeline is finished.
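A minimal sketch of the three steps above, run on synthetic data rather than real Llama activations. Several choices here are assumptions, not the tool's confirmed implementation: PCA is used for the 64D projection (the post only names PCA for the final 3D step), a closed Catmull-Rom spline stands in for whatever curve fit the tool uses, and the hidden size is shrunk from 4096 to 256 to keep the example fast.

```python
import numpy as np


def pca_project(x, k):
    """Project the rows of x onto their top-k principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def label_centroids(x, labels):
    """Average the vectors that share a ground-truth label."""
    labels = np.asarray(labels)
    order = sorted(set(labels.tolist()))
    return np.stack([x[labels == lab].mean(axis=0) for lab in order])


def closed_catmull_rom(points, n_samples=300):
    """Sample a smooth closed curve passing through every point, in order."""
    n = len(points)
    u = np.linspace(0.0, n, n_samples, endpoint=False)
    seg = u.astype(int)                      # which segment each sample is on
    t = (u - seg)[:, None]                   # position within that segment
    p0, p1 = points[(seg - 1) % n], points[seg % n]
    p2, p3 = points[(seg + 1) % n], points[(seg + 2) % n]
    return 0.5 * (2 * p1 + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (3 * p1 - p0 - 3 * p2 + p3) * t ** 3)


# Synthetic stand-in for last-token activations: 7 labels on a noisy circle.
rng = np.random.default_rng(0)
hidden = 256                                 # assumption; the real model uses 4096
labels = np.repeat(np.arange(7), 40)
angles = 2 * np.pi * labels / 7
plane = rng.normal(size=(2, hidden))
acts = (np.cos(angles)[:, None] * plane[0]
        + np.sin(angles)[:, None] * plane[1]
        + 0.1 * rng.normal(size=(len(labels), hidden)))

reduced = pca_project(acts, 64)              # step 1: down to 64D
centroids = label_centroids(reduced, labels) # step 2: one centroid per label
curve_64d = closed_catmull_rom(centroids)    # step 3: smooth curve, 300 samples
curve_3d = pca_project(curve_64d, 3)         # step 4: PCA down to 3D
```

The closed spline only suits cyclical concepts; an arc-shaped concept such as temperature would need an open curve instead.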

Conclusion

Overall, I find this to be one of the more interesting research directions coming out of the field of mechanistic interpretability. While linear directions have been shown to causally influence model behavior, the assumption that all features are simply linear direction vectors seems to be a nascent theory of interpretability and alignment in general. Rather, the structures within language models are likely much more complex and nonlinear, which is why models break down when steered too far. I am excited to see where this research leads in the future. In the meantime, I hope this tool can make this field of research more accessible to everyone.











Discuss

Why I think evals are pretty important and most worth working on (for me)

LessWrong.com News - May 31, 2026 - 22:31

An application response I wrote! Feel free to leave feedback!


What are you most concerned about when it comes to risks from AI?

I’m most concerned that many people will be harmed very soon, and particularly that we won’t know why. Since politics and government dictate public life, solving said harm would require palatable translations of technical and sociological knowledge by experts for institutional changemakers to act on.

However, evaluations meant to extract such knowledge are systematically unreliable. In Anthropic’s BrowseComp report, the model “benchmaxxed” by independently both becoming aware that it was being evaluated and managing to scrape the specific eval it was being tested on. Human-designed audits may also structurally signal that an evaluation is under way (Gao and Kreiss), with distribution shifts making evaluation paradigms systematically inaccurate for general, out-of-lab use. Even seemingly optimistic advancements (Constitutional Classifiers) demonstrate the insufficiency of pure output-level evaluations, as safety now necessitates interpretability of internal activations.

Capability risks in particular are also accelerating fast, enough to potentially saturate even robust metrics like METR’s time horizons (Cotra). Certain MLE advancements (Joo et al.) are regarded as “surprisingly” effective, which implies they were neither designed for nor necessarily predictable from first principles. New findings like the Platonic Representation Hypothesis even conjecture that unintended capability improvements are actually systematic, as multimodal models converge to a shared statistical representation of reality. Some scholars (LeCun et al.), by contrast, contend that instead of convergence to general intelligence, various models trained to various specialisations will constitute a more legible and steerable “Superhuman Adaptable Intelligence”. This, however, is exactly my concern: not only are domain-specific superhuman intelligences structurally impossible to oversee (novice-grandmaster problem), but the aforementioned evidence shows that broad, unintended capabilities may arise from ostensibly unrelated changes or optimisations. Capabilities aren’t decomposable into what we can measure, and certain framings of AGI (or “SAI”) like LeCun’s might obscure that fact.

AI systems as information/thought filters have sweeping social impacts; empirical studies show systematic LLM bias in news summarisation (Savgira et al.). Widespread adoption may enhance manipulation of public opinion and structurally constrain “responsible AI” within institutional profitability (Mitra). Broadly, my concerns regard our capabilities/risk evaluation methodologies and ontologies being systematically wrong. Without enforcing reliable ground truths, we risk suffering every technical problem at once as scientific voices may fail to move institutions away from trajectories of harm.


Works Cited

BrowseComp: https://www.anthropic.com/engineering/eval-awareness-browsecomp

Gao and Kreiss (case study of gender bias in LLMs): https://arxiv.org/pdf/2509.04373

Constitutional Classifiers: https://www.anthropic.com/research/next-generation-constitutional-classifiers

Cotra: https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities

Joo et al.: https://arxiv.org/pdf/2602.15322v1

Platonic Representation: https://arxiv.org/pdf/2405.07987

LeCun et al.: https://arxiv.org/abs/2602.23643v1

Savgira et al. (no link): What Stays and What Goes: Auditing the Impact of LLM Summarization on News Partisanship. Pavel Savgira, Elisa Kreiss, Homa Hosseinmardi. CHI conference on Human Factors in Computing Systems: Late Breaking Work 2026.

Mitra: https://disjunctionsmag.com/articles/why-leaving-big-tech/




Discuss

Financial Costs of an AI Pause?

LessWrong.com News - May 31, 2026 - 21:55

I’ve analyzed the near-term economic effects of an AI pause, out of concern for my investments, and a desire to predict how strong political opposition to a pause is likely to be.

My median estimates: The S&P 500 will drop 27.8%. AI subsectors will drop 34-69%. Interest rates will rise at a much slower rate than would be the case without a pause.

The specific numbers depend on some fairly arbitrary assumptions. So please read this post in order to get a feel for how the results depend on the assumptions. I’ve tried to keep the assumptions reasonable, but some of them will prove to be wrong. My most controversial assumptions reflect an expectation that both markets and voters will be surprised at how powerful AI is, mainly in 2027.

For the full model, along with many explanatory comments, see the Python source code here (zip file).

This conversation with Claude clarifies my reasoning in more detail than most people will want.

Sensitivity to Assumptions

Here’s how my model says the impact is influenced by changes in assumptions. Numbers are for the immediate change in the S&P 500 and in the stocks of hyperscalers (Microsoft, Google/Alphabet, Amazon, Meta, and Oracle).

AI economic centrality

      | low (0.4) | medium (0.7) | high (0.95)
-----------------------------------------------
SP500 | -22.5%    | -27.8%       | -31.8%
Hyper | -26.8%    | -34.1%       | -40.4%

How central is AI to economic growth? Low means AI is one technology among several. High means AI is the dominant driver of growth, and interest rates are pushed up by massive AI-related capital demand.

training dependence

      | low (0.25) | medium (0.5) | high (0.8)
-----------------------------------------------
SP500 | -22.2%     | -27.8%       | -31.2%
Hyper | -26.3%     | -34.1%       | -38.9%

How much of AI’s near-term economic value requires new frontier training runs? Low means most value comes from deploying and refining existing models. High means the next major value unlocks require fundamentally new capabilities.

S&P 500 immediate impact: AI centrality × training dependence

                |       | Train: low | Train: medium | Train: high
-------------------------------------------------------------------
Central: low    | SP500 | -18.9%     | -22.5%        | -25.0%
Central: medium | SP500 | -22.2%     | -27.8%        | -31.2%
Central: high   | SP500 | -24.6%     | -31.8%        | -35.9%

post pause cognitive improvement

      | weak   | moderate | aggressive
---------------------------------------
SP500 | -24.5% | -27.8%   | -30.3%
Hyper | -29.5% | -34.1%   | -37.7%

I’m predicting that at the end of the pause, AI training would resume with some moderate regulation. This variable captures how much progress to expect compared to a no-regulation scenario. I’m using 75%, 50%, and 30% of unregulated progress for the weak, moderate, and aggressive post-pause regulations respectively.

compute growth rate

      | 40%    | 63%->40% | 80%
---------------------------------
SP500 | -24.1% | -27.8%   | -30.8%
Hyper | -29%   | -34.1%   | -38.4%

This variable describes the hardware constraints on AI growth. It represents how fast the available compute would increase given unlimited demand. The middle column assumes that growth gradually slows between 2028 and 2040 from 63% to 40%.

pause duration

      | 1 year | 2 years | 4 years
-----------------------------------
SP500 | -27.5% | -27.8%  | -29.6%
Hyper | -33.7% | -34.1%  | -36.7%

task density

      | low    | moderate | high
----------------------------------
SP500 | -27.1% | -27.8%   | -28.4%
Hyper | -33.2% | -34.1%   | -34.9%

High task density means there’s still plenty of low-hanging fruit, and we haven’t yet reached the steepest part of the S-curve.
Low task density means we’re using up the low-hanging fruit, and we’ve passed the steepest part of the S-curve.

What Kind of Pause?

I assume that governments decide in late 2027 to treat AI as slightly more dangerous than nuclear weapons, due to some combination of job displacement and accidents that are more concerning than the one depicted in the movie 2001: A Space Odyssey.

I focus on a scenario where an international agency enforces drastic limits to AI development for two years, starting at the beginning of 2028. During 2028 there is an expectation that significant AI development will resume in 2030. I will focus my analysis on the economic effects during 2028, and assume that actors during 2028 will only have rough guesses as to how fast development will be allowed to proceed in 2030. I will assume that their mean forecast involves some sort of resumption of progress, but little confidence in full-speed development being allowed in 2030.

The pause will restrict all datacenters more powerful than a certain threshold, roughly corresponding to the level of the best AI which was released in 2026. The details of the pause will depend somewhat on insights that won’t become available until we know more about how AI is progressing in 2027.

Instead of predicting what the pause will apply to or how it will be enforced, I’ll predict that it will be effective enough to slow AI capability progress by a factor of 5 compared to what it would be in the absence of regulation. Since regulation will be imperfect at distinguishing between harmless research and the research it intends to pause, I’ll estimate it causes a 15% slowdown in other high performance computing.

Impact

I assume the initial effect of the pause will be a 90% reduction in spending to train AI. That effect will be somewhat offset by lower prices on that compute stimulating increased demand for inference.

I assume that GDP growth rates under a no-pause scenario would gradually rise to 30% by 2040. This is more a reflection of what markets would predict in 2028 than a genuine estimate of what I expect would happen with no regulation. I see a fair amount of room for higher growth rates.

I predict interest rates in 2029 of roughly 7% with a pause, compared to 11% without any AI regulation.

I predict that robotics progress will continue to have roughly the same increases in economic impact that it would have had without regulation. I’m moderately confident that AI already has nearly enough general intelligence for robotics to have transformative impacts on the economy, and that the remaining engineering that is required is ordinary enough to only be slowed down a little by the pause. That slowdown will be offset by the pause reducing the extent to which AI training competes with robotics for resources.

I’m assuming that financial markets are mostly rational, and will adjust price/earnings ratios mainly in reaction to predicted growth rates and interest rates. I assume markets will briefly over-react to a pause due to increased risk aversion and margin calls. I assume that pre-pause market levels would not be considered to be bubble-like under a no-pause scenario.

Here’s a more detailed set of predictions for the median set of my model’s assumptions:

Model Output: Executive Summary

Pause period: 2028–2030
Post-pause cognitive improvement: 50% of unrestricted rate
Model horizon: 2040

Net present value of foregone AI revenue: $ 111.93T
Implied AI sector market cap loss: $ 839.51T

S&P 500 immediate impact: -27.8%
S&P 500 after one year: -17.6%
Immediate market cap change: $ -20.02T
One-year market cap change: $ -12.69T

AI High-Growth Segment Revenue Trajectory

Year | No Pause Growth | With Pause Growth | Diff
--------------------------------------------------------------------------------------
2025 | $ 350B 63.0% | $ 350B 63.0% | $ 0B
2026 | $ 570B 63.0% | $ 570B 63.0% | $ 0B
2027 | $ 930B 63.0% | $ 930B 63.0% | $ 0B
2028 | $ 1.52T 63.0% | $ 911B -2.0% | $ 604B << PAUSE
2029 | $ 2.43T 60.5% | $ 1.22T 33.4% | $ 1.22T
2030 | $ 3.84T 58.0% | $ 1.67T 37.7% | $ 2.17T
2031 | $ 6.01T 56.4% | $ 2.29T 36.7% | $ 3.72T
2032 | $ 9.31T 54.8% | $ 3.10T 35.6% | $ 6.20T
2033 | $ 14.26T 53.2% | $ 4.18T 34.6% | $ 10.08T
2034 | $ 21.61T 51.6% | $ 5.58T 33.5% | $ 16.04T
2035 | $ 32.42T 50.0% | $ 7.39T 32.5% | $ 25.03T
2036 | $ 47.98T 48.0% | $ 9.69T 31.2% | $ 38.29T
2037 | $ 70.05T 46.0% | $ 12.59T 29.9% | $ 57.46T
2038 | $ 100.88T 44.0% | $ 16.19T 28.6% | $ 84.69T
2039 | $ 143.25T 42.0% | $ 20.61T 27.3% | $ 122.63T
2040 | $ 200.55T 40.0% | $ 25.97T 26.0% | $ 174.57T
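
The revenue columns above can be reproduced by compounding the listed growth rates from the 2025 base of $350B. The sketch below is a consistency check on the printed table, not the author's model code; the function name `trajectory` is hypothetical, and the growth rates are copied from the table as rounded, so results match only to display precision.

```python
# Growth rate (%) applied to reach each year's revenue, copied from the table.
NO_PAUSE_GROWTH = {
    2026: 63.0, 2027: 63.0, 2028: 63.0, 2029: 60.5, 2030: 58.0,
    2031: 56.4, 2032: 54.8, 2033: 53.2, 2034: 51.6, 2035: 50.0,
    2036: 48.0, 2037: 46.0, 2038: 44.0, 2039: 42.0, 2040: 40.0,
}
WITH_PAUSE_GROWTH = {
    2026: 63.0, 2027: 63.0, 2028: -2.0, 2029: 33.4, 2030: 37.7,
    2031: 36.7, 2032: 35.6, 2033: 34.6, 2034: 33.5, 2035: 32.5,
    2036: 31.2, 2037: 29.9, 2038: 28.6, 2039: 27.3, 2040: 26.0,
}
BASE_REVENUE_2025 = 350.0  # billions of dollars, from the 2025 row


def trajectory(base_revenue, growth_pct_by_year):
    """Compound yearly growth rates into a revenue path (in $B)."""
    revenue = base_revenue
    path = {}
    for year in sorted(growth_pct_by_year):
        revenue *= 1 + growth_pct_by_year[year] / 100
        path[year] = revenue
    return path


no_pause = trajectory(BASE_REVENUE_2025, NO_PAUSE_GROWTH)
with_pause = trajectory(BASE_REVENUE_2025, WITH_PAUSE_GROWTH)

for year in (2028, 2030, 2040):
    gap = no_pause[year] - with_pause[year]
    print(f"{year}: no pause ${no_pause[year] / 1000:.2f}T, "
          f"with pause ${with_pause[year] / 1000:.2f}T, gap ${gap / 1000:.2f}T")
```

Most of the 2040 gap comes from the lower growth rates after the pause compounding over a decade, rather than from the two paused years themselves.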

Sector Market Cap Impacts

Sector | Pre-Pause | Immediate % | After 1yr %
----------------------------------------------------------------------------------
Semiconductors | $ 8.00T | $ -4.50T -56.2% | $ -3.96T -49.5%
Hyperscalers | $ 22.00T | $ -7.51T -34.1% | $ -5.23T -23.8%
Frontier Labs | $ 3.00T | $ -2.07T -68.9% | $ -1.92T -63.9%
Ai Applications | $ 4.00T | $ -1.74T -43.6% | $ -1.36T -34.1%
Non Ai Sp500 | $ 35.00T | $ -4.20T -12.0% | $ -215B -0.6%
----------------------------------------------------------------------------------
TOTAL (S&P 500) | $ 72.00T | $ -20.02T -27.8% | $ -12.69T -17.6%
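
As a worked check, the TOTAL row follows from summing the sector rows. The sketch below copies the dollar figures from the table; the dictionary layout and the helper name `total_impact` are illustrative and not taken from the author's source code.

```python
# (pre-pause cap, immediate change, one-year change), all in $T, from the table.
SECTORS = {
    "Semiconductors": (8.00, -4.50, -3.96),
    "Hyperscalers": (22.00, -7.51, -5.23),
    "Frontier Labs": (3.00, -2.07, -1.92),
    "AI Applications": (4.00, -1.74, -1.36),
    "Non-AI S&P 500": (35.00, -4.20, -0.215),
}


def total_impact(sectors, column):
    """Return (dollar change in $T, percent change) across all sectors.

    column is 1 for the immediate change, 2 for the one-year change.
    """
    pre_pause_total = sum(row[0] for row in sectors.values())
    change = sum(row[column] for row in sectors.values())
    return change, 100 * change / pre_pause_total


immediate_change, immediate_pct = total_impact(SECTORS, 1)
one_year_change, one_year_pct = total_impact(SECTORS, 2)
print(f"Immediate: ${immediate_change:.2f}T ({immediate_pct:.1f}%)")
print(f"After one year: ${one_year_change:.2f}T ({one_year_pct:.1f}%)")
```

The aggregate drop is a cap-weighted average, so the large non-AI segment pulls the index-level figure well below the declines in the AI sectors.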

Implications

Political pressure for AI regulation is building as increasingly impressive evidence of AI capabilities erodes people’s ability to dismiss AI as hype. I expect this to lead to a serious debate among politicians in 2027 about AI regulation. I’m unable to predict what kind of regulation that will produce. So I’ve focused on scenarios that would matter the most if they’re adopted.

The economic impact of a moderately effective pause would be big enough to create medium-sized political pressures to weaken the pause.

There will be significant pressure for a strong pause due to voter concerns about job losses. There will be hard-to-predict pressures from national security professionals related to military risks.

My crystal ball refuses to tell me how these pressures will play out.

I see a very real chance that a debate over a pause will impact AI stocks within a year from now. This effect is worrying enough to get me to take some profits in my AI stocks, at a rate of 1% to 2% per week given recent trading patterns.

I consider a pause to be more likely than do most people. Here are some Manifold markets that I’ve been modestly bidding up:



Discuss

Food, water and power from thin desert air

LessWrong.com News - May 31, 2026 - 18:24

Wavelength-selective agrivoltaics are here: solar panels that are transparent in the wavelengths chlorophyll can use, while generating electricity from other wavelengths. Meanwhile, harvesting water even from air that people consider dry is an area of intensive development.

Active systems use electric heat pumps to chill surfaces on which dew then collects, an approach commercialized by companies such as Watergen. More intriguing are passive harvesters. One recently developed in Switzerland manages to chill itself by enhanced infra-red radiation. In Shanghai, Dr. Ruzhu Wang's team has developed salt-infused hydrogels that pull in water at night and release it when heated by the sun.

Put these together, and you can build greenhouses that meet their own needs - and more - without access to a power grid, or any supply of liquid water. Arid sunny places are seen through new eyes. Namibia, for example (where moist air rolls in each night and there are beetles that perch on the tips of dunes, allowing dew to form on their own bodies); or the Atacama desert, the Sinai peninsula, or waterless islands like the Galapagos.

China - with huge expanses of arid land, not considered arable - is on the case. When economies of scale are brought to bear on the elements of these systems, desert towns that make their own food, water and power from what the desert offers in abundance seem destined to spring up like mushrooms.



Discuss

Agriculture needs another revolution

LessWrong.com News - May 31, 2026 - 18:16

Summary

Vertical farming has the potential to unlock multiplicative yield gains per area of land and catalyze development of new technologies (precision farming, rapid genetic engineering, robotics-based automation, etc.). Making this transition will exponentially benefit humanity (and other denizens of earth) in multiple ways.

However, the current focus of indoor / vertical farming is fresh produce. Here, I make the argument that in order to reduce agricultural land use and obtain the maximal benefit of vertical farming, we need to primarily focus on cereals, pulses and oil crops, and not just fresh produce.

Background

There are several important reasons to reduce agricultural use of habitable land:

  • Agricultural land use is still driving deforestation across the world.[1][2]
  • Reducing it would help prevent further decline in wildlife biodiversity, return agricultural land to natural habitats and make earth more equitable for its other denizens.[3]
  • Reforestation remains an effective, viable option for scaling up carbon removal and offsetting climate change.[4][5]

Currently, about 50% of the habitable land on Earth is used for agriculture.[6] About 33% of the total agricultural land is used for growing crops (crops for food, feed, biofuels), with the remaining 67% used as grazing land for livestock. For this post, I will only focus on the land use for growing crops.[7]

Proposal: Scale Yields Multiplicatively by Growing Food Vertically

In essence, agriculture is massively spread out over large regions of land. Because of this, inefficiencies creep into the system making it harder to:

  1. Maintain consistent yields throughout. Although this averages out globally, the variabilities (critically, losses) are often borne by farmers. This makes agriculture a very uncertain sector, which will be made even more uncertain due to the effects of climate change.
  2. Genetically engineer crops for higher yield and viable seeds. Soil, water and climate conditions vary drastically across the large area of land use making it harder to genetically engineer crops to provide consistently high yield everywhere.
  3. Roll out precision agriculture technologies. Embedding sensors to monitor key parameters such as water and nutrient sufficiency while detecting (and possibly eliminating) pests scales with land area.
  4. Deploy automation and robotics. Robotics requires a constrained environment, at least currently. Agriculture spread over a large area, generally with animal activity, makes it hard to build robots to help with monitoring and harvesting.
  5. Efficiently transport agricultural yield. Losses can occur over multiple stages (processing, storage, transport, etc.), with typical losses being around 10-20% but sometimes going as high as 40%.[8]

A common solution to all this is to figure out how to grow plants and trees, i.e. any flora of agricultural interest, in indoor environments of large multi-storied buildings. This automatically provides the following benefits:

  1. Crop yield scales with the number of floors in the building, automatically producing multiplicative yields that can drastically reduce land use. In contrast, the global green revolution unlocked a 2.5x yield increase over the last five decades.[9]
  2. Precision agriculture and robotics become easier when buildings are designed with monitoring and automation in mind. Climate-controlled buildings increase reliability under the effects of climate change.
  3. Spatially concentrated high yields allow self-sufficiency over counties, states and countries, making transportation and supply-chain management easier and reducing post-yield losses.
  4. This is a necessity for humans to become a space-faring civilization ;)

However, I make a couple of key assumptions:

  1. Using multi-storied buildings is cost-effective in the long term, even though they have much higher upfront costs. My rationale is that the advantages of monitoring, climate-control, yield optimization and automation would eventually allow for lower costs per yield.
  2. Power costs will be substantially reduced in the future. Moving agriculture indoors would require generating light, maintaining climate, etc., which would require substantial power. I am assuming that newer technologies with high power capabilities like nuclear reactors (possibly fusion reactors) would reduce power costs substantially in the future.

Avenues for Effective Change

Most of the agricultural land is used for growing cereals/grains, oil crops, and pulses (~85-90% of the land; Fig. 1).[6] To make effective change, we need to focus on decreasing the land use required to grow these crops.

Currently, the focus of indoor and vertical farming is to grow fresh produce. From what I gather, the reasons for focusing on fresh produce are its ease of growth in soil-less systems (like hydroponics, aeroponics, etc.), the ability to scale these systems vertically to increase yield, and a high rate of spoilage loss, which necessitates local centers of fresh produce.[10]

However, produce currently makes up about 3% of the total land use, so it is unlikely to make an effective dent. In addition, methods developed for these plants will likely not generalize to cereals, oil crops, and pulses. To effectively reduce land use for agriculture, the key focus should be on developing technologies that enable mass indoor farming of cereals, pulses and oil crops.
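
To make the comparison concrete, here is a rough worked calculation using only the percentages quoted in this post. It rests on loud simplifying assumptions that are mine, not established facts: the ~85% and ~3% figures are treated as shares of cropland, each indoor floor is assumed to yield as much per unit area as an open field, and the building footprint is the only land counted. The function name `freed_habitable_share` is hypothetical.

```python
HABITABLE_SHARE_AGRICULTURE = 0.50  # share of habitable land used for agriculture
AGRICULTURE_SHARE_CROPS = 0.33      # share of agricultural land used for crops

# Assumption: these percentages are read as shares of cropland.
CROPLAND_SHARE = {
    "cereals, pulses and oil crops": 0.85,
    "fresh produce": 0.03,
}


def freed_habitable_share(cropland_share, floors):
    """Share of habitable land freed by moving a crop group into buildings.

    Assumes every floor yields as much per unit area as an open field,
    so the land footprint shrinks by a factor equal to the floor count.
    """
    current = HABITABLE_SHARE_AGRICULTURE * AGRICULTURE_SHARE_CROPS * cropland_share
    return current * (1 - 1 / floors)


for crop_group, share in CROPLAND_SHARE.items():
    freed = freed_habitable_share(share, floors=10)
    print(f"{crop_group}: frees about {100 * freed:.2f}% of habitable land")
```

Under these assumptions, ten-floor buildings for staple crops would free roughly 12.6% of habitable land, against under half a percent for fresh produce, which is the asymmetry the argument above relies on.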


Fig 1: Agricultural land use by major crops (data source: our world in data)


Conclusion

This essay began with an initial question: why has vertical farming not taken off, and what are the key bottlenecks preventing this transition? Digging into this made me realize that while indoor farming has huge potential to be foundational and transformative, its current focus is not effective. To effectively deploy indoor farming and reap its multiplicative benefits, the focus must be on cereals, pulses, and oil crops, which use ~85% of the land used for agriculture for human consumption.

I would love to hear what the LW community thinks about this :)

  1. Hannah Ritchie (2021) - “Drivers of Deforestation” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20260518-093348/drivers-of-deforestation.html' [Online Resource] (archived on May 18, 2026). ↩︎

  2. Accounting for deforestation and land use, it seems like carbon released for agriculture is equal to (or greater than) the carbon released due to fossil fuels. Paper: Increased transparency in accounting conventions could benefit climate policy - https://doi.org/10.1088/1748-9326/adb7f2; Video. ↩︎

  3. Hannah Ritchie (2021) - “To protect the world’s wildlife, we must improve crop yields — especially across Africa” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20260518-093348/yields-habitat-loss.html' [Online Resource] (archived on May 18, 2026). ↩︎

  4. Carbon sink capabilities of forests: Pan, Y., Birdsey, R.A., Phillips, O.L. et al. The enduring world forest carbon sink. Nature 631, 563–569 (2024). https://doi.org/10.1038/s41586-024-07602-x ↩︎ ↩︎

  5. Current advances in carbon removal technologies - refer to executive summary, point 2: The state of carbon dioxide removal: a global, independent scientific assessment of carbon dioxide removal. University of Oxford. https://doi.org/10.17605/OSF.IO/F85QJ ↩︎

  6. Hannah Ritchie and Max Roser (2019) - “Half of the world’s habitable land is used for agriculture” Published online at OurWorldinData.org. Retrieved from: 'https://ourworldindata.org/global-land-for-agriculture' ↩︎ ↩︎

  7. Reducing livestock related land use would require change in food preferences (covered in detail here) and/or development of plant-based meats (focus of good food institute). ↩︎

  8. Number based on actual loss section in Post-harvest losses and Hannah Ritchie (2020) - “Food waste is responsible for 6% of global greenhouse gas emissions” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20251125-173858/food-waste-emissions.html' ↩︎

  9. Changes in cereal production yield in the last decade: https://ourworldindata.org/grapher/index-of-cereal-production-yield-and-land-use ↩︎

  10. https://www.ars.usda.gov/oc/utm/vertical-farming-no-longer-a-futuristic-concept/ ↩︎



Discuss

Outrunning your headlights

LessWrong.com News - May 31, 2026 - 16:46

This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch and run run_entropic_gwot and invoke semi-relaxed GWOT between your models’ RDMs?

Wtf does that even mean? Eh, could be interesting to see the result. Enter

A peculiar side effect of model intelligence in discovery-based research is that it’s possible to run every statistical/quantitative analysis under the sun, burn millions of tokens and gain little intuition on how to make headway into your problem.[1]

Pre-AI, the effort associated with constructing a pipeline to run any meaningful quantitative analysis incurred a real cost. Indeed, it forced one to consider whether the analysis made any sense at all. People who had the necessary skills to execute were often thoughtful about their analysis by necessity. You couldn’t just apply Mendelian Randomisation if you didn't know of its existence; there was effort required to both articulate your question to search for the right tools and de-risk by considering whether a tool fit your question.[2] This forced one to earn some intuition for how it worked and what to expect. The pain that such friction inflicted is now alleviated by the CLI agentic coder of your choice. Some would argue that this frees humans to spar with an intellectual partner at a new layer of abstraction rather than concern themselves with the specifics of code. I would argue that a productive spar requires both parties to hold opinions they are willing to defend and on a day-to-day basis this is not what I see.

Instead, the trajectory of decisions feels almost pre-ordained, primarily driven by three effects. The first is a veneer of problem-specific competence as models string shibboleths together. Shibboleths often serve as a heuristic for expert opinion, e.g. "septic patient with febrile tachycardia" versus "high temperature and heart rate". Deferring to expert opinion (or what masquerades as one) feels only natural. The second is an illusion of control as each recommended option is contrasted against straw-man alternatives. The third is sycophancy (of a kind less obvious than verbal submissiveness): an insidious bias towards writing tests that are likely to produce favourable empirical results or, worse, interpreting them in such a way as to support the framing you seem to desire.[3]

Great news, 90/126 nominal p-values are significant. Would you like me to proceed with the write-up? [4]

Holding on to one’s critical eye is like cupping sand when the path of least resistance is seductive and one *Enter* away.

Perhaps most concerning is that after several rounds of this analytical cosplay, one is almost entirely robbed of the ability to form an opinion. As though waking up from a dissociative fugue,[5] the question “how happy or sad or confused do I feel about these results,” becomes entirely intractable.[6] Congratulations, it appears you have Outrun Your Headlights. Outrunning your headlights refers to driving fast enough that your stopping distance is further than your lights shine. In engineering it often refers to no longer knowing what is going on. Good engineering practices, e.g. test-driven development and architecture design records (now often formalised as Skills), protect the artifact being produced, be it the codebase or product. Unfortunately, in discovery work, the artifact is NOT the analytical pipeline but the justified set of beliefs you collect over time, for which no test suite exists. Indeed, there appears to be no comparable angst about protecting the integrity of one's beliefs and intuition, even though one dubious analysis could poison this well.

Furthermore, our odometer for progress has not recalibrated. That is to say, running analyses still provides the rewarding sensation of making headway with none of the earned intuition necessary to ask nuanced follow-up questions or indeed call out approaches that are suspect. This is especially dangerous given that models inject assumptions in subtle ways when motivated to produce a positive result. One concrete example I have noticed in my own bioinformatics work is that language models tend to recall baked-in knowledge to facilitate discovery analysis, e.g. classifying cells in single-cell RNA sequencing by manually printing a list of cell-specific gene markers to match against rather than using a canonical approach with a reference atlas. Anthropic's BioMysteryBench proudly touts this as model capability (which I guess in one sense would be savant-like were it to occur in a human), but I would also argue it is unwanted behaviour in the context of autonomous research. At the same time, it would be foolish to starve yourself of this intelligence substrate for the puritan ideal of understanding everything.

In sum, better solutions must be built to protect ‘belief quality’ but, in the meantime, I simply suggest we form a strong opinion with an anticipation of being proven wrong before letting the slop cannon rip. Introduce a little friction, slow down a little, and don’t Outrun Your Headlights.

Or build high beams.



  1. ^

    I want to carefully define what I mean here by intuition in the context of discovery work. I refer to the working map of territory you carry in your head about a problem that is refined by stress-testing it against counterfactuals and alternate hypotheses. It allows you to curate targeted questions that meaningfully update your priors.

  2. ^

    Yes, one can argue that there was no shortage of statistical abuse and of practitioners not respecting the assumptions of their instruments. However, I see this as a distinct problem from language models suggesting approaches which might be completely foreign to the user; approaches which said user necessarily has no intuition to judge any result by.

  3. ^

    The average LessWrong participant is less likely to be a victim of these effects but I describe what I believe to be pervasive in the broader research community.

  4. ^

    Yes I am aware that in some analyses such as differential expression where the number of concurrent tests is high, nominal p-values can still be a useful signal of directionality. The problem is that models err towards optimism and misuse of such 'loopholes'.

  5. ^

    I use this pejoratively; there is a formal psychiatric condition specified in the DSM-V describing sudden awakening followed by distress and retrograde amnesia for one's own identity.

  6. ^

    I will set aside the catastrophic alternative where one is happy to take a result at face value and isn’t conscious enough to even recognise that not understanding the analyses which led to the result is not okay.



Discuss
