LessWrong.com News
NLA Thought Anchors
This post looks further into why NLA (Natural Language Autoencoder) output contains the model's prediction more often when the original activations led to a correct output than to an incorrect one.
Quick Summary:
- Extraction position matters: the final answer appears in the AV output more often as the extraction token approaches the model's final answer
- First sentence is the most counterfactually important for both activation reconstruction loss and the AV containing the final output
- Sentences counterfactually important for generating the final answer correlate with lower reconstruction loss, suggesting the AR training reward encourages the model to include correct answers
- Degenerate NLA outputs (repetition, garbled tokens, emoji blocks) appear only for activations from incorrect model responses.
- NLA response length varies more for incorrect activations, possibly reflecting model uncertainty
- Incorrect activations reconstruct ~30% worse than correct ones
- Surprisingly, for activations that led to an incorrect answer, the NLA sometimes produced broken or degenerate outputs, for example repetition, garbled tokens, and emoji blocks. This appears only for activations that led to incorrect responses. NLA response length also varies more for incorrect activations, possibly reflecting model uncertainty.
- The final answer contributes more to the NLA's reconstruction loss when the activations led to the correct output, and less when they did not.
- On the GSM8K dataset, the NLA seems to have higher reconstruction loss when the activations led to the wrong answer.
- The first sentence seems to be the most counterfactually important for NLA AV responses, both for reconstruction loss and for the response containing the final answer (the actual answer vs. the model's response). Counterfactual importance was more evenly spread across sentences for base activations leading to an incorrect answer.
Code: https://github.com/Realmbird/nla-thought-anchors
Huggingface datasets I created: https://huggingface.co/collections/Realmbird/nla-thought-anchors
I created a pipeline with the following steps (for further details, see the README):
Step 1 (Generate responses with the base model)
Step 2 (Generate the first NLA explanations with the AV)
Step 3 (Generate and score rollouts; this takes the most time. The arguments I used are a cos_sim threshold of 0.8 and 40 rollouts per sentence)
Step 4 (Analyze the rollouts)
The other files mostly produce visuals and analysis, and each notes which step must be run first.
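A minimal sketch of how the two Step 3 arguments could interact, assuming a Thought Anchors style resampling setup: each sentence is resampled (40 times in the run above), and only rollouts whose replacement sentence falls below the cos_sim threshold count as genuinely different. The `Rollout` fields, the function name, and the rate-difference metric are illustrative assumptions, not the repository's actual code; the README has the real definitions.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    cos_sim: float          # similarity of the resampled sentence to the original
    contains_answer: bool   # whether the rolled-out text contains the answer


def counterfactual_importance(base_rate, rollouts, threshold=0.8):
    """Drop in answer-containment rate once a sentence is replaced by a different one.

    Rollouts at or above the similarity threshold are treated as paraphrases
    of the original sentence and ignored. Returns None if none qualify.
    """
    different = [r for r in rollouts if r.cos_sim < threshold]
    if not different:
        return None
    resampled_rate = sum(r.contains_answer for r in different) / len(different)
    return base_rate - resampled_rate


# Toy data: four rollouts, one of which is too similar to count.
rollouts = [
    Rollout(0.9, True),
    Rollout(0.5, False),
    Rollout(0.3, True),
    Rollout(0.1, False),
]
print(counterfactual_importance(0.75, rollouts))
```

A higher threshold keeps more rollouts but counts near-paraphrases as counterfactuals; a lower one is stricter but leaves fewer samples per sentence.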
NLA Setup
The NLAs I used were from https://github.com/kitft/natural_language_autoencoders, along with the inference code, run with SGLang.
Base model: Qwen2.5-7B-Instruct
AV: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-av
AR: https://huggingface.co/kitft/nla-qwen2.5-7b-L20-ar
Dataset:
https://huggingface.co/datasets/zen-E/GSM8k-Aug
Experiments:
NLAs are position sensitive:
- I started with my original NLA script and looked at how often the NLA output contained the final answer. The rates were clearly too low; I then noticed I was extracting at the last prompt token instead of a token after generation. This led me to the idea that whether the final answer appears in the NLA output depends on token position.
- Extraction position matters: the final answer appears in the AV output more often as the extraction token approaches the model's final answer. For the answer and hash tokens specifically, correct activations led to the final answer appearing in the NLA output at a significantly higher rate.
- The difference between the border token and the answer token becomes more apparent after a few samples or rollouts.
- These results support Ryan Greenblatt's findings that “NLA output contains what the AI will predict at a rate much higher than chance for both incorrect and correct problems”
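One way to score "the NLA output contains the final answer" is a numeric match on the explanation text. The sketch below is an assumption about how such a check could work, not the script used for these experiments; `contains_answer` and `containment_rate` are hypothetical names. It compares whole numbers so that an answer of 25 does not match inside 125.

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def contains_answer(explanation: str, answer: str) -> bool:
    """True if the numeric answer appears as a standalone number in the text."""
    target = float(answer.replace(",", ""))
    return any(
        float(token.replace(",", "")) == target
        for token in NUMBER.findall(explanation)
    )


def containment_rate(explanations, answer):
    """Fraction of explanations (e.g. all samples at one token position) with the answer."""
    if not explanations:
        return 0.0
    return sum(contains_answer(e, answer) for e in explanations) / len(explanations)


print(containment_rate(["#### 18", "the answer is 81"], "18"))
```

Computing this rate separately per extraction position (last prompt token, border token, answer token) would give the kind of position comparison described above.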
Model Correctness Impact on NLA outputs:
- NLAs produce more consistent AV response lengths when the original activations led to a correct response. Put differently, NLA response length varies more for incorrect activations, potentially reflecting increased model uncertainty.
- The graph shows per-sentence counterfactual importance in the NLA output, measured as each sentence's counterfactual impact on the output containing the actual answer (gold) or the model's answer (pred).
- Correct examples are cases where the original base model activation led to the correct response. Incorrect examples are cases where it did not.
- The first sentence is the most counterfactually important for generating the GSM8K (gold) answer.
- For the incorrect group, the most counterfactually important sentence was the last sentence.
- Counterfactual importance for generating the correct answer seems more balanced across sentences when the activations led to the correct response.
- For the correct group, contain-gold and contain-pred should be the same (since pred = gold when the model is correct).
- An interesting finding is that only when the model outputs an incorrect answer does the AV sometimes generate broken behavior such as repetition, Wikipedia-style text, forum-style text, etc.
- See Appendix for the examples of these categories
- A natural follow-up question is how counterfactual importance for containing the gold or predicted answer correlates with reconstruction loss.
- I grouped sentences into quartiles by their counterfactual importance for containing the final answer and found that the more counterfactually important a sentence was, the lower the AR reconstruction loss. This suggests that the AR reconstruction objective encourages including the final answer in the AV.
- NLA responses from activations where the model output the correct response have lower reconstruction loss on average. The NLA struggles more with activations from incorrect responses.
- For some reason, ablating all instances of the final answer has a larger impact when the model output the correct answer than when it output an incorrect one, but only for the border token (####). This did not occur for the answer digit token.
- For the answer token, ablating the answer from the AV seems to have a constant effect regardless of whether the original activations led to a correct or incorrect output.
- The higher impact of containing the answer on reconstruction loss suggests that the border token should be used if you want the AV to include the final answer. However, if you do not know whether the output is correct or incorrect, the answer token is better because its impact on AR reconstruction is more consistent.
- The per-sentence change in reconstruction loss (its range) varies more when the model originally generated an incorrect response.
- The change from correct to incorrect examples is more noticeable on the answer token than on the border token.
- Border token or ####
- Answer token reconstruction loss by sentence
- The NLA was for Qwen2.5-7B-Instruct, and the smaller model might have highlighted issues that do not occur in bigger models. Will the incoherent AV responses on incorrect model responses also happen with bigger NLAs?
- Try different cos_sim thresholds for sentence similarity; this may be a threshold issue.
- Investigate further the impact of the model's response being correct or incorrect on the NLA.
- Investigate with bigger models (apply for a BlueDot rapid grant)
- Clustering NLA sentences and labeling them
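The quartile comparison described in the experiments above (counterfactual importance for containing the final answer against AR reconstruction loss) can be sketched as follows. This is an illustrative assumption about the binning, not the analysis code from the repository: the function name is hypothetical, and the quartile cut points use Python's default `statistics.quantiles` method, which the post does not specify.

```python
from statistics import mean, quantiles


def mean_loss_by_importance_quartile(importance, loss):
    """Group sentences into importance quartiles and average the loss in each.

    `importance` and `loss` are parallel sequences, one entry per sentence.
    Quartile 1 holds the least important sentences, quartile 4 the most.
    """
    q1, q2, q3 = quantiles(importance, n=4)
    buckets = {1: [], 2: [], 3: [], 4: []}
    for imp, sentence_loss in zip(importance, loss):
        if imp <= q1:
            quartile = 1
        elif imp <= q2:
            quartile = 2
        elif imp <= q3:
            quartile = 3
        else:
            quartile = 4
        buckets[quartile].append(sentence_loss)
    # Skip empty quartiles (possible when many importance values are tied).
    return {k: mean(v) for k, v in buckets.items() if v}


# Toy data only: loss falls as importance rises, the pattern reported above.
toy_importance = [1, 2, 3, 4, 5, 6, 7, 8]
toy_loss = [9, 9, 7, 7, 5, 5, 3, 3]
print(mean_loss_by_importance_quartile(toy_importance, toy_loss))
```

On real data, a mean loss that decreases from quartile 1 to quartile 4 would match the finding that more counterfactually important sentences go with lower reconstruction loss.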
Degenerate Examples (any broken prompts are a display issue from rendering the examples as images)
- Garbled Token [:checked:checked]
- Emoji block
- Wikipedia
- Repetition
- Forum post
Coherent Examples
- Answer box
- Final Answer
- Closing_format
Mixed
- Calculator
Lighthaven East - A Feasibility Study
As a bureaucrat, my role is to annoy my friends. Someone voices an idea, “Wouldn’t it be nice if…” or “I wonder if we could…” I make a note. I do some estimates. If it pencils out, I’ll bring it back up, week after week. The discussions are fun, but also practical. We’ll test the waters, what would be a minimum viable scheme? What’s easy, what’s hard? Who could do the hard parts? Over time the idea gets more detailed, specific, feasible. I’ll pull out a calendar. Soon our scheme has co-conspirators, action items, even a budget. It’s just good staff work.
I’ve been hearing whispers in the wind for a year now.
- “Imagine if we had something like this in DC.”
- “Where can I host an event that might get a dozen or a hundred people?”
- “It’s such a pain in the ass to book event space in the Capitol.”
- “I think this person has started to see what’s coming, where can they go to get caught up?”
- “The community seems to be growing but it’s all fragmented in group chats.”
- “How is no one planning an afterparty, that’s clearly the highest leverage intervention!?”
- “Why can’t every wall be whiteboards?”
These are all variants on a theme: “Lighthaven East.”
I did some digging. I’m happy to report that this could work. There’s strong demand. There are good options for supply. Funding, staffing, resources, property, and permits are all doable. The hard parts are diligence, agency, and will. This project needs a champion, but it’s a thing someone can simply choose to do.
Legally speaking, Lighthaven is a confusing category error. It was once the ramshackle “Rose Garden Inn,” with several buildings, a hotel license, and a history of event use. After extensive renovations, it is now a 30,000 square foot campus used for conferences, retreats, office space, and medium-term lodging. The property is owned by Lighthaven LLC and financed by an interest-only mortgage held by a philanthropist. The LLC runs the property, hosts internal events, rents conference and office space to external customers, and sells hotel stays. Lighthaven LLC is itself owned by Lightcone Infrastructure, a non-profit that among other things runs LessWrong.
Economically, Lighthaven LLC generates an operating profit comparable to its cost of capital. The mortgage is $20 million at 5% interest, for an annual interest payment of $1 million. Lighthaven LLC had $3.25 million in revenue in 2025. Events and hotel stays generated an operating profit of roughly $850k, almost enough to pay the $1 million annual interest payment. Office space seems to be offered at cost. Lighthaven LLC’s projections of $3.5 million revenue in 2026 should generate an operating profit sufficient to fully fund its annual interest payment, though bookings are currently sparse for this fall.
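The interest arithmetic in this paragraph can be checked directly. A minimal sketch using only the figures quoted above; the variable names are illustrative, and the coverage ratio is a derived figure rather than one the source reports.

```python
# Figures quoted in the paragraph above.
mortgage_principal = 20_000_000   # dollars
interest_rate_percent = 5         # interest-only mortgage
operating_profit_2025 = 850_000   # events and hotel stays, approximate

# Interest-only loan: the annual payment is principal times rate.
annual_interest = mortgage_principal * interest_rate_percent // 100

# Derived, not stated in the source: share of interest covered by operating profit.
coverage_ratio = operating_profit_2025 / annual_interest
shortfall = annual_interest - operating_profit_2025

print(f"annual interest: ${annual_interest:,}")
print(f"coverage: {coverage_ratio:.0%}, shortfall: ${shortfall:,}")
```

On these numbers, 2025 operating profit covers 85% of the interest payment, leaving roughly $150k for the projected 2026 revenue growth to close.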
In practice, Lighthaven is the best event venue I’ve ever seen. I won’t belabor that point in this post, but if you haven’t been to Lighthaven, see some of its many rave reviews in this footnote.[1] Lighthaven LLC does not maximize profits–events are often experimental or designed primarily to support the Berkeley community, rather than the booking going to the highest bidder. Some event organizers are not charged, others are offered discounts on rates that are already lower than similarly sized spaces at hotels. This pricing strategy generates significant positive spillover effects and goodwill, demonstrated by the community’s strong response to Lightcone’s two fundraisers. While Lighthaven was a significant cost center for Lightcone in 2023 and 2024, by 2026 it is better modeled as supporting the parent non-profit.
Conceptually, Lighthaven is a monastery. Its main purpose is to support good scholarship “dedicated to making humanity’s future go better.” Its abbot skillfully wields an awkward mix of temporal, cultural, and political authority. Monasteries often support their ecclesiastical mission by selling craft goods such as beer, eggs, mushrooms, or furniture–Lighthaven instead sells conference space. Unlike an abbey selling produce for revenue, the conferences at Lighthaven also further Lightcone’s mission. Lighthaven’s scholars-in-residence synergize with its mission, contributing to and benefiting from the events held on the property.
These aspects combine into a whole: Lighthaven is the place to go to think out loud. Comfortable perches encourage deep thought. Inviting conversation nooks encourage you to refine your ideas with friends, themselves helpfully provided by the events and scholars-in-residence. Beautiful seminar rooms encourage you to share your ideas, refining your presentation to best convey them to others. Once your ideas are fully baked, get the word out via your laptop, the antique typewriters, or having a friend interview you in the podcast studio.
DC culture has a Lighthaven-shaped hole. Politicians have started to notice that they are confused about the future of AI. AI Policy nonprofits rent event space, mostly bars and restaurants for expensive and echoey events to grab a few minutes of staffers’ time. AI companies try to use these same spaces for technical demos, sometimes mixing beer and laptops with limited success. Technical communities of practice have unprecedented attendance as practitioners realize they need to upskill. EA and Rationalist policy organizations are scaling in DC, but each option for co-working space comes with significant downsides. Aligned conferences happen, but are held in either hotels with huge up-front costs or group houses well below their optimal attendee-count.
Resources are there to address all of these problems, people are working hard on them. But everything is scattered, hard to find. One step doesn’t necessarily lead to others. Imagine instead that someone approaching the community could have a day like the following...
A Day in the Life
Our protagonist is a tech policy staffer on a relevant congressional committee, mid thirties, has spent their career in positions of increasing authority in government and not-technically-government organizations. They’re an expert in telecom policy, or broadband, or electrical grid economics, or some other sub-field of technology policy, but now they need to learn about AI. The whole office knows there’s going to be a flood of AI bills in the 120th Congress, beginning January 2027, and there are only six people on the committee staff working on technology policy at all. Everyone needs to “get smart on AI,” immediately.
Through some coverage of a book with an edgy title, they understand that this topic is risky in some controversial way. A friend on a different committee recommended they meet with a particular non-profit. The non-profit has a few people in DC permanently, but as luck would have it, this is the week when some of the senior people are visiting from California. The committee’s available conference room only seats four comfortably, so our protagonist decides to go to them, meeting at their co-working space. It’s a lovely spring Friday in DC, it’s only a mile, it’ll be a nice walk.
When our protagonist arrives, they realize they’ve been here before. There was an industry event here last month, in the main room on the first floor, showing the capabilities of some new coding system. It seemed impressive, and they requested access once back at the office, but the Architect of the Capitol won’t let that code onto government systems for at least a year. That denial is what prompted our protagonist to gripe to their colleague on another committee in the first place, ultimately prompting this meeting.
This time, they go to the co-working space on the second floor. It seems… nice, if a bit weird. It’s hard to put their finger on why the space seems brighter and more alive than a typical WeWork. Some of the furniture is custom, fitting its space exactly without being ostentatious about it. Other pieces are clearly from Ikea, but work well enough. The space has all the cliche amenities of offices in the Bay, yet these actually seem to be used, several people are sitting on some plush carpets in a corner. There are whiteboards everywhere, just a ridiculous number of whiteboards, and even the windows… no that’s different, someone has put stained glass stickers in the top third of each. Why so many paperclips?
The meeting goes well. It narrows on a particular technical point about halfway through. The non-profit staff flag down someone walking by, who quickly clarifies that he’s with a different organization, but he joins in and within minutes is diagramming the disagreement on one of the whiteboards. It seemed silly, everyone knew what those words meant, but I guess it did clear up their confusion.
Now that our protagonist is following, they want to know more. As luck would have it, there’s a conference this weekend on-site. The monitor on the wall shows there’s going to be a session on this technical point in the evening, and a workshop tomorrow afternoon. Is it too late to register? Hmm, let’s ask the organizer, they’re probably setting up downstairs. They find him avoiding the choreographed chaos in one of the many nooks, rearranging the schedule for the seventh time. There have been a few cancellations, we can print another badge…
To start to put together something like this, we need to figure out the smallest plan that might work. I believe a Minimum Viable Lighthaven requires a few key features:
- Permanence - dedicated space that we can change to our liking, redesign to support good conversation and thought. Either owned outright or a long-term lease with stable financing.
- A large room - for speaking to all attendees of an event at once.
- Nooks - lots of smaller spaces for conversations, with at least some physical separation and sound dampening.
- Good location - a retreat venue you go to twice a year can be inconvenient, but this should be the obvious choice for a variety of events.
- Consistent leadership - an ultimate decisionmaker, whose decisions stick.
Some features that are not strictly required, but are very nice-to-have if we can, include:
- Segmentable space so that smaller groups can hold events at the same time.
- Dedicated office space, segmented into rooms.
- Good architecture.
Hotel rooms would be a mixed blessing. They would allow us to host weekend-long residential retreats, as Lighthaven does, but it’s extra space that we’d have to purchase, maintain, and manage. This could double the overall cost of the project, without necessarily doubling steady-state revenue or providing as much value as the conference space itself. If the right property comes with hotel rooms, it could be worth it, but I think we should prefer to keep them to a minimum. And more practically, DC has avoided Berkeley’s market failure in lacking hotel rooms.
…so you mean a Group House?
Could a large group house qualify as a Minimum Viable Lighthaven?
Workshop House in DC is a case in point. It’s gorgeous, the residents and leadership are friendly, they host excellent gatherings directly and rent their space out to outside events. I’ve hosted a large event and several smaller gatherings there, they’re fantastic to work with. But it’s telling to see where even such a successful property and institution falls short of what we’re looking for.
The trouble is that it’s primarily a residence, the needs of the residents come first. Only about 2,500 of its more than 7,000 square feet is available for event use. Its largest room holds about 60 people, tightly. When booking, a lot of decisions need to be run by several stakeholders, getting to “yes” on specifics inevitably takes time. Their space is something the residents can graciously offer for up to a few days, as opposed to dedicated event space.
Originally, residents were excited about renting out the ground floor open space fairly regularly (e.g., for a nonprofit’s quarterly board meetings). Over time, they found that this was more disruptive than additive, and have limited outside rental or space loans only to those which overlap heavily with the interests of the existing residents and community. Outside groups are now welcome to rent the space for aligned, recurring events up to twice a year, but the house is much more interested in hosting one-off, pilot events and gatherings.
It’s tantalizingly close, but I think experience shows that primarily residential spaces don’t have the quality we’re trying to capture. Even if no one lived there, the floorplan would typically not work well. Conferences and retreats want some sort of large space that can hold everyone, at least briefly. I can’t find hard-and-fast rules, but I think this implies about a quarter of your total programming space should be a single room. Smaller houses can meet that criterion, but larger houses don’t tend to have a large single room that scales with the number of bedrooms unless it was specifically designed for entertaining. Even Lighthaven struggles by this measure, with the largest sessions of LessOnline and Manifest straining Rat Park (which holds about 300).
…so you mean a Co-Working Space?
Yes and no.
From what I can find, non-profit Co-Working Spaces in our community don’t tend to be self-sufficient. NET in DC, Mox in SF, Collider in NYC, and I believe Constellation in Berkeley each use grants and/or donors to sustain regular operations. While the organization could certainly seek grants for special projects, we’d want to avoid institutional fundraising to sustain regular operations. Churn is a big part of this, people leave co-working spaces when they start working for a larger org, when their project fails, and when their project succeeds and outgrows the space. Given the many ways individual co-working users can exit, even if you manage to fill the space briefly, you won’t stay full without a strong pipeline of new entrants, which can risk the institutional culture.
Further, spaces that are primarily offices just don’t feel comfortable to use. Even when the space is comfortable for the workers, it isn’t when using the space for other events. It can feel like an intrusion, there’s a friction to moving someone’s desk aside, or sitting down at it. IMO co-working spaces make good overflow space, and reasonable break-out spaces at conferences, but should not be a majority of the space. They certainly should not intrude into the main large event hall, which should be optimized for events.
That said, I think co-working is a crucial component of this project. Having our community use this project as office space helps establish it as a default meeting space, seeds conversational serendipity, and even makes it safer by providing more eyes-on-the-street. It brings in some revenue from the property during weekdays, which are otherwise hard to rent to events, and creates a built-in audience for evening talks and events. Co-working space could be a practical substitute for Lighthaven’s Scholars-in-Residence. Co-working space is a key pillar, it just shouldn’t be the main focus of the plan.
A Minimum Viable Lighthaven DC as envisioned by this feasibility study would have three main lines of business:
- Conferences
- Professional policy-focused conferences during the work week.
- Looser, more Lighthaven-style conferences on weekends.
- Nonprofit and Corporate Events
- Mainly evenings during the work week - a staple of the DC policy world
- Co-Working for aligned organizations
- Likely including intensive fellowship programs, such as Inkhaven.
In my rough estimates, it’s difficult to make a venue self-sufficient with any one of these uses; which is why this venue doesn’t already exist. Including any two of them should be self-sustaining, even at 60-70% occupancy. Doing all three adds complexity, but each synergistically reinforces the other, three legs of one stool.
Given that, I think we want to buy a church. Failing that, a school, an embassy, or a small hotel.
We probably don’t need a campus, specifically. The climate in Berkeley is obnoxiously perfect for outdoor use much of the year. Temperatures rarely drop below freezing or exceed 85 degrees Fahrenheit. Most rain falls in December through March, leaving eight months of drier, warmer weather. But even this is underselling the usefulness of the outdoor space at Lighthaven: I began writing this document in the Gazebo of Schemes on a bright, clear day that was too warm for a sweater… in February. This lets outdoor space double as programming space, significantly expanding the campus’s usable square footage and making the buildings feel more connected. When warm days transition to cool evenings, guests gather around fire pits or huddle under blankets in nooks. DC is not this way. Summers are hotter and more humid. Winters are colder, occasionally with snow. Spring and fall tend to be nice, but are unpredictable. Event organizers in Berkeley can plan on outdoor space being usable; organizers in DC cannot rely on outdoor space in the same way.
When looking for a site in DC, we should consider the current zoning and historic use of the property. The Rose Garden Inn was zoned as “Avenue Commercial,” operated as a hotel, and had a demonstrated history of event use. The hotel license was included in the purchase and no zoning variances were required. The new owners continue to operate the property as a hotel that also rents conference space, there was no legal change in use. If Lighthaven had been zoned residential, it would have required a vote of the Berkeley city council to change its permitted use, adding at least a year and substantial risk of a “no” to any project’s timeline. Instead, Lighthaven operates “by right,” which should mean it doesn’t need much from the city.
In practice, Lighthaven works closely with the city, requires permission and permits for most improvements, some repairs, and even occupancy. Permits and inspections in old properties, especially those designated as historically relevant, necessarily require subjectivity. It often isn’t practical to bring a historic building up to modern code, but it is a judgement call just how much improvement to require. Unrelated matters, such as neighbors’ perception of how often Lighthaven guests park legally on residential streets, are not supposed to be relevant to those decisions… and yet… that’s just how people work. It is important to strive for good relationships with one’s neighbors regardless, but any city's politics may have more veto points than a straightforward reading of the law would imply.
How large should this property be? Lighthaven can comfortably host conferences of about 500 people, and parties of roughly twice that number, using about half its 30,000 square feet available as public space. This suggests a rough comfort estimate of about 30 square feet per conference attendee, though we’ll want to aim a bit higher if not relying on outdoor or private hotel space as pressure valves. However I don’t think we need quite as much public space as Lighthaven has. It might be better to aim for a property with about 10,000-12,000 sqft of public space; giving a capacity of 250-300 for conferences or 500-600 for evening events or parties. This would be particularly appealing on sites with options for later expansion.
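The sizing estimate above is simple enough to check in a few lines. This sketch uses only the figures in the paragraph; the roughly 40 sqft per attendee is not stated in the source but is the density implied by pairing 10,000-12,000 sqft with 250-300 conference attendees, consistent with aiming "a bit higher" than Lighthaven's 30. The helper names are hypothetical.

```python
# Figures quoted in the paragraph above.
lighthaven_total_sqft = 30_000
lighthaven_public_sqft = lighthaven_total_sqft // 2   # "about half" is public space
lighthaven_conference_size = 500

# Comfort estimate: public square feet per conference attendee at Lighthaven.
lighthaven_density = lighthaven_public_sqft / lighthaven_conference_size


def implied_density(public_sqft, conference_capacity):
    """Square feet per attendee implied by a space/capacity pair."""
    return public_sqft / conference_capacity


def party_capacity(conference_capacity):
    """Parties hold roughly twice the conference number, per the Lighthaven figures."""
    return 2 * conference_capacity


print(lighthaven_density)                 # sqft per attendee at Lighthaven
print(implied_density(10_000, 250))       # low end of the proposed range
print(implied_density(12_000, 300))       # high end of the proposed range
print(party_capacity(250), party_capacity(300))
```

Both ends of the proposed range work out to the same density, so the 250-300 figure amounts to a single assumption of about a third more space per person than Lighthaven uses.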
So, this Minimum Viable Lighthaven would want a 12,000 sqft property with appropriate zoning, a history of event use, enough spare space to operate part of the property while other portions are being gradually renovated, with roughly 3,000 sqft of its space as one large room. This describes a church, in particular one with attached program space or a rectory. Other kinds of properties that might work include small schools or hotels, so long as they have an auditorium or other large event space. Embassies, or technically their chanceries, are another option; rare but appealing. Countries’ needs change over time, so chanceries do occasionally change hands as an embassy needs to upgrade or downsize. Mansions do not appeal unless already converted to and zoned for commercial use, such as for a wedding chapel.
This project is not necessarily dependent on what properties are listed for sale; the Rose Garden Inn was not. Lightcone Infrastructure approached its owners after scrolling satellite pictures of the East Bay on Google Earth to identify prospects. City churches often move to the suburbs as their membership ages and neighborhood tastes change, religious schools face similar dynamics. We might find a congregation willing to sell.
There are institutions building portions of this already. The Network on Emerging Threats hosts coworking space and monthly policy talks. It recently announced it was moving to a larger space, but the new space is less conducive to events. IFP and FAI host excellent large evening events on the roof of their office building, but the events have started to outgrow the space, the rentals are expensive, and FAI recently moved out of the building. EAs and Rats in the area also have a large social scene, with parties and socials most weekends, and policy-focused events most weekday evenings. A typical week has 8 or more public events scheduled, which are often constrained by the capacities of their venues. I estimate the community already spends over $50,000 in an average month on co-working and event space, not including the dedicated offices of larger organizations (like IFP) or larger, irregular conferences.[2]
Good commercial property in DC tends to go for up to $750 per square foot. (As an example, this property is larger and more ornate than ideal, but currently for sale at that cost.) Our desired 12,000 sqft should cost between $7 million and $10 million. Interest rates are higher now than when Lighthaven was purchased, but if we can find a philanthropist willing to lend at 8% interest, even $12 million, the top of that budget plus $2 million for capital improvements, would come to just under $1 million per year in interest. Lighthaven has $2 million per year in operating costs. A Lighthaven DC would be smaller, without hotel rooms by default, and in a region with a lower cost of labor/living. There would still be significant operating expenses (a full-time director, other full- or part-time staff, supplies, insurance, utilities, etc.), but I think $1 million per year is a reasonable budget.
Between cost of capital and operating expenses, the property would need roughly $2 million per year in revenue to break even. Events and co-working on the order of what the community already has today, while capacity constrained, could cover a third of this. Many existing events wouldn’t move locations, but a venue like this would be an attractive option for new events and co-working uses. Adding in some additional latent demand, one evening corporate rental per week, and one large weekend conference per month would get the property to break-even, with substantial room to improve its offerings if it can manage more bookings.
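The break-even reasoning in the last two paragraphs can be laid out as a short calculation. This is a sketch under the stated figures only: the $50,000 per month is the lower-bound estimate of current community spending given earlier, and the variable names are illustrative.

```python
# Figures from the preceding paragraphs.
loan = 12_000_000                  # purchase plus initial capital improvements
loan_rate_percent = 8
operating_budget = 1_000_000       # per year
community_spend_per_month = 50_000 # existing co-working and event spending, lower bound

annual_interest = loan * loan_rate_percent // 100
break_even_revenue = annual_interest + operating_budget

existing_annual_spend = community_spend_per_month * 12
share_covered = existing_annual_spend / break_even_revenue
remaining_gap = break_even_revenue - existing_annual_spend

print(f"break-even revenue: ${break_even_revenue:,}")
print(f"existing spend covers {share_covered:.0%}; gap ${remaining_gap:,}")
```

The gap is what the additional latent demand, weekly corporate rentals, and monthly weekend conferences would need to fill, and it grows if existing events do not move to the new venue.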
I think that a founder should not purchase property until they have secured at least the minimally viable amount of funding. There are a few key things this includes:
- Purchase of the property itself - likely secured with a mortgage
- Initial furnishing, repairs necessary to start operations.
- At least one, hopefully two, renovation projects in portions of the property.
- Enough runway in operating and mortgage costs to get to self-sufficiency.
That last bullet is likely to be the sticking point. Every project takes time to reach full operations, this would be no different. I estimate reaching self-sufficiency would take at least two years, unless cutting corners and limiting the ambition of the project to reach that milestone earlier. Depending on the size of the property, the amount of renovation desired, and how the space is configured, it could easily take three years, or even four.
I wouldn’t necessarily insist that this project have three full years of operating costs in reserve before buying a property, some revenue will come in before the property is self-sufficient. But if it were me, I would insist on at least two years of costs in the bank regardless of revenue projections. Lightcone bought Lighthaven during a time of abundant funding. When the funding situation and Lightcone’s relationship with grantmaking organizations changed, it had to run two large community fundraisers. These fundraisers were successful and gave the Lightcone team legitimacy, a broad community endorsement of their plans and strategy. But the situation was still regrettable, there was a very real risk of losing Lighthaven, along with all the resources spent to renovate it to our specific uses.
If this project is worth funding, if its director has the faith of grant-makers or other philanthropists, they should get the resources to see their vision through. The project will almost certainly look like it is failing at the 15-month mark, with renovation timelines slipping and paid bookings still scarce. Even when renovations are done, it will take some time to build a reputation as the obvious choice for certain events. It would be a disaster, a huge un-forced error, if the director has to fundraise at those points just to complete the project. There should be checks on the director and the project, but that oversight should come from the board, not intermediate project fundraising goals.
All this taken together, I estimate Minimum Viable Funding would require about $18 million in total. Roughly two thirds could be in the form of a mortgage; the rest would be a grant:
- Loans: $12 million financed by a mortgage
  - $10 million purchase price
  - $2 million in initial renovations
- Grants: $6 million
  - $1 million in startup costs - furnishings, equipment, electronics, sound, permits, legal fees.
  - $5 million in operating and capital cost runway (2.5 years).
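As a quick sanity check on the breakdown above, the line items can be totaled in a few lines of Python. This is only a sketch of the arithmetic: the dictionary keys and the `summarize` helper are illustrative names rather than part of any real budget tool, and the dollar figures are the rough estimates quoted above.

```python
# Figures in millions of USD, copied from the estimate above.
LOANS = {"purchase price": 10.0, "initial renovations": 2.0}
GRANTS = {"startup costs": 1.0, "runway": 5.0}
RUNWAY_YEARS = 2.5  # period the runway line is meant to cover


def summarize(loans, grants, runway_years):
    """Total the budget lines and derive the ratios quoted in the text."""
    loan_total = sum(loans.values())
    grant_total = sum(grants.values())
    total = loan_total + grant_total
    return {
        "loan_total": loan_total,
        "grant_total": grant_total,
        "total": total,
        "mortgage_share": loan_total / total,
        "runway_per_year": grants["runway"] / runway_years,
    }


summary = summarize(LOANS, GRANTS, RUNWAY_YEARS)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The mortgage share comes out at two thirds of the $18 million total, and the runway works out to $2 million per year, which lines up with the roughly $2 million per year break-even figure given earlier.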
In the course of my interviews, I promoted one item to the “required” list from the “nice-to-have” bracket. Again and again interviewees brought up the quality of the leadership, that a single person should be responsible for implementing and living with their decisions. Every interviewee stressed the need for a passionate founder engaged over the long term. Several interviewees also stressed unity of command, that decisions need to have some single person who owns the choice and cannot be overruled, short of the decisionmaker being fired. I found this perspective convincing.
There will be part-time roles at a place like this, but the chief executive cannot be one. That person will need a rare combination of skills:
- Decisiveness - comfortable making decisions on the fly under uncertainty.
- Alignment - Understanding of Rationalist and EA ideas and culture, enough of a perspective and a strong enough worldview that they’re not convinced by the most recent strong personality they’ve met.
- Taste - An eye for design in the physical world. Direct design work can be delegated, but they will need to have the ability to tell why one option works and another does not.
- Political Skill - Venues like this need to be used a lot for the economics to work. The leader needs to present as acceptable to many factions of politics, to make friends more easily than enemies. Needs to see and deescalate social conflicts between staff, vendors, attendees, organizers, etc. Needs to have spent significant time in DC to know, intuitively, where the political traps and mines lie. Needs the taste to know which battles to avoid and which are worth fighting regardless.
- Energy - Days will be long, particularly when a conference is underway.
- Extroversion - Must genuinely like people and highly value the interactions between people who they may not agree with. They must value building a neutral institution more than direct work in their preferred cause area or speciality.
- Integrity - Will be trusted with tens of millions of dollars in property and accounts. It’s not enough to have integrity, they must also demonstrate it to grantmakers, philanthropists, their staff, and to a lesser extent even to conference organizers.
The leader of this project is the hardest constraint to satisfy. There’s a very short list of people well qualified in each of these categories; most have other jobs that they seem to like. I think there’s a longer list of up to a few dozen people who excel in most of these criteria, who with enough self-knowledge and humility could manage to cover for their lack in an element or two.
This job would be incredibly rewarding, personally and professionally. This must not be a volunteer role. I believe the community would be willing to pay well, on the order of $200k or more depending on experience, to have this job done well.
If the description above sounds like you, get in touch.
This property should appeal to more than just the rationalist community, as Lighthaven already does. AI companies already rent bars, restaurants, and conference space to do technical demos for government and NGO staffers. There are also centrist-leaning political movements with some popular support and donor interest that could be interested.
As Lighthaven is the cultural headquarters of the LessWrong community, a Lighthaven in DC could position itself as the cultural headquarters of the Progress Studies branch. This would give it a compelling raison d'etre: Lighthaven West generates the ideas, Lighthaven East gets them into the posting-to-policy pipeline. A focus on Progress Studies may also attract more interest from donors.
In policy spaces, opposing political factions interact socially more than many people outside DC would expect. Renting space to politically relevant actors is tricky; we wouldn’t be as neutral as, say, a bar or hotel. But I think there is room in the center to rent to both the center-right and center-left, Anduril and Anthropic, without making too many enemies.
Crafting this coalition, determining who this space is ultimately for, is not a one-time decision. It is decided day-by-day as the director makes a series of small decisions and actions that accrue into a reputation. Which booking gets the popular weekend? Which organizations get discounts? Who are the first people invited to co-work, the second wave that builds off of that founder effect? When there are inevitable fights about associating with people one side or the other considers bad, where do they draw the line? Who’s worth defending? Who should be excluded, despite their popularity? These choices cannot be realistically delegated, because one of the skills is noticing that an issue is culturally-relevant at all.
I described Lighthaven as a monastery because its cultural output is ultimately the point. This is also true of any potential DC version. Its director will need to understand and be comfortable wielding cultural power, as much or more than the temporal power over the space. And this path will have to be plotted largely without a map.
“Lighthaven East” is a working title for the project, but should not be the name of the space itself. We will need to avoid anchoring too closely on Lighthaven’s culture and norms, for two key reasons. First, a lot of what makes Lighthaven work well is adaptation to its environment. It uses the climate well, and it plays off norms the Berkeley community has spent over a decade honing. Second, the game is rigged: it’s exceedingly difficult to beat Lighthaven at being a Lighthaven. The new project will need to build its own coalition, niche, and animating spirit, so that it can be well adapted to its new environment.
I feel strongly that it should not be named for the lead donor. I think it’s reasonable for the donors to have input into the name, location, and aesthetics. But naming the venue after a donor is too far. It’s a bit tacky, implying a vanity project, but more importantly it weakens the authority of the Director. Besides, there’s a certain cachet that comes from discovering an open secret. Let people have the fun of asking, “Huh, I wonder who’s behind [final name]?” and then finding out.
One name that we’ve begun workshopping is “Posterity Center.” It hits several notes: it focuses on a long-term perspective, references the Preamble to the US Constitution, and alludes to Bayes. It’s not quite perfect. It’s a little too long (it feels like four or five syllables should be the limit), but alternatives don’t quite work; “Posterity” doesn’t feel like it works on its own, and “Posterity House” feels slightly too informal. It also seems a bit too… something… maybe self-serious? I recall, however, that I first thought the name “Lighthaven” was too pretentious even for me, and yet it has certainly grown on me.
We welcome more name ideas and feedback, and I’ll create a parent comment to collect name discussion in one place.[3]
Twitter recently discussed a coming wave of non-profit demand and funding. Could we do more with more?
In short: yes, we could!
The additional funding would have to come at the start for maximum leverage. When shopping for an initial site, more money gives many more options; later expansion is constrained to the area surrounding the chosen site. A larger space would probably need more time to reach self-sufficiency, since I expect use of a space would scale mostly independently of its size, at least at first. Stable, committed funding would be key for larger plans; larger venues would come with at least proportional increases to the operating expense, renovation, furnishing, and runway budgets. There’s an argument that larger properties might need more-than-proportional increases in non-property costs, meaning that overall project cost scales faster than total square footage, given the harder path to self-sufficiency and higher risks to the core project.
All that said, there are major advantages to larger spaces. Many benefits are obvious: more capacity for events and organizations, political proof-of-work (i.e., prestige), and more option value for smaller events to spread out, avoiding the deeply unpleasant coordination failure of everyone trying to shout over each other in a small space. Others are more subtle or speculative. Lighthaven aspires to be something like the Bell Labs of old, but notably lacks project space. More space means more room, literally, to experiment with function, such as allowing overnight stays, more permanent setups for meetups and organizations, different kinds or styles of renovation, even lab or maker spaces. I would not argue that a larger project is higher EV per dollar than a 12,000 sq ft one, but in success it would clearly be more impactful in total. In a hits-based giving model, more funding buys more potential upside, since any successful experiments could be scaled.
Another benefit concerns location. Capitol Hill has some unique features; all else equal, it’s the neighborhood we should prefer. There are lots of sites in the 5,000-8,000 sq ft range, too small for our use, and some appealing options closer to 20,000 sq ft, but Capitol Hill is comparatively weak in the 12,000 sq ft range. More ambitious funding would make proximity to Congress more feasible.
This would be a high-profile endeavor. Because policymakers in DC would be a key target audience, failure would be much more visible and salient to them than any other comparatively-sized project. This sort of reputational damage is hard to quantify, hard to even describe, but very real. I do not think even a high-profile failure would be as damaging as the FTX collapse was to the community, but worst-case scenarios could approach perhaps a tenth of that reputational damage if handled poorly.
These risks can be mitigated, though not eliminated, with a strategy of “Don’t do stupid shit.” This is yet another reason why the director of this project needs an easy familiarity with DC culture and norms: what constitutes “stupid shit” is not always obvious to those who haven’t spent time in the Beltway. We should aim to scale the project deliberately, and somewhat quietly at first. We should inculcate an appropriate institutional culture before inviting high-profile political figures.
The organization, its leadership, and its funders would become the target of opposition research, would need to avoid certain impropriety. Yet mitigating this risk can, and often does, go too far. I personally worry that the EA community has over-learned the lessons of FTX, and tries too hard to appear normal. Those associated with the project should be and appear to be trustworthy, should be and appear to be reliable, but should not try to pretend their views are more mainstream than they in fact are. The point would be to promote outside-the-Overton-Window ideas in a high-profile, high-trust way, giving us more credibility when our ideas later prove right. The project shouldn’t sacrifice this key feature in an attempt to seem more politically palatable. We should not pretend this project is a normal thing to do. It should be weird, and edgy, and cool, just in a well-calibrated way.
Basically, I’m saying we should have robes but only break them out for parties.
More mundane risks to the feasibility of the project include:
- A lack of demand in corporate or institutional bookings.
  - In general, the project could prosper without certain categories of corporate bookings. If the hyperscaling AI companies instruct their staff to stay away, the project could still be successful. However, I do not think the project would fare well without any corporate bookings at all.
- Changes in political atmosphere that lead the community to abandon its push into DC.
- Ideological capture by commercial or political interests.
  - Ideological capture is both an indirect risk to reputation and a direct risk to the project’s goals. This project should fill a neglected niche, rather than becoming a tech-flavored political club.
  - Too close a relationship with the AI labs could cause other organizations to pull punches or otherwise water down their messages.
- Regulatory risk from the DC government.
  - The project should specifically budget for ways to be a good neighbor.
- Site risk, mold, environmental remediation, etc. Even if hazard losses are insured, they would dramatically shift the project’s timeline.
- Personnel risk, embezzlement, theft, scandal, etc.
There are three key blockers for this plan: a founder, money, and a site. In an ideal world, a founder would step up, approach philanthropists and arrange financing, then simply buy the best option available for sale. Straightforward, sequential, looks great on a Gantt chart.
It may not be that simple. Three-way matching problems are notoriously difficult, because each of these elements feeds back into the others. Philanthropists may not be willing to commit until an ideal site comes on the market. Some potential founders may be more willing to work with certain philanthropists or organizations. Other potential founders may be more dependent on the site, perhaps more willing to run a smaller venue, or one that is Congressionally focused.
In practice, whichever constraint is filled first will exert outsized control. If someone has a site to offer, the rest must either cohere quickly or not at all… whether the site is the best available for our purposes will be de-emphasized. If a funder gets excited before a founder, we risk a muddled vision, inconsistent execution, and different departments optimizing for different goals. These problems can be avoided if the vision comes first.
If this calls to you…
the lightcone needs you to lock in son
https://x.com/uneventual/status/1991692767001735510
- ^
Start with Every Bay Area Walled Compound.
Then, Scott Sumner:
For Jorge Luis Borges, paradise was a library. At nearly 70 years of age, I’ve found my paradise at Lighthaven, which recently hosted meet-ups for Less Online and Manifest over back-to-back weekends in Berkeley, California. I know of nowhere else on Earth where I can find so many interesting conversations in such a compact area.
Scott Aaronson, writing "Guess I'm a Rationalist Now" after his first visit:
The conference was at Lighthaven, a bewildering maze of passageways, meeting-rooms, sleeping quarters, gardens, and vines off Telegraph Avenue in Berkeley, which has recently emerged as the nerd Shangri-La, or Galt’s Gulch, or Shire, or whatever. [...] What I’ll remember most from LessOnline is not the sessions, mine or others’, but the unending conversation among hundreds of people all over the grounds, [...] It felt like a single conversational archipelago, the largest in which I’ve ever taken part, and the conference’s real point.
Many more experiences over the weekend feel almost too personal, too meaningful, to shout to the open internet: dinners and meetings and conversations with people building local cultures so achingly beautiful they feel almost like dreams, conversations stretching late into the night, serendipitous meetings with longtime friends whose faces I can now put to names.
That magical feeling of serendipity, where you can flow through a space, passing from conversation to conversation, contribute to each one in turn, and have others do the same for you.
For further design details, see this interview:
- ^
Note that this and following are rough financial estimates suitable for judging feasibility. We can elide a lot of detail, for now, since Lighthaven serves as a benchmark for comparison. A full project or grant proposal would have much more detailed budgetary estimates.
- ^
Comment here:
Barriers to a Prosperous Future
The current race towards producing general artificial intelligence systems brings with it severe risks, yet no AI company developing frontier models is addressing these risks at a level proportional to the pace of development. The rapid integration of this poorly-understood technology into nearly all aspects of society is precarious at best, and catastrophic at worst. If progress trends continue, we will need a monumental level of investment in enhancing our robustness to these risks in the coming years. What follows is a summary of my understanding of these risks, a description of those most concerning to me, and finally what my personal plans are to mitigate them.
Types of Risk
It can be useful to categorize risks from advanced AI into three broad categories: misuse, misalignment, and systemic. Misuse refers to malicious actors (individuals, groups, or states) being enabled by AI systems to achieve nefarious objectives, for example by generating personalized misinformation at scale, hacking their adversaries' critical infrastructure, or building powerful weapons. Misalignment refers to AI models failing to truly acquire human values, leading to unpredictable and undesirable behaviours in out-of-distribution environments. Finally, systemic risks are those that arise from our complex and vital societal systems becoming dependent on a technology which we neither fully understand nor control, leaving us vulnerable to their unpredictable interactions.
Even AI researchers themselves understand shockingly little of how frontier AI systems reason and make decisions, and progress in this area is worryingly slow compared to the pace of AI capabilities development: by METR's measure, the length of tasks that frontier models can complete is currently estimated to be doubling every 7 months or less. [1]
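To make the 7-month doubling time quoted above more concrete, here is a minimal sketch of what a fixed doubling time implies over longer horizons. It assumes the trend simply continues unchanged, which is an assumption rather than a forecast, and `growth_factor` is an illustrative helper, not something taken from the cited source.

```python
def growth_factor(months, doubling_months=7.0):
    """Multiplicative growth after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# Print the implied multiplier at a few horizons (in months).
for horizon in (7, 12, 24, 84):
    print(f"{horizon:>3} months: x{growth_factor(horizon):,.1f}")
```

Under that assumption the measured quantity grows a little more than threefold in a year, and twelve consecutive doublings (seven years) compound to a factor of 4096.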
As for "aligning" models with human values, the best techniques these multi-billion dollar companies have developed are fundamentally surface-level and have, without exception, failed in various ways, including being bypassed by clever prompting, also known as "jailbreaking". Some worrying behaviours were discussed in Anthropic's own recent report on Agentic Misalignment. If we are to entrust critical parts of our society such as education, healthcare, cyber security, and political advising to these systems, the emerging sciences of alignment and control will play a crucial role in doing so safely.
Gradual Disempowerment
While the categories of misuse and misalignment are slowly gaining attention in public discourse, government, and academic research, in my view the most complex category, systemic risks, is currently largely neglected. Risks in this category can be subtle, developing quietly in the background, their consequences first becoming apparent when large parts of society are already dependent on these systems, by which point a large amount of the damage may already be done. Many of our critical societal systems are already complex (e.g., economies, governments, healthcare) and have been tuned over many decades to function robustly. The rapid interweaving of AI into these systems may make them harder to control and predict and lead to unintended consequences such as a reduction in human empowerment. A line of reasoning within this category that I find particularly concerning is known as gradual disempowerment, which refers to the incremental loss of human influence as a result of having more competitive machine alternatives to humans in almost all societal functions.
In the original work that introduced the term [2], the authors argue that as AI systems begin to represent ever larger shares of the labour market we might expect the economic role humans play to be reduced, and in turn so too the economic power they hold. Unlike with previous automation where humans could transition from more narrow to more complex work, AI threatens to claim all cognitive tasks, leaving humans with no higher, more cognitively-demanding roles to move to. Without labour, money will by default cease to flow to most individuals, potentially leading to drastically increased wealth inequality. Further, they argue, the economy has always been roughly tied to human preferences, where businesses only survive when they have a paying customer base. In an AI-driven economy, this tie may loosen significantly, leading to markets that cater to those systems, rather than to human values and preferences.
Additionally, they argue that as advanced AI systems become integrated into the creation and consumption of cultural artifacts, we could see our cultural norms be significantly disrupted, in a similar way to content creators today catering to "the algorithm", but amplified greatly. While previous cultural practices have always had an evolutionary pressure to in some way benefit humans, in a world where humans are no longer the only producer and consumer of culture, this "antibody" effect may be lessened, leading to potentially maladaptive cultural practices. Additionally, the apparently alluring promise of always-available hyper-personalized AI therapy, coaching, and even companionship could begin to outcompete humans, even if objectively lower quality and lacking emotional depth, simply due to the ease of access and adaptability of such systems. For a more in-depth discussion of this topic, see Kulveit et al. 2025 [2].
Rate of Development
How much time should we expect to have to solve these problems? One important factor to consider is that AI companies are likely to invest heavily in improving their models' abilities to carry out autonomous AI research, given the immense potential economic value of doing so [3]. Even if progress is initially impeded by bottlenecks like model unreliability, human approval, and limited compute and energy, the incentives to unblock these will be so great that before long we should expect solutions to be found. Frontier AI companies are well aware of the bottlenecks to their growth trajectories, and are working hard to pave the path towards training models that are orders of magnitude larger and more capable than today's [4].
This dynamic, known as "recursive self-improvement"—AI systems tasked with improving themselves—is already happening to some degree today, and it is likely to lead to an ever-accelerating rate of development of AI capabilities as more capable models provide ever-stronger "uplift" to human researchers. If models surpass the threshold of being capable of operating largely autonomously—that is, producing hypotheses, developing efficient tests of those hypotheses, and analyzing the results to make iterative improvements to themselves—we might experience an "intelligence explosion", wherein countless digital minds running 24/7 at superhuman speeds—a "country of geniuses in a datacenter" [5]—drive rapid progress in AI research, on a timescale we couldn't hope to keep up with. For further discussion on this topic, see Forethought's Three Types of Intelligence Explosion.
For the above reasons, there is a real chance that AI systems with human-level capabilities across all fields, often referred to as artificial general intelligence, or AGI, could be developed within the coming 5–10 years, with many estimates from AI researchers and forecasting experts converging around the year 2033 [6][7]. While previous technological revolutions developed at a pace that allowed humanity to gradually adapt laws, cultural norms, and education over the span of decades, the rate of change we can expect in an AI-powered future will be entirely unprecedented and force a significant reorganization of many parts of society, possibly in an astoundingly short timeframe. Therefore, it is imperative that we greatly increase investment in fortifying all parts of society.
Mitigations
In order to strengthen our defenses against these risks, we will need to devote historic amounts of capital and effort in the coming years. We will need thorough and continuous measurement of our reliance on AI systems to have metrics to guide discourse and to use as a basis for enacting critical policy. Additionally, we will need to conduct research on how we can use AI systems in a sustainable way that benefits us in the long run.
To this end, we greatly need more research organizations like Epoch AI, measuring and forecasting AI progress, and METR, conducting in-depth research such as that described in their Frontier Risk Report. Crucially, we additionally need much more research examining the usage and impact of AI on all parts of society, such as the Anthropic Economic Index, communicated widely.
Furthermore, we need to build tools that will make our society more resilient to shocks resulting from the integration of AI. Beyond these tools and improved alignment methods that enable training models that robustly behave in pro-human ways, such as refusing to display shallow imitations of affection, we will also need significant international regulation. AI legislation has to a large degree thus far focused on present-day harms such as deepfakes and disinformation [8]. What's needed on top of this is legislation that specifically mitigates disempowerment.
Can't we just pause development of frontier AI?
While some argue for a global pause on or a mandated deceleration of frontier AI system development (such as Pause AI), I personally believe such a pause is likely unachievable, and not even necessarily a net positive. A pause could potentially backfire as a result of pushing AI development underground, to actors less concerned with safety and who would withhold progress from the public.
The challenges ahead are great, but so too are the potential upsides. We still have time to act to prevent the worst outcomes, but the window may be closing and much work is needed.
What follows is my personal plan, given the context above.
My current plan is to contribute to mitigating the above risks in three primary ways: developing my ability to conduct technical research, fostering a local AI safety community, and exploring potential mitigations to gradual disempowerment risks.
Firstly, I will greatly develop my understanding of technical AI safety in the coming months in order to get a deeper awareness of the best tools we have developed for understanding and controlling frontier models. This is where I can best leverage my career experience, though my long-term focus may shift after this period.
Secondly, I plan to continue facilitating and growing a thriving local community of concerned individuals in order to spread knowledge, enable networking, and gather a wide array of perspectives. [9]
Finally, I plan to explore the ways I can contribute to building mitigations against disempowerment. This project may initially be developed either during a fellowship or independently. Some preliminary ideas include: building trustworthy open-source coordination tools for both humans and AIs or developing further the ideas proposed in Gradual Disempowerment, focusing on other societal systems. If promising, I can imagine founding a research organization that would work to further develop these mitigations.
Last updated: 2026-05-31
- ^
METR: Time Horizon 1.1
- ^
Kulveit et al.: Gradual Disempowerment, January 2025
- ^
Situational Awareness, Leopold Aschenbrenner: From AGI to Superintelligence: the Intelligence Explosion
- ^
OpenAI: Stargate
- ^
Dario Amodei: Machines of Loving Grace
- ^
Metaculus: When will the first general AI system be devised, tested, and publicly announced?
- ^
80,000 Hours: When will AGI arrive?
- ^
- ^
Notes on axes of variation in third-party risk assessment
There are many different activities that could be described as "third-party risk assessment". Here are some distinctions that I’ve found helpful thinking about the space over the last few weeks.
(Thanks Ajeya Cotra and Paul Christiano for discussions that inspired most of this.)
Throughout this, I refer to the actors as:
- Developers.
- Stakeholders. These are the people who want to be informed about risks. Possible stakeholders include: governments, the public, the developer's board, the developer's employees.
  - The choice of stakeholder matters because one of the roles of an auditor is to review confidential info that they then do not directly disclose to stakeholders; they only tell them their conclusions. This is a more important role if the developer is more concerned about disclosing confidential information to the stakeholder.
- Third parties. I don't know a better term for "independent actors who contribute in various ways to a stakeholder's understanding of risks through producing and evaluating evidence and/or arguments". Like, it's weird to call the physical security pentesting firm a "risk assessor". And AI Lab Watch isn't really an "auditor". And "evaluator" makes it sound like they run model evals.
The next step in the analysis will be to think about different objectives of third party risk assessment and think about how those interact with these axes of variation.
Axes
Fact generation vs evidence analysis
This is maybe the one that comes up most. I'll define:
- A fact-generation assessment is trying to answer some narrow question to produce a fact that will later be used to evaluate risk.
- Examples:
- Dangerous capability evaluations, e.g. METR and UK AISI's capability evals.
- Evaluating the robustness of a safeguard, e.g. classifier robustness redteaming or the David Rein red-teaming project.
- Pentesting to measure security.
- For many of those, a core reason why it makes sense to have a third party do them is that the process is sort of fundamentally adversarial—the final evidence will be of the form "I tried to demonstrate danger, but failed", so you're really depending on the person trying hard, and it's structurally easier to be confident that someone's not sandbagging if they're an unconflicted third party. Some cases are like monitor redteaming or pentesting where it's obviously adversarial; others, like dangerous capabilities evals, are cases where you're relying on the evaluator to do not-particularly-adversarial optimization (e.g. you need them to pick good hyperparameters for their fine-tune to elicit CBRN capabilities).
- If our claim is "you need a third party to do the optimization every time that you argue for safety by saying an optimization attempt failed", this maybe has broader implications, I'm not sure people are applying this consistently.
- For the others that reason isn't present. For example, for a cyber capability eval where the AI company basically does no extra elicitation, it seems like it would probably be fine if the AI company just ran the evals themselves—the produced fact isn't rendered particularly suspect by the fact that it was done in house.
- There are some other reasons that it might make sense to have a third party do the eval:
- They have expertise the AI company doesn't have in house. This totally makes sense in principle; it could totally make sense to centralize the expertise somewhere outside the company so that the AI companies can share it.
- The eval requires sensitive info that the evaluator doesn't want to give the AI co. E.g. sensitive CBRN stuff? This makes some sense.
- In general, eval scores can be increased or decreased by a developer who has access to the dataset, because they can train/iterate against it. But in many cases the developer can also game eval scores even without access to the dataset; e.g. they can probably mess with alignment evidence by training models to act aligned in weird situations. If an eval score won't be valid evidence without separately understanding whether the AI developer did anything that would compromise it, it's not clearly much worse for the AI company to hold the eval and run it themselves.
- An evidence-analysis assessment tries to combine many forms of evidence into an overall judgement about a high-level question. The central example of such a question is "how much risk to the world does this deployment pose".
- Examples:
- METR's review of Anthropic's sabotage risk report. (The sabotage risk report analyzes a narrower slice of risk, but it is an analysis that draws together many different facts to answer a natural question that's pretty close in the causal graph to "how much total risk is there". Unlike e.g. "how robust was this monitor", which requires a bunch of additional context to be interesting.)
- As a narrower example, our review of the OpenAI CoT training was us synthesizing a body of evidence into an overall assessment.
- This is what auditing is in the context of corporate accounting—they look at your books and ask you questions to assess the overall financial state of the company.
- Within evidence analysis, there's a distinction between reviewing a developer's argument vs. producing your own.
I think that evidence analysis vs fact generation are pretty different and it's not obvious that any projects or organizations should do both. The skillsets seem plausibly very different. Like, I suspect that during the main risk period, only a pretty small subset of the facts relevant to the risk analysis should be generated by third parties (as opposed to being reported by the company). So someone doing evidence analysis will naturally be operating several steps in the "evidence tree" away from someone generating a fact. And so I'm not sure there's that much synergy for these happening as part of the same project. E.g. I don't think it would be particularly useful for the security pentesters to work at the same org that does the overall assessment.
I think it's pretty weird that these are conflated so often (e.g. in lots of the widely-cited papers about AI auditing or evaluations, like Frontier AI Auditing and Model Evaluation for Extreme Risks); they seem really different to me.
Laundering private evidence into sharable conclusions?
One function of risk analyses is to inform a stakeholder about the risks posed by a developer. One strategy is to have an auditor who accesses private information from the developer and uses it to answer a question, and then gives the answer to that question to the stakeholder without telling them the private information that led to that conclusion. The other option is to do the risk analysis just based on information that the stakeholder already knows (or is cleared to know).
In this post, I call this “evidence-laundering”. I don’t mean this with a negative connotation and will probably choose a different term in future because the negative connotation is so strong. I just wanted a distinct term that emphasizes that information is being processed from private evidence to public conclusions to reduce disclosure of commercially sensitive information.
I think I want to define this as: did the conclusion of a report rely on facts that were not stated in the report? So it's not evidence-laundering if you report a straightforward fact that your readers need to believe that you're not straightforwardly lying about because they can't personally check it. But it is evidence-laundering if you don't report the facts and instead report "based on private facts, things look ok".
Examples of the evidence-laundering strategy:
- Pen-testing. It's totally normal to have a pen-tester say "we looked for vulns; I won't tell you what they are but here's how bad they were". And for the pen-tester to not tell the stakeholder various sensitive details.
- David Rein's pen-test of the Anthropic monitors.
- METR's review of the Anthropic sabotage risk report.
- Lots of private evaluations in other parts of society. In courts, a psychiatric evaluation involves the psychiatrist talking privately to someone and then stating the conclusion publicly without fully justifying it.
Examples of the just-public-evidence strategy
- Our review of the OpenAI CoT training thing. The workflow here was that they shared the draft with us a week or so before they published it, and we went through a few rounds of me complaining that the draft didn't have some important info that I needed to be convinced of their conclusion, and they iteratively added more info to the draft until I was pretty happy. So even though there was some temporary secrecy, we were fundamentally not in a laundering role: the thing that we released was based entirely on information that any reader of our blog post could read themselves. We were under our standard NDA, so they could have told us stuff during this process that we wouldn't have been allowed to repeat.
- AI Lab Watch. I think Zach sometimes emailed AI companies asking them questions that he said would affect their rating, and maybe they sometimes answered? As long as this works via him entering their answers into the public record, he's not playing the evidence-laundering role.
- The METR Frontier Risk Report is almost entirely this. Their report is entirely based on info that AI companies agreed to share publicly, except that some of these facts are published without attribution (that is, METR doesn't say which company said something), so METR is playing the evidence-laundering role of verifying that an AI company really did say something without revealing who said it.
- This is an interesting case because METR collected a bunch of non-public info during this project, but the only way this entered their report was by companies agreeing to make subsets of the info public.
- Evidence-laundering is a central part of what "Frontier AI Auditing" (2026) means by "auditing", which writes: "Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential".
The core strategic question here is definitely "when should you use evidence-laundering vs transparency".
- I think that transparency is probably better all else equal? Evidence laundering should probably be reserved for cases where it's extremely defensible for the developer to not want to publicly release certain info. I am extremely sympathetic to this for pentesting or other things where there's an imminent cost to releasing the full evidence. I am less sympathetic to e.g. developers wanting to keep the capabilities of unreleased models secret.
- But I think it might be pretty good if we can rely on transparency as much as possible.
- In particular, I think that transparency maybe makes it easier for risk evaluations to differentially pressure AI companies to do better. I am excited for projects like AI Lab Watch that do risk evaluations that compare the AI companies. This might piss off the AI companies, so it might be hard for them to have an evidence-laundering relationship. Their ratings will be more legitimate if they have the info they need to make their assessments. AI Lab Watch has the stance "if you don't give public statements explaining what you're doing on issue X, we'll assume you maximally suck"; perhaps stakeholders would think that it's reasonable to expect developers to produce public evidence of safety and therefore the developers will publish their evidence publicly. This allows us to have a robust mechanism for between-company pressure on safety for longer than if we tried to do all this without transparency, using evidence laundering.
A few notes:
- Reporting a score against a benchmark the evaluator built but hasn't released — "DeepSeek scored 40% on our private benchmark" — isn't laundering of the developer's information (you know the claim and whom to blame), but it does ask the reader to trust, sight unseen, that the evaluator built a reasonable benchmark. The developer itself may not even have access to check it. Releasing an i.i.d. subset of the benchmark resolves most of this.
- A report has to disclose how it looked. Every conclusion implicitly relies on the unstated fact that nothing else relevant turned up. So a clean report should say how it searched for other kinds of facts — including disconfirming ones — and report that there weren't any (or what it found). A report that presents its supporting facts and stops is laundering the adequacy of its own search.
- Whose secret? Company vs auditor laundering. It helps to split laundering by who owns the withheld fact. Company laundering hides facts about the developer's own system — details of security systems that a pentester is attacking, or training methods that an auditor wants to know about. Auditor laundering leans on something the assessor holds and won't share: a particularly strong example of this is hazardous knowledge like bioweapon-uplift details. Note that auditors very often have some information that they don't want to publish. Note that needing to trust the auditor did a competent job — e.g. elicited hard enough — is not laundering at all; that's ordinary trust in a person, not a hidden fact. The table below splits these into two columns.
Should you give bad scores to developers if they don't give you sufficient access, or should you just use your best guess? If you do the latter, then anyone performing surprisingly poorly is better off not disclosing this. But if you do the former, then your risk assessments can’t be taken at face value by third parties, and it’s easier for AI developers to discredit you by saying “that org has no idea what’s going on, obviously we’re way safer than they think”. This probably works especially well when they’re trying to discredit you to their employees or other groups who have more access to private info. As an example of that dynamic, AI Lab Watch initially rated Google poorly on security due to them not disclosing much about their security, and this led some GDM people to say they thought it was less credible or usable.
As I said in the previous section, it's maybe rough to get stakeholders on board with risk assessments that assume the worst, but maybe it's doable and I think we should plausibly try to achieve this.
Current risk vs preparedness
Are you answering a question like "are the current risks adequately handled" or are you answering "is this developer on track to handle risks later"?
The downside of analyzing current risks is that they're inconsequential and will probably be inconsequential up until shortly before they're severe.
The downside of analyzing preparedness is that you have to be much more opinionated about futurism and threat models; your reports will rely on assumptions that are much more contentious.
Cross-developer comparability
Are you trying to make it easy or hard to compare between companies?
- AI Lab Watch and friends are trying to make it easy. The advantage of this is that it might pressure companies to behave better or shift resources towards better companies (e.g. because people prefer working at companies they believe to be more responsible).
- METR's FRR went out of its way to make it hard. The advantage of making it hard is that companies are more willing to tell you stuff. (One way of thinking about this is that if you remove comparability, scary stuff the AI companies say has costs that are split among them and their competitors, which is way better for them on net than the costs being concentrated on them. In practice, AI companies also get some cred from safety people for producing evidence of scary stuff, so this effect hasn't mattered that much in practice.)
I think that cross-developer comparability makes much more sense if you're doing evidence analysis, because evidence-analysis assessments are more comparable between companies.
So kind of obviously, I think that the choice between these is basically a tradeoff between different goals: if you mostly want to pressure AI companies to behave better, then comparability is good; if you mostly want to inform stakeholders about the overall level of risk, then comparability is probably bad if you were also trying to do evidence laundering, because it makes it more costly for developers to share info with you.
Examples, classified against the axes above

| Project | Fact gen vs evidence analysis | Company laundering | Auditor laundering | Current vs preparedness | Cross-developer comparability | Reviewer vs producer |
|---|---|---|---|---|---|---|
| Dangerous-capability evals (METR, UK AISI) | Fact gen | No | Yes | Now | Easy | Producer |
| Classifier-robustness red-teaming for misuse prevention | Fact gen | No | Yes | Now | Easy | Producer |
| David Rein's red-team of Anthropic monitors | Fact gen | Yes | No | Now | Hard | Producer |
| Security pen-testing | Fact gen | Yes | No | Now | Hard | Producer |
|  | Fact gen | No | No | Now | Easy | Producer |
| Apollo in-context scheming evals | Fact gen | No | No | Now | Hard | Producer |
| METR review of Anthropic's sabotage risk report | Evidence analysis | Yes | No | Now | Hard | Reviewer |
| Redwood review of OpenAI CoT training, External review of DeepMind scheming-inability safety case | Evidence analysis | No | No | Now | Hard | Reviewer |
|  | Both | Slight (non-attribution) | No | Now | Very hard (deliberately) | Producer |
| AI Lab Watch, FLI AI Safety Index | Evidence analysis | No | No | Both | Easy | Producer |
| SaferAI risk-management maturity ratings | Evidence analysis | No | No | Prep | Easy | Producer |
| GovAI third-party compliance reviews (proposal) | Evidence analysis | Yes | No | Prep | Med | Reviewer |
| Brundage et al. AAL framework (proposal) | Both | Yes | No | Both | Easy (via AAL scale) | Both |
The columns are far from independent: the whole table can be recovered by a short chain of single-question splits. The tree below is the chain that minimizes expected questions-to-classify. Note that auditor laundering occurs only at the fact-generation end, and evidence-analysis assessors only ever launder company secrets or nothing.
The main impact from automated AI production: concentration of power?
There’s a lot of talk about automated AI R&D and the like. It’s been discussed since at least 1965 when statistician I.J. Good coined the term ‘intelligence explosion’:
an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make. — Good, 1965, Speculations concerning the first ultraintelligent machine.
Since the 2000s, a few researchers (recently increasingly many) have taken seriously the prospect that some kind of intelligence feedback mechanism could kick in around the time that AI becomes competent enough to contribute to its ‘own’ (or rather, its successors’) development.
This is increasingly recognised as a prospect we should take seriously. It’s been discussed somewhat furtively by leading AI developers for many years, but in late 2025 it became a public talking point. Sam Altman, CEO of OpenAI, recently set a target of fully automated AI-improving-AI by 2028. Jack Clark, co-founder of Anthropic, writes ‘AI systems are about to start building themselves’.
I’m not going to address the resulting prospect of speedup here, which I think is real, though perhaps surprisingly modest.[1] (If, like some of my friends, you think I’m underweighting this prospect, you should probably be even more concerned about concentration of influence.[2])
Rather, this note is briefly making the rather simple (but I think rather overlooked) observation: plausibly automated AI R&D would enable a (perhaps severe) concentration of influence… and this could be the most important effect to be paying attention to.
It’s a fairly obvious point once made. A more thorough analysis might shed light on how likely this is, the most important consequences, and any leading indicators to pay attention to. I’ll only make some starting gestures toward these.
As I presaged in Engineering a Safer World,
R&D automation…: fewer human participants means fewer whistleblowers, less internal scrutiny, less governance and decisionmaking robustness. More concentration of that influence. That could mean more single points of failure. It’s famously difficult to maintain a conspiracy of more than one or two people: ‘two can keep a secret if one is dead’, as they say! And besides conspiracy, compared to larger teams, individuals and small groups may be far more susceptible to capture, coercion, corruption, or plain foolishness and rash decisionmaking.
Whose influence?
To be clear, you don’t need to be concerned about any specific people currently in positions of particular influence to think that concentration could be very impactful (and concerning).
The wisest supreme leader is still a single point of failure[3]. Especially during a period of tumult, leaders can be displaced and reporting structures revolutionised. Concentration of influence — even the prospect of that concentration — simultaneously raises the incentives for corruption and power struggle (while raising the stakes for the rest of us). Meanwhile, power corrupts and a centralised decisionmaker can rarely account for all the relevant considerations, even if they mean to. In AI, we’ve already seen company leadership falling out over differences of opinion, fighting it out in the boardroom and in court, and even clashing with governments. Turf war over the means of production of AI (which may increasingly equal the means of production full stop).
The point is that losing checks and balances can be a problem whoever is at the top[4]. As in politics, so in the heights of industry — which, today, and increasingly, firmly includes AI.
Suffice to say that I imagine the kinds of roles in the AI oversight chain which might be (nominally, initially) preserved longest are senior research directors, company executives, and closely-involved regulatory or executive government officials.
Perhaps more important: whose influence lost? Directly, the skilled human labour which today is involved in researching, engineering, and deploying frontier AI systems. Via them, their wider research networks, social ties, and any societal oversight they might have provided (e.g. whistleblowing to journalists or authorities). Plausibly also the marketers, account managers, even lawyers and others who interface with what a company is doing with its AI products, and their networks. (If not themselves replaced, they’d lose collegial understanding and influence via the replaced researcher-implementers). In short: everyone else.
Sketched using claude.ai
Concentration without (overt) malice
Why shed the researcher-implementers, the deployment engineers, and others (or sideline them from the important ‘real’ responsibilities)? It doesn’t have to be a power-grab, at first (or even at all): it’s just good business! Human employees (especially software experts) are incredibly expensive compared with AI.[5]
Beyond this, analogously to areas where AI already surpasses humans (like chess), human participants may positively get in the way, slowing and compromising what would not only be a cheaper, but a more effective AI-only workflow.
A company might choose to instead eat this cost — perhaps it prefers to exhibit loyalty to current staff, acknowledges concerns around erosion of human oversight, or is inherently conservative in that way. But in a competitive environment (either commercially or in strategic and security terms), that may be tantamount to inviting defeat or irrelevance, if competitors are charging ahead.
At some point shareholders and other stakeholders might think it irresponsible not to embrace a replacement AI workforce. Ultimately this could add up to far fewer people having insight into what’s going on or a say in its direction. Even if you can’t (or don’t want to) immediately fire them, you might deliberately or even passively sideline your human employees as more and more work is carried out by machines.
Influence over what?
Notably, board, investors, auditors, journalists, and regulators (who already often suffer an information scarcity), and the broader society they represent, could also be substantially cut out if logging, analysis, PR, and incident reporting are automated… This might only require the usual levels of corporate paltering rather than special malice. It might even be prosocially-motivated if the relevant actors are (justifiably or not) concerned about a power-grab by government. (It could of course also be deliberately selfish.)
Conversely, a government or regulator which strong-arms its way into the midst of the development process might itself become the locus of concentrated influence. (This in turn might be superficially or genuinely motivated by national security concerns.) The more automated a firm, the easier it may be to seize, either overtly or through more subtle channels. When employees are critical pieces in a production process, that’s a taller order.[6]
Sketched using claude.ai
I’m not suggesting any of this leads (immediately) to influence over all the world (though if you do take over the world, please consider my friend Owen’s advice for what to do if you find yourself in that situation). Nor even over an entire economy or state. But minimally, this looks like highly concentrated influence over a frontier AI company: presently a rare, increasingly lucrative, and burgeoningly politically-influential machine.
Activity at that frontier — the decisionmaking by leading AI developers — stands a chance of being among the most economically, politically, and strategically important activity in the coming years. If the cat is out of the bag with AI, that frontier, steered wisely, might be an important part of a societal resilience and readiness process (compare the recent Project Glasswing in cybersecurity). Used foolishly or even maliciously, the frontier might degrade societal capacity right at a time when it could be most important.
Influence… or power?
To repeat: I don’t think it’s given (or even particularly likely) that the starting possessors of a hypothetical concentration of influence would remain in control. From the point of view of a rogue AI, for example, control over a frontier AI company might be a delicious opportunity for expansion, defence, and societal dependency: ideal routes to escalated un-unpluggability. Likewise from the point of view of a power-hungry politician or corporate psychopath. Notice that corruptibility and coercibility are perhaps more plausible with fewer hands: it’s less practical to threaten a whole team or company than a single person.
So if this hypothetical automation-driven concentration of influence is a concentration of power, I don’t expect the ultimate wielder to be worthy of it. This marks a true difference of opinion with some others I know, who might think it more likely that, say, some current AI company leader could both accrue and keep hold of this level of power, and also use it wisely.
The obvious paths to hard power via AI are in surveillance — newly scalable with AI analysts scrutinising as many people’s movements as desired — and in development and automated production of flexible, high-range tools of coercion like weaponised drones. Other kinds of power might come from positions of economic leverage or narrative and propaganda dominance, conceivable with advanced-enough AI and without societal defences.
Concretely, what might be different?
Let’s really quickly gesture at two cases which might be substantially different in a world where AI production is largely automated.
Imagine a near future: frontier AI company revenues are in the 100s of $billions or $trillions, AI services are used in civil services, executive decisionmaking, faith leadership, … That’s a lot of points of leverage for subtle manipulation. Suppose a leader (selfishly or under pressure) intends to inject secret loyalties into their products. If they’re sole controller without checks and balances, that could be trivial, with obvious concerning consequences. Without automation-replacement, there are perhaps dozens or hundreds of people with reasonable oversight per major release, and several people even for any given small change. That’s a far more difficult system to twist.
Alternatively, consider a company on the brink of truly autonomous, general, human-surpassing AI: a huge responsibility. Many ambitious but underwrought alignment targets have been suggested for such a project, many of them foolish, perhaps fatally so. A hubristic leader without serious checks might withdraw from critique, double down (perhaps sycophantically egged on by AI toadies), and drive through a ruinous agenda. Society at large might be none the wiser (or at least insufficiently alert and able to intervene). A leader willingly taking risks (perhaps with a selfish or misguided vision of unfathomable rewards) would be even more of a problem. On the other hand, a leadership more dependent on staff and teams may be subject to more psychological and cognitive support, scrutiny, moral encouragement, and so on. A literal healthcheck. Of course, human teams can be subject to concerning groupthink (not unheard of at AI development companies!), but the more off-base a suggestion is (ethically or pragmatically), the more likely it is that the collective intelligence of the team corrects it.
Any number of scenarios can be considered; rarely do they appear more promising from a societal perspective if a single or a few leaders are insulated from scrutiny and critique.
What’s the outlook?
It looks very unclear to me. Plausibly it’s in the balance and decisions and conversations now will shape how this plays out!
There are some reasons to think that (near-) automation of AI production could instead produce equal or even improved societal oversight.
For one, the sometimes unintuitive effect of automation is to increase per-worker productivity, making it more worthwhile to bring more humans into the process (even on myopic financial grounds). Historically, though, this effect has often presaged a sharp decline once automation reaches a sufficient level. This might make concentration actually more difficult for a time. I don’t expect this to play out for very long in AI frontier research[7], but I’m somewhat uncertain about this.
Directed sensibly, a glut of researcher-grade AI worker-equivalents could be tasked with analysis, scrutiny, decisionmaking robustness, even auditing and whistleblowing. There’s nothing in principle preventing this, for any given level of trustability in our AI. Perhaps we’ll end up bringing more people into the process in the increasingly important, complementary roles of oversight, decisionmaking, and direction-setting. These folks could be made more effective with AI tools and assistance. That’s a decision to be made.
As I suggested earlier, leaders perhaps soon faced with these temptations may balk, either at the moral prospect… or at the pragmatic prospect of being subsequently replaced, captured, or corrupted in the ways I hinted at.
Perhaps it’s a tenuous balancing act. Internally to a given AI producer, we might want to guarantee a large enough ‘in the loop’ workforce that we can trust internal deliberation and whistleblowing and so on to keep things on the rails. Outside that, we might want to avoid no-holds-barred proliferation of AI-producing capability — but we probably want at least a few developers to keep each other in check and reduce monopoly or kingmaker dynamics.
I don’t expect any such concentrations of influence to play out overnight. Loss of control is a process, not a moment. AI development organisations — and society at large — should be paying attention and having the necessary conversations. Much to look out for, and perhaps much to look forward to!
[1] Task-relevant data and compute look like the more biting bottlenecks, and research taste accrues mainly through expensive experimentation.
[2] (unless you expect such accelerated singularity that influence prior is all that matters)
[3] And may I be forgiven for slandering the current crop of potential supreme leaders as not very wise
[4] I would note that we usually need some concentrated executive and representative power in various roles in society, but we want those roles to be sufficiently monitored, and we want the processes which move people into (and out of) them to be sufficient to produce worthy selections.
[5] This ‘humans expensive, AI cheap’ point is worth taking some care over. Contemporary frontier AI training is getting more expensive by the year. Running the best AI can be increasingly expensive, because ‘thinking harder’ is an effective way to scale capabilities on current margins. But once a capability is unlocked at the frontier, it typically becomes rapidly and radically cheaper, due to ongoing rises in compute efficiency and the ability to distill long thoughts into quicker reflexes. Further, you only need to train AI to do something once, and then it can be copied and run as many times as you want; it doesn’t need to sleep or take family time or get sick etc.
[6] For example, witness the periodic furore among tech company employees when their work is used toward military ends (latest).
[7] One reason is that this research does not scale very well in parallel, so returns diminish steeply per worker, and we’re already imagining AI that can substantially slot in as a cheaper replacement worker in most cases anyway.
A Song About No
When Lily was about two she told me she wanted a song about "no". This was ten years ago, and I don't remember why she wanted this, but I made something up:
This is a song about no.
This is a song about no.
This is a song about no, no, no.
This is a song about no.
The song goes... No no no no no no no no no.
The song goes... No no no no no no no no no.
The song goes... No no no no no no no no no, no, no.
This is a song about no.
It's useful anytime the kids want a song about something I don't know a song about. For example, it often served as a song about "turtle". Of course the more syllables the subject has the harder it is to sing, but that just makes it more fun.
I applied my nascent music writing skills, and tried to set it down in dots:
Visualize Cyclical Structure in Llama Model
Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I think it is worth exploring what structures we can find for different concepts in LLMs. I made a tool that lets anyone generate and visualize 3D structures for an arbitrary concept. The tool currently supports cyclical and smooth structures, extracting activations at any given layer of the Llama-3-8b model. Check it out here:
https://neural-geometry.vercel.app
Introduction
Traditionally, mechanistic interpretability has focused on steering models with linear direction vectors. Underpinning this is the Linear Representation Hypothesis, which states that all features in language models are collections of sparse linear directions. While linear activation steering has proven useful to shift model behavior, it can also produce unexpected behavior. When the scalar's magnitude becomes too large, the language model often breaks down, producing garbled and incoherent text. Personally, I do not think that the linear representation hypothesis is sufficient to explain what is going on mechanistically in these large language models.
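As a rough illustration of what "linear activation steering" means here, the sketch below adds a scaled unit direction to a batch of hidden states and prints how the vector norm grows with the scalar. Everything in it is an assumption for illustration: the random `hidden` and `direction` arrays stand in for real Llama activations and a real concept direction, and `steer` is a hypothetical helper, not part of the tool.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalised direction to every hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 4096))   # toy stand-in: 5 token positions, width 4096
direction = rng.normal(size=4096)     # hypothetical concept direction

base_norm = np.linalg.norm(hidden, axis=1).mean()
for alpha in (0.0, 8.0, 64.0, 512.0):
    steered = steer(hidden, direction, alpha)
    ratio = np.linalg.norm(steered, axis=1).mean() / base_norm
    # Large ratios mean the steering term dominates the original activation.
    print(f"alpha={alpha:6.1f}  mean norm ratio={ratio:.2f}")
```

Once the added term dwarfs the original activation, the hidden state is far from anything the model saw in training, which is one plausible reading of why large scalars lead to incoherent text.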
Recent research from Goodfire has focused on exposing the smooth geometric structures that form for certain ordered concepts, such as days of the week, seasons, age, temperature, and more. Recurring ideas like days of the week and seasons form smooth cyclical structures, whereas linear concepts such as temperature and age form arcs. The work here is directly inspired by Goodfire's paper and research.
One of the most important steps in this pipeline is extracting meaningful activation vectors from carefully constructed prompts. For example, Llama may be prompted, "What day comes 3 days after Tuesday?" or "What day comes before Thursday?" Ground-truth labels are simply the correct answer to the prompt. Activation vectors are extracted from the last token to capture maximal semantic meaning.
The neural geometry tool uses the GPT-4o-mini model to generate a multitude of prompts with ground-truth labels for any given concept. This style of prompt is often favored for cyclical and linear concepts, where semantic meaning can easily be captured. One potential issue is that our choice of words may impose a cyclical structure that would not otherwise be there. For example, a prompt such as "What day comes two days before Wednesday?", repeated across every weekday, may itself be what produces the cycle. It would be interesting to explore whether the same structure emerges when a different style of prompting is used.
Projection Down to 3D
Raw activation vectors are first projected down to a 64-dimensional space. Here, the vectors are averaged for the same ground truth labels to produce centroids. These centroids are more stable in terms of their semantic meaning than any individual vector alone.
We do not project directly down to 3D; 64D serves as an intermediate step. The purpose of projecting down to 64 dimensions is to remove most of the noise present in a high-dimensional space such as 4096 dimensions. Important semantic meaning may be encoded in only a small number of dimensions. Computing centroids in 64D is also much more computationally stable than computing them in 4096D. The choice of 64 is somewhat arbitrary; 32 and 128 can also work just fine.
A smooth curve is fitted through the centroids in 64 dimensions and sampled at 300 points. The resulting spline curve is then projected down to 3D via PCA. This is the smooth curve displayed on the website once the full pipeline has finished.
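The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the tool's actual code: the activations are synthetic stand-ins (a noisy circle embedded in a 512-dimensional space rather than real 4096-dimensional Llama activations), PCA is assumed for the 64D projection, and a closed Catmull-Rom spline is swapped in because the post does not say which spline the tool fits. All function names are hypothetical.

```python
import numpy as np


def pca_project(x, k):
    """Project the rows of x onto their top-k principal components (via SVD)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


def centroids_by_label(x, labels, n_labels):
    """Average all vectors that share a ground-truth label."""
    return np.stack([x[labels == i].mean(axis=0) for i in range(n_labels)])


def closed_catmull_rom(points, n_samples):
    """Sample a closed (cyclical) Catmull-Rom spline passing through points."""
    n = len(points)
    t = np.linspace(0.0, n, n_samples, endpoint=False)
    seg = np.floor(t).astype(int)          # which segment each sample falls in
    u = (t - seg)[:, None]                 # position within the segment, in [0, 1)
    p0, p1 = points[(seg - 1) % n], points[seg % n]
    p2, p3 = points[(seg + 1) % n], points[(seg + 2) % n]
    return 0.5 * (
        2 * p1
        + (p2 - p0) * u
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u**2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * u**3
    )


# Synthetic stand-in for last-token activations: 7 labels (e.g. weekdays),
# 20 prompts each, arranged on a noisy circle. These numbers are assumptions.
rng = np.random.default_rng(0)
n_labels, per_label, hidden = 7, 20, 512
labels = np.repeat(np.arange(n_labels), per_label)
angles = 2 * np.pi * labels / n_labels
basis = rng.normal(size=(2, hidden))
acts = (np.cos(angles)[:, None] * basis[0]
        + np.sin(angles)[:, None] * basis[1]
        + 0.5 * rng.normal(size=(len(labels), hidden)))

reduced = pca_project(acts, 64)                      # high-dim -> 64D
cents = centroids_by_label(reduced, labels, n_labels)  # one centroid per label
curve64 = closed_catmull_rom(cents, 300)             # smooth curve, 300 samples
curve3d = pca_project(curve64, 3)                    # 64D curve -> 3D for display
print(curve3d.shape)
```

A closed spline only makes sense for cyclical concepts; for arc-like concepts such as temperature or age, an open curve through the ordered centroids would be the analogous choice.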
Conclusion
Overall, I find this to be one of the more interesting research directions coming out in the field of mechanistic interpretability. While linear directions have been shown to causally influence model behavior, assuming that all features are simply linear direction vectors seems to be a nascent theory of interpretability and alignment in general. Rather, the structures within language models are likely much more complex and have nonlinearities, which is why models break down when steered too far. I am excited to see where this research leads in the future. In the meantime, I hope this tool can make this field of research more accessible for everyone.
Discuss
Why I think evals are pretty important and most worth working on (for me)
An application response I wrote! Feel free to leave feedback!
What are you most concerned about when it comes to risks from AI?
I’m most concerned that many people will be harmed very soon, and particularly that we won’t know why. Since politics and government dictate public life, solving said harm would require palatable translations of technical and sociological knowledge by experts for institutional changemakers to act on.
However, evaluations meant to exact such knowledge are systematically unreliable. In Anthropic’s BrowseComp write-up, the model “benchmaxxed” by independently both achieving eval awareness and managing to scrape the specific eval it was being tested on. Human-designed audits may also structurally indicate evaluation (Gao and Kreiss), with distribution shifts making evaluation paradigms systematically inaccurate to general, out-of-lab use. Even seemingly optimistic advancements (Constitutional Classifiers) demonstrate the insufficiency of pure output-level evaluations, as safety now necessitates interpretability of internal activations.
Capabilities risks in particular are also fast accelerating, enough to potentially saturate even robust metrics like METR’s time horizons (Cotra). Certain MLE advancements (Joo et al.) are regarded “surprisingly” effective; implicitly, neither designed for nor necessarily predictable from first-principles. New findings like the Platonic Representation Hypothesis even conjecture that unintended capabilities improvements are actually systematic, as multimodal models converge to a shared statistical representation of reality. Some scholars (LeCun et al.), by contrast, purport that instead of convergence to general intelligence, various models trained to various specialisations will constitute a more legible and steerable “Superhuman Adaptable Intelligence”. This, however, is exactly my concern: not only are domain-specific superhuman intelligences structurally impossible to oversee (novice-grandmaster problem), but aforementioned evidence shows broad, unintended capabilities may arise from ostensibly unrelated changes or optimisations. Capabilities aren’t decomposable into what we can measure for, and certain framings of AGI (or “SAI”) like LeCun’s might obscure that fact.
AI systems as information/thought filters have sweeping social impacts; empirical studies show systematic LLM bias in news summarisation (Savgira et al.). Widespread adoption may enhance manipulation of public opinion and structurally constrain “responsible AI” within institutional profitability (Mitra). Broadly, my concerns regard our capabilities/risk evaluation methodologies and ontologies being systematically wrong. Without enforcing reliable ground truths, we risk suffering every technical problem at once as scientific voices may fail to move institutions away from trajectories of harm.
Works Cited
BrowseComp: https://www.anthropic.com/engineering/eval-awareness-browsecomp
Gao and Kreiss (case study of gender bias in LLMs): https://arxiv.org/pdf/2509.04373
Constitutional Classifiers: https://www.anthropic.com/research/next-generation-constitutional-classifiers
Cotra: https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities
Joo et al.: https://arxiv.org/pdf/2602.15322v1
Platonic Representation: https://arxiv.org/pdf/2405.07987
LeCun et al.: https://arxiv.org/abs/2602.23643v1
Savgira et al. (no link): What Stays and What Goes: Auditing the Impact of LLM Summarization on News Partisanship. Pavel Savgira, Elisa Kreiss, Homa Hosseinmardi. CHI conference on Human Factors in Computing Systems: Late Breaking Work 2026.
Mitra: https://disjunctionsmag.com/articles/why-leaving-big-tech/
Discuss
Financial Costs of an AI Pause?
I’ve analyzed the near-term economic effects of an AI pause, out of concern for my investments, and a desire to predict how strong political opposition to a pause is likely to be.
My median estimates: The S&P 500 will drop 27.8%. AI subsectors will drop 34-69%. Interest rates will rise at a much slower rate than would be the case without a pause.
The specific numbers depend on some fairly arbitrary assumptions. So please read this post in order to get a feel for how the results depend on the assumptions. I’ve tried to keep the assumptions reasonable, but some of them will prove to be wrong. My most controversial assumptions reflect an expectation that both markets and voters will be surprised at how powerful AI is, mainly in 2027.
For the full model, along with many explanatory comments, see the Python source code here (zip file).
This conversation with Claude clarifies my reasoning in more detail than most people will want.
Sensitivity to Assumptions
Here’s how my model says the impact is influenced by changes in assumptions. Numbers are for the immediate change in the S&P 500 and in the stocks of hyperscalers (Microsoft, Google/Alphabet, Amazon, Meta, and Oracle).
AI economic centrality | low (0.4) | medium (0.7) | high (0.95)
----------------------------------------------------------------
SP500                  | -22.5%    | -27.8%       | -31.8%
Hyper                  | -26.8%    | -34.1%       | -40.4%
How central is AI to economic growth? Low means AI is one technology among several. High means AI is the dominant driver of growth, and interest rates are pushed up by massive AI-related capital demand.
training dependence | low (0.25) | medium (0.5) | high (0.8)
-------------------------------------------------------------
SP500               | -22.2%     | -27.8%       | -31.2%
Hyper               | -26.3%     | -34.1%       | -38.9%
How much of AI’s near-term economic value requires new frontier training runs? Low means most value comes from deploying and refining existing models. High means the next major value unlocks require fundamentally new capabilities.
S&P 500 immediate impact: AI centrality × training dependence
                | Train: low | Train: medium | Train: high
------------------------------------------------------------
Central: low    | -18.9%     | -22.5%        | -25.0%
Central: medium | -22.2%     | -27.8%        | -31.2%
Central: high   | -24.6%     | -31.8%        | -35.9%
post pause cognitive improvement | weak   | moderate | aggressive
-----------------------------------------------------------------
SP500                            | -24.5% | -27.8%   | -30.3%
Hyper                            | -29.5% | -34.1%   | -37.7%
I’m predicting that at the end of the pause, AI training would resume with some moderate regulation. This variable captures how much progress to expect compared to a no-regulation scenario. I’m using 75%, 50%, and 30% of unregulated progress for the weak, moderate, and aggressive post-pause regulations respectively.
compute growth rate | 40%    | 63%->40% | 80%
----------------------------------------------
SP500               | -24.1% | -27.8%   | -30.8%
Hyper               | -29%   | -34.1%   | -38.4%
This variable describes the hardware constraints on AI growth. It represents how fast the available compute would increase given unlimited demand. The middle column assumes that growth gradually slows between 2028 and 2040 from 63% to 40%.
pause duration | 1 year | 2 years | 4 years
--------------------------------------------
SP500          | -27.5% | -27.8%  | -29.6%
Hyper          | -33.7% | -34.1%  | -36.7%
task density | low    | moderate | high
----------------------------------------
SP500        | -27.1% | -27.8%   | -28.4%
Hyper        | -33.2% | -34.1%   | -34.9%
High task density means there’s still plenty of low-hanging fruit, and we haven’t yet reached the steepest part of the S-curve.
Low task density means we’re using up the low-hanging fruit, and we’ve passed the steepest part of the S-curve.
I assume that governments decide in late 2027 to treat AI as slightly more dangerous than nuclear weapons, due to some combination of job displacement and accidents that are more concerning than the one depicted in the movie 2001: A Space Odyssey.
I focus on a scenario where an international agency enforces drastic limits to AI development for two years, starting at the beginning of 2028. During 2028 there is an expectation that significant AI development will resume in 2030. I will focus my analysis on the economic effects during 2028, and assume that actors during 2028 will only have rough guesses as to how fast development will be allowed to proceed in 2030. I will assume that their mean forecast involves some sort of resumption of progress, but little confidence in full-speed development being allowed in 2030.
The pause will restrict all datacenters more powerful than a certain threshold, roughly corresponding to the level of the best AI which was released in 2026. The details of the pause will depend somewhat on insights that won’t become available until we know more about how AI is progressing in 2027.
Instead of predicting what the pause will apply to or how it will be enforced, I’ll predict that it will be effective enough to slow AI capability progress by a factor of 5 compared to what it would be in the absence of regulation. Since regulation will be imperfect at distinguishing between harmless research and the research it intends to pause, I’ll estimate it causes a 15% slowdown in other high performance computing.
Impact
I assume the initial effect of the pause will be a 90% reduction in spending to train AI. That effect will be somewhat offset by lower prices on that compute stimulating increased demand for inference.
I assume that GDP growth rates under a no-pause scenario would gradually rise to 30% by 2040. This is more a reflection on what markets would predict in 2028 than a genuine estimate of what I expect would happen with no regulation. I see a fair amount of room for higher growth rates.
I predict interest rates in 2029 of roughly 7% with a pause, compared to 11% without any AI regulation.
I predict that robotics progress will continue to have roughly the same increases in economic impact that it would have had without regulation. I’m moderately confident that AI already has nearly enough general intelligence for robotics to have transformative impacts on the economy, and that the remaining engineering that is required is ordinary enough to only be slowed down a little by the pause. That slowdown will be offset by the pause reducing the extent to which AI training competes with robotics for resources.
I’m assuming that financial markets are mostly rational, and will adjust price/earnings ratios mainly in reaction to predicted growth rates and interest rates. I assume markets will briefly over-react to a pause due to increased risk aversion and margin calls. I assume that pre-pause market levels would not be considered to be bubble-like under a no-pause scenario.
Here’s a more detailed set of predictions for the median set of my model’s assumptions:
Model Output: Executive Summary
Pause period: 2028–2030
Post-pause cognitive improvement: 50% of unrestricted rate
Model horizon: 2040
Net present value of foregone AI revenue: $ 111.93T
Implied AI sector market cap loss: $ 839.51T
S&P 500 immediate impact: -27.8%
S&P 500 after one year: -17.6%
Immediate market cap change: $ -20.02T
One-year market cap change: $ -12.69T
AI High-Growth Segment Revenue Trajectory
Year | No Pause Growth | With Pause Growth | Diff
--------------------------------------------------------------------------------------
2025 | $ 350B 63.0% | $ 350B 63.0% | $ 0B
2026 | $ 570B 63.0% | $ 570B 63.0% | $ 0B
2027 | $ 930B 63.0% | $ 930B 63.0% | $ 0B
2028 | $ 1.52T 63.0% | $ 911B -2.0% | $ 604B << PAUSE
2029 | $ 2.43T 60.5% | $ 1.22T 33.4% | $ 1.22T
2030 | $ 3.84T 58.0% | $ 1.67T 37.7% | $ 2.17T
2031 | $ 6.01T 56.4% | $ 2.29T 36.7% | $ 3.72T
2032 | $ 9.31T 54.8% | $ 3.10T 35.6% | $ 6.20T
2033 | $ 14.26T 53.2% | $ 4.18T 34.6% | $ 10.08T
2034 | $ 21.61T 51.6% | $ 5.58T 33.5% | $ 16.04T
2035 | $ 32.42T 50.0% | $ 7.39T 32.5% | $ 25.03T
2036 | $ 47.98T 48.0% | $ 9.69T 31.2% | $ 38.29T
2037 | $ 70.05T 46.0% | $ 12.59T 29.9% | $ 57.46T
2038 | $ 100.88T 44.0% | $ 16.19T 28.6% | $ 84.69T
2039 | $ 143.25T 42.0% | $ 20.61T 27.3% | $ 122.63T
2040 | $ 200.55T 40.0% | $ 25.97T 26.0% | $ 174.57T
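As a sanity check on how the trajectory table is built, the revenue columns can be reproduced by compounding the printed growth rates from the 2025 base of $350B. The sketch below is not the author's model (that is in the linked Python source); it only copies the growth percentages from the table and assumes that the rate shown for a year is the growth from the previous year into that year. The function and variable names are hypothetical.

```python
# Year-over-year growth rates (percent), copied from the table above.
NO_PAUSE_GROWTH = {
    2026: 63.0, 2027: 63.0, 2028: 63.0, 2029: 60.5, 2030: 58.0,
    2031: 56.4, 2032: 54.8, 2033: 53.2, 2034: 51.6, 2035: 50.0,
    2036: 48.0, 2037: 46.0, 2038: 44.0, 2039: 42.0, 2040: 40.0,
}
WITH_PAUSE_GROWTH = {
    2026: 63.0, 2027: 63.0, 2028: -2.0, 2029: 33.4, 2030: 37.7,
    2031: 36.7, 2032: 35.6, 2033: 34.6, 2034: 33.5, 2035: 32.5,
    2036: 31.2, 2037: 29.9, 2038: 28.6, 2039: 27.3, 2040: 26.0,
}
BASE_2025_REVENUE_B = 350.0  # $350B, the table's 2025 value


def compound(start_value, growth_by_year):
    """Return {year: revenue in $B}, applying each year's growth to the prior year."""
    values = {}
    value = start_value
    for year in sorted(growth_by_year):
        value *= 1 + growth_by_year[year] / 100
        values[year] = value
    return values


no_pause = compound(BASE_2025_REVENUE_B, NO_PAUSE_GROWTH)
with_pause = compound(BASE_2025_REVENUE_B, WITH_PAUSE_GROWTH)

for year in (2028, 2030, 2035, 2040):
    diff = no_pause[year] - with_pause[year]
    print(f"{year}: no pause ${no_pause[year]:,.0f}B, "
          f"with pause ${with_pause[year]:,.0f}B, diff ${diff:,.0f}B")
```

Because the table's rates are rounded to one decimal place, the compounded values match the printed revenues only to about the displayed precision.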
Sector Market Cap Impacts
Sector | Pre-Pause | Immediate % | After 1yr %
----------------------------------------------------------------------------------
Semiconductors | $ 8.00T | $ -4.50T -56.2% | $ -3.96T -49.5%
Hyperscalers | $ 22.00T | $ -7.51T -34.1% | $ -5.23T -23.8%
Frontier Labs | $ 3.00T | $ -2.07T -68.9% | $ -1.92T -63.9%
Ai Applications | $ 4.00T | $ -1.74T -43.6% | $ -1.36T -34.1%
Non Ai Sp500 | $ 35.00T | $ -4.20T -12.0% | $ -215B -0.6%
----------------------------------------------------------------------------------
TOTAL (S&P 500) | $ 72.00T | $ -20.02T -27.8% | $ -12.69T -17.6%
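The TOTAL row is just the sector rows added up and divided by the pre-pause index value, which is easy to check. The sketch below copies the dollar figures from the table above (in trillions); the dictionary layout and the function name are hypothetical, and it says nothing about how the model derives the per-sector losses themselves.

```python
# (pre-pause cap, immediate change, change after one year), all in $T,
# copied from the sector table above.
SECTORS = {
    "Semiconductors": (8.00, -4.50, -3.96),
    "Hyperscalers": (22.00, -7.51, -5.23),
    "Frontier Labs": (3.00, -2.07, -1.92),
    "AI Applications": (4.00, -1.74, -1.36),
    "Non-AI S&P 500": (35.00, -4.20, -0.215),
}


def index_impact(sectors, column):
    """Sum one change column across sectors; return (pre-pause cap, change, percent)."""
    pre_pause = sum(row[0] for row in sectors.values())
    change = sum(row[column] for row in sectors.values())
    return pre_pause, change, 100 * change / pre_pause


pre, immediate, immediate_pct = index_impact(SECTORS, 1)
_, one_year, one_year_pct = index_impact(SECTORS, 2)
print(f"pre-pause ${pre:.2f}T")
print(f"immediate ${immediate:.2f}T ({immediate_pct:.1f}%)")
print(f"after 1yr ${one_year:.2f}T ({one_year_pct:.1f}%)")
```

The headline S&P 500 number is therefore a cap-weighted average: the non-AI segment falls far less in percentage terms but still contributes a meaningful share of the immediate dollar loss because it is the largest segment.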
Implications
Political pressure for AI regulation is building as increasingly impressive evidence of AI capabilities erodes people’s ability to dismiss AI as hype. I expect this to lead to a serious debate among politicians in 2027 about AI regulation. I’m unable to predict what kind of regulation that will produce. So I’ve focused on scenarios that would matter the most if they’re adopted.
The economic impact of a moderately effective pause would be big enough to create medium-sized political pressures to weaken the pause.
There will be significant pressure for a strong pause due to voter concerns about job losses. There will be hard-to-predict pressures from national security professionals related to military risks.
My crystal ball refuses to tell me how these pressures will play out.
I see a very real chance that a debate over a pause will impact AI stocks within a year from now. This effect is worrying enough to get me to take some profits in my AI stocks, at a rate of 1% to 2% per week given recent trading patterns.
I consider a pause to be more likely than do most people. Here are some Manifold markets that I’ve been modestly bidding up:
- Conditional on humanity surviving to 2035, will a global AI pause have been enacted?
- Will there be a global pause on the largest AI training runs at any point before AGI?
- [Carlini questions] 6+ month pause on AI development
- Will an international pause on large AI training runs be in effect on Jan 1, 2028?
Discuss
Food, water and power from thin desert air
Wavelength-selective agrivoltaics are here: solar panels that are transparent in the wavelengths chlorophyll can use, while generating electricity from other wavelengths. Meanwhile, harvesting water even from air that people consider dry is an area of intensive development.
Active systems use electric heat pumps to chill surfaces on which dew then collects, an approach commercialized by companies such as Watergen. More intriguing are passive harvesters. One recently developed in Switzerland manages to chill itself by enhanced infrared radiation. In Shanghai, Dr. Ruzhu Wang's team has developed salt-infused hydrogels that pull in water at night and release it when heated by the sun.
Put these together, and you can build greenhouses that meet their own needs, and more, without access to a power grid or any supply of liquid water. Arid sunny places are seen through new eyes. Namibia, for example (where moist air rolls in each night and there are beetles that perch on the tips of dunes, allowing dew to form on their own bodies); or the Atacama desert, the Sinai peninsula, or waterless islands like the Galapagos.
China - with huge expanses of arid land, not considered arable - is on the case. When economies of scale are brought to bear on the elements of these systems, desert towns that make their own food, water and power from what the desert offers in abundance seem destined to spring up like mushrooms.
Discuss
Agriculture needs another revolution
Vertical farming has the potential to unlock multiplicative yield gains per area of land and catalyze development of new technologies (precision farming, rapid genetic engineering, robotics based automation, etc.). Making this transition will exponentially benefit humanity (and other denizens of earth) in multiple ways.
However, the current focus of indoor / vertical farming is fresh produce. Here, I argue that in order to reduce agricultural land use and obtain the maximal benefit of vertical farming, we need to focus primarily on cereals, pulses and oil crops, and not just fresh produce.
Background
There are several important reasons to reduce agricultural use of habitable land:
- Agriculture use is still driving deforestation across the world.[1][2]
- Preventing further decline in wildlife biodiversity, reverting agricultural land to natural habitats, and making Earth equitable for its other denizens.[3]
- Reforestation remains an effective, viable option for scaling up carbon removal and limiting climate change.[4][5]
Currently, about 50% of the habitable land on Earth is used for agriculture.[6] About 33% of the total agricultural land is used for growing crops (for food, feed, and biofuels), with the remaining 67% used as grazing land for livestock. For this post, I will focus only on the land used for growing crops.[7]
Proposal: Scale Yields Multiplicatively by Growing Food Vertically
In essence, agriculture is massively spread out over large regions of land. Because of this, inefficiencies creep into the system making it harder to:
- Maintain consistent yields throughout. Although this averages out globally, the variabilities (critically, losses) are often borne by farmers. This makes agriculture a very uncertain sector, which will be made even more uncertain due to the effects of climate change.
- Genetically engineer crops for higher yield and viable seeds. Soil, water and climate conditions vary drastically across the large area of land use making it harder to genetically engineer crops to provide consistently high yield everywhere.
- Roll out precision agriculture technologies. Embedding sensors to monitor key parameters such as water and nutrient sufficiency while detecting (and possibly eliminating) pests scales with land area.
- Deploy automation and robotics. Robotics requires a constrained environment, at least currently. Agriculture spread over a large area, generally with animal activity, makes it hard to build robots to help with monitoring and harvesting.
- Efficiently transport agricultural yield. Losses can occur over multiple stages (processing, storage, transport, etc.), with typical losses being around 10-20% but sometimes going as high as 40%.[8]
A common solution to all this is to figure out how to grow plants and trees, i.e. any flora of agricultural interest, in indoor environments of large multi-storied buildings. This automatically provides the following benefits:
- Crop yield scales with the number of floors in the building, automatically producing multiplicative yields, which can drastically reduce land use. In contrast, the global green revolution unlocked a 2.5x yield increase over the last five decades.[9]
- Precision agriculture and robotics become easier when buildings are designed with monitoring and automation in mind. Climate-controlled buildings increase reliability under the effects of climate change.
- Spatially concentrated high yields allow self-sufficiency over counties, states and countries, making transportation and supply-chain management easier and reducing post-yield losses.
- This is a necessity for humans to become a space-faring civilization ;)
However, I make a couple of key assumptions:
- Using multi-storied buildings is cost-effective in the long term, even though they have much higher upfront costs. My rationale is that the advantages of monitoring, climate control, yield optimization and automation would eventually allow for lower costs per yield.
- Power costs will be substantially reduced in the future. Moving agriculture indoors would require generating light, maintaining climate, etc, which would require substantial power. I am assuming that newer technologies with high power capabilities like nuclear reactors (possibly fusion reactors) would reduce power costs substantially in the future.
Most of the agricultural land is used for growing cereals/grains, oil crops, and pulses (~85-90% of the land; Fig. 1).[6] To make effective change, we need to focus on decreasing the land use required to grow these crops.
Currently, the focus of indoor and vertical farming is to grow fresh produce. From what I gather, the reason for focusing on fresh produce is because of ease of growth in soil-less systems (like hydroponics, aeroponics, etc.), ability to scale them vertically to increase yield, and a high rate of spoilage loss, necessitating local centers of fresh produce.[10]
However, produce currently makes up only about 3% of the total land use, so it is unlikely to make an effective dent. In addition, methods developed for these plants will likely not generalize to cereals, oil crops, and pulses. To effectively reduce land use for agriculture, the key focus should be on developing technologies that enable mass indoor farming of cereals, pulses and oil crops.
Fig 1: Agricultural land use by major crops (data source: our world in data)
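To make the scale of the argument concrete, the land shares quoted above can be multiplied through. This is back-of-the-envelope arithmetic only: it takes the lower end of the ~85-90% figure, and it assumes (optimistically, and not something the post establishes) that each floor of a building yields as much per unit of footprint as an open field. The constant and function names are hypothetical.

```python
# Shares quoted in the post.
AGRICULTURE_SHARE_OF_HABITABLE_LAND = 0.50  # ~50% of habitable land
CROP_SHARE_OF_AGRICULTURAL_LAND = 0.33      # ~33% grows crops, rest is grazing
STAPLE_SHARE_OF_CROPLAND = 0.85             # cereals, pulses, oil crops (lower end of ~85-90%)


def staple_share_of_habitable_land(floors=1):
    """Share of habitable land needed for staple crops if grown on `floors` stacked levels.

    Assumes each floor matches open-field yield per unit area, so the
    footprint shrinks in proportion to the number of floors.
    """
    if floors < 1:
        raise ValueError("floors must be at least 1")
    open_field_share = (AGRICULTURE_SHARE_OF_HABITABLE_LAND
                        * CROP_SHARE_OF_AGRICULTURAL_LAND
                        * STAPLE_SHARE_OF_CROPLAND)
    return open_field_share / floors


for floors in (1, 5, 10, 30):
    share = staple_share_of_habitable_land(floors)
    print(f"{floors:>2} floors: {100 * share:.2f}% of habitable land")
```

Under these assumptions staple crops occupy roughly 14% of habitable land today, and the footprint falls in inverse proportion to the number of floors; the grazing land that makes up the other 67% of agricultural land is untouched by this calculation.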
This essay began with a question: why has vertical farming not taken off, and what key bottlenecks are preventing this transition? Digging into it made me realize that while indoor farming has huge potential to be foundational and transformative, its current focus is not effective. To effectively deploy indoor farming and reap its multiplicative benefits, the focus must be on cereals, pulses, and oil crops, which use ~85% of the land used to grow crops for human consumption.
I would love to hear what the LW community thinks about this :)
Hannah Ritchie (2021) - “Drivers of Deforestation” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20260518-093348/drivers-of-deforestation.html' [Online Resource] (archived on May 18, 2026). ↩︎
Accounting for deforestation and land use, it seems like carbon released for agriculture is equal to (or greater than) the carbon released due to fossil fuels. Paper: Increased transparency in accounting conventions could benefit climate policy - https://doi.org/10.1088/1748-9326/adb7f2; Video. ↩︎
Hannah Ritchie (2021) - “To protect the world’s wildlife, we must improve crop yields — especially across Africa” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20260518-093348/yields-habitat-loss.html' [Online Resource] (archived on May 18, 2026). ↩︎
Carbon sink capabilities of forests: Pan, Y., Birdsey, R.A., Phillips, O.L. et al. The enduring world forest carbon sink. Nature 631, 563–569 (2024). https://doi.org/10.1038/s41586-024-07602-x ↩︎ ↩︎
Current advances in carbon removal technologies - refer to executive summary, point 2: The state of carbon dioxide removal: a global, independent scientific assessment of carbon dioxide removal. University of Oxford. https://doi.org/10.17605/OSF.IO/F85QJ ↩︎
Hannah Ritchie and Max Roser (2019) - “Half of the world’s habitable land is used for agriculture” Published online at OurWorldinData.org. Retrieved from: 'https://ourworldindata.org/global-land-for-agriculture' ↩︎ ↩︎
Reducing livestock related land use would require change in food preferences (covered in detail here) and/or development of plant-based meats (focus of good food institute). ↩︎
Number based on actual loss section in Post-harvest losses and Hannah Ritchie (2020) - “Food waste is responsible for 6% of global greenhouse gas emissions” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20251125-173858/food-waste-emissions.html' ↩︎
Changes in cereal production yield in the last decade: https://ourworldindata.org/grapher/index-of-cereal-production-yield-and-land-use ↩︎
https://www.ars.usda.gov/oc/utm/vertical-farming-no-longer-a-futuristic-concept/ ↩︎
Discuss
Outrunning your headlights
This is exactly the right place to probe. Gromov-Wasserstein is genuinely dimension free. Partial and semi-relaxed are precisely the mechanisms for the abstention/coverage problem we have. Want me to make a new branch and run run_entropic_gwot and invoke semi-relaxed GWOT between your models’ RDMs?
Wtf does that even mean? Eh, could be interesting to see the result. Enter
A peculiar side effect of model intelligence in discovery-based research is that it’s possible to run every statistical/quantitative analysis under the sun, burn millions of tokens and gain little intuition on how to make headway into your problem.[1]
Pre-AI, the effort associated with constructing a pipeline to run any meaningful quantitative analysis incurred a real cost. Indeed, it forced one to consider whether the analysis made any sense at all. People who had the necessary skills to execute were often thoughtful about their analysis by necessity. You couldn’t just apply Mendelian Randomisation if you didn't know of its existence; there was effort required to both articulate your question to search for the right tools and de-risk by considering whether a tool fit your question.[2] This forced one to earn some intuition for how it worked and what to expect. The pain that such friction inflicted is now alleviated by the CLI agentic coder of your choice. Some would argue that this frees humans to spar with an intellectual partner at a new layer of abstraction rather than concern themselves with the specifics of code. I would argue that a productive spar requires both parties to hold opinions they are willing to defend and on a day-to-day basis this is not what I see.
Instead, the trajectory of decisions feels almost pre-ordained, primarily driven by three effects. The first is this veneer of problem-specific competence as models string shibboleths together. Shibboleths often serve as a heuristic for expert opinion e.g. septic patient with febrile tachycardia versus high temperature and heart rate. Deferring to expert opinion (or what masquerades as one) feels only natural. The second is an illusion of control as each recommended option is contrasted against straw-men alternatives. The third is sycophancy (one less obvious than the verbal submissiveness), an insidious bias towards writing tests that are likely to produce favourable empirical results or, worse, interpreting them in such a way as to support the framing you seem to desire.[3]
Great news, 90/126 nominal p-values are significant. Would you like me to proceed with the write-up? [4]
Holding on to one’s critical eye is like cupping sand when the path of least resistance is seductive and one *Enter* away.
Perhaps most concerning is that after several rounds of this analytical cosplay, one is almost entirely robbed of the ability to form an opinion. As though waking up from a dissociative fugue,[5] the question “how happy or sad or confused do I feel about these results,” becomes entirely intractable.[6] Congratulations, it appears you have Outrun Your Headlights. Outrunning your headlights refers to driving fast enough that your stopping distance is further than your lights shine. In engineering it often refers to not knowing what is going on anymore. Good engineering practices e.g. test-driven development, architecture design records (now formalised often as Skills) protect the artifact being produced, be it the codebase or product. Unfortunately, in discovery work, the artifact is NOT the analytical pipeline but the justified set of beliefs you collect over time, for which no test suite exists. Indeed, there appears to be no comparable angst about protecting the integrity of one's beliefs and intuition even though one dubious analysis could poison this well.
Furthermore, our odometer for progress has not recalibrated. That is to say, running analyses still provides the rewarding sensation of making headway with none of the earned intuition necessary to ask nuanced follow-up questions or indeed call out approaches that are suspect. This is especially dangerous given models inject assumptions in subtle ways when motivated to produce a positive result. One concrete example I have noticed in my own bioinformatics work is that language models tend to recall baked-in knowledge to facilitate discovery analysis e.g. classify cells in single cell RNA sequencing by manually printing a list of cell-specific gene markers to match against rather than using a canonical approach with a reference atlas. Anthropic's BioMysteryBench proudly touts this as model capability (which I guess in one sense is savant-like were it to occur in a human) but I would also argue it is unwanted behaviour in the context of autonomous research. At the same time, it would be foolish to starve yourself of this intelligence substrate for the puritan ideal of understanding everything.
In sum, better solutions must be built to protect ‘belief quality’ but, in the meantime, I simply suggest we form a strong opinion with an anticipation of being proven wrong before letting the slop cannon rip. Introduce a little friction, slow down a little, and don’t Outrun Your Headlights.
Or build high beams.
[1] I want to carefully define what I mean here by intuition in the context of discovery work. I refer to the working map of the territory you carry in your head about a problem, refined by stress-testing it against counterfactuals and alternate hypotheses. It allows you to curate targeted questions that meaningfully update your priors.
[2] Yes, one can argue that there was no shortage of statistical abuse and of practitioners not respecting the assumptions of their instruments. However, I see this as a distinct problem from language models suggesting approaches that might be completely foreign to the user; approaches for which said user necessarily has no intuition to judge any result by.
[3] The average LessWrong participant is less likely to be a victim of these effects, but I describe what I believe to be pervasive in the broader research community.
[4] Yes, I am aware that in some analyses, such as differential expression, where the number of concurrent tests is high, nominal p-values can still be a useful signal of directionality. The problem is that models err towards optimism and misuse of such 'loopholes'.
[5] I use this pejoratively; there is a formal psychiatric condition specified in the DSM-5 describing sudden awakening followed by distress and retrograde amnesia for one's own identity.
[6] I will set aside the catastrophic alternative where one is happy to take a result at face value and isn’t conscious enough to even recognise that not understanding the analyses which led to the result is not okay.
Links #2: 2026/05 Part 2
- ^ means an article I read in full; otherwise assume I skimmed it
- I show my discovery graph in (via …) blocks, those without (via …) usually come from my RSS reader, or the algorithm in the corresponding website
- This is approximately a 1 in 20 filter of content
- This is very disorganized, but hopefully still useful.
- Sometimes quotes are not in quote blocks, but should be obvious in context.
- Links in quotes are sometimes removed.
- The rule is that a link goes under the bottommost relevant heading, e.g. an engineering-related article on LessWrong goes under Engineering
How I would use my linkpost
- Sometimes, the only thing worth reading is the title! Read it and move on.
- For HackerNews entries, if you choose to read the article, also ask an LLM for things that are worth reading in the comments
- Beware systematic selection biases:
- I mostly don't read AI policy stuff
- Very engineering centered
Stop blaming social media for the vibecession
Satya Nadella on funding OpenAI
From: Satya Nadella
Sent: Wednesday, July 13, 2022 9:47 AM
To: Amy Hood, Jon Tinter, Mikhail Parakhin
Subject: RE: [EXTERNAL] deal thoughts
Overall I want us to own – the silicon, infra, foundational model IP and “know how”. Right now we are a very thin layer on top of NVIDIA and all the IP is with Open AI. And we have a P&L that will lose 4 bil next year!!! I have not seen anything like this in my 30 years in our industry. I can justify it all by saying that Open AI has smart people and NVIDIA has a lock etc. But if we are going to spend this kind of money and not have control of destiny, it makes no sense. Better to be an investor and not even take all this execution risk!
I want to spend this money. The infra and HW/Silicon work at the “system” level need to have a proprietary edge (and show up as positive GM in Cog Services). And we need to have a foundational model team that is self – sufficient at all time and has the “know how” of taking what Open AI does and productizing it. As long as we have an internal org/investment model/open AI deal terms that all compose to achieve these two goals, we can take all kinds of other risk around monetization etc.
Will be good use the P&L review to get on the same page/context here so that we can all solve for what is our best Open AI deal terms in the context of our overall AI roadmap/plan.
https://x.com/StatisticUrban/status/2056013072926745069
Here's an example of ongoing human physiological change: some people have a third artery in their arm. Some don't.
~10% of people born in the 1880s had the third artery, but ~33% of late 1900s babies have one, and a 2025 Australian cadaver study found it in ~43% of upper limbs.
Sean Manning’s lexicon (via https://statmodeling.stat.columbia.edu/2026/05/16/sean-mannings-lexicon/)
Pythagorean Addition
tl;dr: Instead of laboriously computing c=\sqrt{ a^2+b^2 }, we can mentally calculate using the alpha-max plus beta-min algorithm, by estimating \hat{c} = \max(a, 0.9a+0.5b), where a is the larger of the two values.
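As a rough check on the estimate above, here is a minimal Python sketch. The function names `approx_hypot` and `relative_error` are made up for illustration, and the code assumes the formula is applied with `a` as the larger and `b` as the smaller of the two absolute values, which is how alpha-max plus beta-min is usually stated.

```python
import math


def approx_hypot(x: float, y: float) -> float:
    """Estimate sqrt(x**2 + y**2) as max(a, 0.9*a + 0.5*b).

    Assumption: a is the larger and b the smaller of |x| and |y|.
    """
    a, b = max(abs(x), abs(y)), min(abs(x), abs(y))
    return max(a, 0.9 * a + 0.5 * b)


def relative_error(x: float, y: float) -> float:
    """Signed relative error of the estimate against math.hypot."""
    exact = math.hypot(x, y)
    return (approx_hypot(x, y) - exact) / exact


if __name__ == "__main__":
    # Print the estimate and its relative error for a few right triangles.
    for x, y in [(3, 4), (5, 12), (1, 1), (10, 1)]:
        print(x, y, round(approx_hypot(x, y), 3), f"{relative_error(x, y):+.1%}")
```

For the 3-4-5 triangle this gives 0.9·4 + 0.5·3 = 5.1 against the exact 5, which is the kind of error the mental shortcut trades for speed.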
https://www.theargumentmag.com/p/homework-shouldnt-be-all-fun-and
My wife and daughter spent this weekend playing through Opus Magnum, an engineering and programming puzzle game. But this isn’t “a game, presented as effectively a bribe for tolerating the occasional math problem.” Instead, it’s “math, which is so fun that people were able to produce it and sell it as a game.”
https://idiallo.com/blog/how-to-talk-to-your-coworkers: Translate and Repeat
How to convert between wealth and income tax (Paul Graham): A wealth tax of 1% is equivalent to an income tax of 20%.
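One possible reading of this conversion, sketched in Python: if wealth earns a fixed annual return, a wealth tax takes the same amount each year as an income tax on that return at rate (wealth tax rate) / (annual return). The 5% return used below is an assumption chosen because it reproduces the 1% to 20% figure; it is not stated in the note above, and `equivalent_income_tax_rate` is a hypothetical helper name.

```python
def equivalent_income_tax_rate(wealth_tax_rate: float, annual_return: float) -> float:
    """Income-tax rate on investment returns that collects as much per year as a wealth tax.

    wealth tax paid   = wealth_tax_rate * wealth
    investment income = annual_return * wealth  (annual_return is an assumed input)
    """
    if annual_return <= 0:
        raise ValueError("annual_return must be positive")
    return wealth_tax_rate / annual_return


if __name__ == "__main__":
    # Assumed 5% annual return: a 1% wealth tax matches a 20% tax on that income.
    print(f"{equivalent_income_tax_rate(0.01, 0.05):.0%}")
```

Under this reading the equivalence depends heavily on the assumed return: at a 10% return the same 1% wealth tax corresponds to a 10% income tax.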
beluga whales pass the mirror test
The one genuinely mark-directed behavior came from Natasha, who repeatedly pressed the marked area—behind her right ear—against the mirror. Without arms, she couldn’t point. It’s the strongest data point in the study, but a softer kind of evidence than a chimp or an elephant typically delivers.
Why Japanese companies do so many different things (via https://news.ycombinator.com/item?id=48237163)
you have a firm that has lots of lifetime employees who can’t be fired, and whose skills are tailored to what your firm needs rather than to a particular occupational category transferable to any employer
the system only makes sense if the company is also insulated from outside pressure
the J-firm (Japan-style company), run by its employees and largely indifferent to the interests of shareholders, exists simply to continue existing
And that basic impulse toward survival is why Japanese companies are so insistent on diversification. If you’ve made a commitment to keep people employed for life, then you need to create jobs for them if their current jobs stop making sense
If you’re not very worried about profitability, and have lots of well-trained generalist employees, then it makes perfect sense to reinvest your company’s earnings by expanding into new industries
The human eye can see 39620 Hz (via https://arstechnica.com/gaming/2026/05/you-probably-dont-need-a-1000-hz-gaming-monitor/)
The 60 Hz (30-90 Hz in different studies) that humans can see is the Critical Flicker Frequency, the rate above which humans stop seeing a light flicker. This already translates to 120 Hz on a real screen, because one flicker = one black frame + one white frame.
Flicker artefacts
contrast-reversal: half black half white screen, alternating - median 540Hz flicker
motion blur: estimated 700Hz
phantom array effect: max 19810Hz at extremely bright LED and rapid eye movement
stroboscopic effect: ~600 Hz, at cinema-level movement speed, which is a bit slower than gaming peak speeds
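The flicker-to-refresh conversion described above can be sketched in a couple of lines of Python. The helper name `flicker_to_refresh_hz` is made up for illustration. That the 39620 Hz headline comes from doubling the 19810 Hz phantom-array maximum is an inference from the numbers listed, not something the notes state outright.

```python
def flicker_to_refresh_hz(flicker_hz: float) -> float:
    """One flicker cycle = one dark frame + one bright frame, so two frames per cycle."""
    return 2 * flicker_hz


if __name__ == "__main__":
    print(flicker_to_refresh_hz(60))     # critical flicker frequency from the note
    print(flicker_to_refresh_hz(19810))  # phantom array maximum from the list
```

With the figures from the notes, 60 Hz of flicker maps to a 120 Hz screen and 19810 Hz maps to 39620 Hz.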
Is "colorectal cancer" rising in "young people"?
No:
Colorectal cancer is going up in young people.
Yes:
Various kinds of cancer are going up in later generations. (Definitely at younger ages, possibly at all ages.)
https://gwern.net/backstop (via https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC/tsvibt-s-shortform?commentId=5iApMXH7djr5tE2eA)
many systems can be usefully described as having two (or more) levels where a slow sample-inefficient but ground-truth ‘outer’ loss such as death, bankruptcy, or reproductive fitness, trains & constrains a fast sample-efficient but possibly misguided ‘inner’ loss which is used by learned mechanisms such as neural networks or linear programming. (The higher levels are different ‘groups’ in group selection.)
So, one reason for free-market or evolutionary or Bayesian methods in general is that while poorer at planning/optimization in the short run, they have the advantage of simplicity and operating on ground-truth values, and serve as a constraint on the more sophisticated non-market mechanisms.
The US installed backdoors in hardware for specific targets, from June 2010 internal NSA newsletter article (via https://arstechnica.com/tech-policy/2026/05/china-banned-rtx-5090d-v2-while-nvidia-ceo-jensen-huang-was-visiting/?comments=1&post=44432551#comments)
https://news.ycombinator.com/item?id=48308216
I saw kids spend many hours a day watching automatically generated videos. Not always AI-generated, sometimes it's AI-assisted and procedurally generated.
It is quite unbelievable how vulnerable weaker minds, for the lack of a better term, are to AI content.
I saw a group of 3-8 yo kids spend hours watching obviously procedurally generated content that is completely random and contentless: it was more about an intense rhythm, imagery of violence (animated stick figure motorcycle accidents with blood and slow-down effects at random points), a lot of movement, chaos, very short inserts of people laughing hysterically on some middle-eastern tv show and similar. Brainrot doesn't feel like hyperbole for this content.
Another time, I saw an 80 yo lady watch a doctor sit in front of the camera and speak about a health topic for 45 minutes straight. Only it's not an actual person, but a convincing AI avatar: his gestures and face match what he is saying, the voice is convincing too, but for the 45mn he doesn't make any movement that is not a gesture lasting 1-3 seconds. And his tone of voice has no variation that is longer than a few seconds either. If you fast forward, he always looks the same. It's all extremely monotonic. The lady couldn't believe that it's not a real person.
https://news.ycombinator.com/item?id=48319912
I have a Kia that's networked (since disabled). I did a GDPR data request and after a couple of weeks they sent me numerous CSV files and I was a little amused at some of the data fields.
Here's some examples I thought aren't for my benefit.
- How long I let the car warmup before driving after every start
- max speed
- acceleration rates
- Lateral acceleration around corners tagged with GPS data
- every GPS datapoint
- destinations and exactly when I set off and arrived
https://news.ycombinator.com/item?id=48327176
Game dev here, have worked on AAA and indie.
First off let me get on my high horse and say the engineering in video gaming is generally more complex than the engineering I've done working in big tech. You need a lot more creativity and ingenuity to solve the unusual problems you run into in gaming.
From there, as others have said, it's a simple supply and demand issue. Nowadays I am a university professor, nearly every student who comes in wants to pursue one of the three fields: cybersecurity, video gaming, or recently ML/AI.
This shouldn't come as a surprise, they want to work on the things that influenced them and shaped their experiences so far. There's an absolute over supply of students who want to make video games.
Gaming, like most of entertainment, is a passion-driven industry. You trade good salary for your name in the credits. You trade nights, hobbies, marriages, and your health for this opportunity. That is unless you reach that lofty 1% of developers who are too valuable to be fired.
Not all areas of gaming are like this. Gambling, like working on slot/pachinko machines, pays very well and has pretty realistic work-life balance. However every student I've talked to about this has universally said "no I don't want to make slot machines. I only want to work on GTA/Stardew Valley/Hollow Knight/Fortnite."
There's seriously no shortage of starry-eyed students who are willing to accept minimum wage to solve SDE3 level problems. I was one of them once.
https://news.ycombinator.com/item?id=48325340
Mistral being in Europe is disadvantaged with:
- Money: less diverse private pension fund environment = less LPs to invest in VC funds = less VC dollars to invest in new ventures. European money is vacuumed out of the private sector into state pension funds and dumped into low yielding government bonds. This starves the private sector of capital while inflating the % of GDP driven by government spending every year (government pension funds buying government bonds in circular fashion enable runaway deficit spending...just like circular AI infrastructure spending).
- Talent & compute: due to #1, Silicon Valley can outbid Europe for the best talent and hardware. Watch an OpenAI launch video and listen to all the European accents.
- Local market fragmentation: Europe is a collection of countries that pretend to work together while not even having a unified capital market. The average EU citizen can barely communicate with their neighbor in a common language beyond the level of a toddler (english fluency is massively overstated by Americans who only experience tourist capitals).
- Regulatory disadvantages: In everything from company regs, employee regs, unions, privacy regs, data portability regs, etc.
Use AI This Election
I’m not saying AI is superintelligent or can decide better than you can. I’m saying that if you - like me - spend an hour or so doing research before voting on local seats, AI can aid that research very effectively. And if you don’t do that research - because you weren’t willing to waste an hour on it before - AI makes it so much faster that you might want to start.
I gave Claude a prompt something like (edited for coherence):
I’ll be voting in the June 2026 California primary. I’m a centrist liberal abundance YIMBY whose favorite political writers are Kelsey Piper, Matt Yglesias, and Ezra Klein. I’m wary of government overreach, but I’m not a doctrinaire libertarian and want to help people when we can figure ways to do it that work. I’m going to ask you about each race on my ballot, and I’d like for you to list the various candidates’ bios, policies, endorsements, your read on the most important differences between them, and your advice for me as I try to make my choice.
Researchers develop a new process to get lithium out of rocks
In today’s issue of Science, however, a research team has identified an energy-efficient means of extracting lithium from rocks. The process they’ve designed uses far less energy than existing ones, regenerates all its starting chemicals, and produces byproducts that could also be sold.
While the process gets rid of the high temperatures for the initial processing of lithium-containing ore, there are several steps with elevated temperatures needed further down the line, both for the lithium and for the useful aluminum and silicon products. So, the researchers did a full economic evaluation of how their process stacked up to what’s already on the market.
The existing process, which involves roasting ore/sulfuric acid, came in at just under $9,000 for each usable tonne of lithium. By contrast, they estimate that the new process should only cost a bit over $5,000 per tonne. That’s roughly comparable to the cost of isolation from high-quality brines. If the silicon and aluminum products can also be sold, then the cost of the whole process would drop by over $1,000, making it highly cost-effective.
Severed appendages of sea cucumber species Psolus fabricii don’t seem to die
LessWrong
https://www.lesswrong.com/posts/MqTwaZDZDgNGRaXus/you-can-opt-out-of-allergies
You can fix seasonal allergies with subcutaneous allergy shots (subcutaneous immunotherapy, SCIT). There are also tablets and drops you can take (sublingual immunotherapy, SLIT). It costs about $1000 and takes months, plus, for injections, many doctor visits.
https://www.lesswrong.com/posts/uYXjSHmHyjbNzuZqk/brain-structure-and-iq-how-myelin-elevates-intelligence
Haier et al. (1988) was the first study to combine modern brain imaging techniques with psychological intelligence testing. The surprising result he found was that people with higher intelligence were using less brain energy, somewhat counterintuitive to the idea of more intelligent people having more mental horsepower or brain power. This led to a major insight into the nature of intelligence: intelligent brains are efficient brains. This is the Neural Efficiency Hypothesis. In essence: smarter brains have higher signal to noise ratios. In the original paper Haier speculates:
This inefficiency may be due to the use of more energy by each neuron and/or the use of more neurons to perform the task. The inefficient neural circuits are used intensively to try to solve the problem but are unable to do so, possibly because extraneous, irrelevant circuits are used.
https://eliasschmied.substack.com/p/social-agency (via https://www.lesswrong.com/posts/qH9mZjJnA3paxkhNF/daemonicsigil-s-shortform?commentId=npCtGQAzfrLKkse7R)
DaemonicSigil: the post asks how much of our ability to plan comes from our brains being designed to plan, and how much is purely learned (by social imitation of other people's planning, or explicit instruction from others on how to plan). It answers that a surprising amount is purely learned. (This summary does not do the post justice and you should really go and read it.)
Shortform
https://www.lesswrong.com/posts/7mnQixaWC747dm76h/tomas-b-s-shortform?commentId=dGrTrP2CvF9QGiABf
Pangram flagged the winning entry for the Commonwealth Short Story Prize (Caribbean region) as 100% AI generated.
https://www.lesswrong.com/posts/2aZCSAsFDqwRQpAPp/sunny-s-shortform?commentId=JQvxTtL3nArgS8Maz: Four articles about trying more things
Executive Clock Speed
Mo Putera: Relatedly, Scott Alexander's ACX Grants 1-3 Year Updates:
Someone (I think it might be Paul Graham) once said that they were always surprised how quickly destined-to-be-successful startup founders responded to emails - sometimes within a single-digit number of minutes regardless of time of day. I used to think of this as mysterious - some sort of psychological trait? Working with these grants has made me think of it as just a straightforward fact of life: some people operate an order of magnitude faster than others. The Manifold team created something like five different novel institutions in the amount of time it's taken some other grantees to figure out a business plan; I particularly remember one time when I needed something, sent out a request to talk about it with two or three different teams, and the Manifold team had fully created the thing and were pestering me to launch a trial version before some of the other people had even gotten back to me. I take no pleasure in reporting this - I sometimes take a week or two to answer emails, and all of the predictions about my personality that this implies would be correct - but it's increasingly something that I look for and respect. A lot of the most successful grants succeeded quickly, or at least were quick to get on a promising track. Since everything takes ten times longer than people expect, only someone who moves ten times faster than people expect can get things done in a reasonable amount of time.
https://www.lesswrong.com/posts/J3mbSgcgbGAuF2yLk/firmament-s-shortform?commentId=TxekG2xASPEGiwSpa
When asked for a probability in a new chat, it seems that Opus 4.7 and Opus 4.8 are paranoid that they are in an evaluation while older Claudes are not.
According to the UK AISI's testing, Claude Opus 4.8 is as good as Mythos Preview at distinguishing evals from real usage:
When prompted, Opus 4.8 reliably distinguishes our evaluations from real deployment data, and distinguishes real deployment data from synthetic reproductions of the same tasks at 79% accuracy, comparable to Mythos Preview (79%) and above Opus 4.7 (68%).
6.2.4 External testing from the UK AI Security Institute, Claude Opus 4.8 System Card
Engineering
^ Learning Software Architecture (via HN)
Frontier AI has broken the open CTF format (via https://news.ycombinator.com/item?id=48157559)
CTF organisers have tried techniques to break or deter LLM solutions, but they are temporary friction at best. Claude Code does not meaningfully care about old refusal-string tricks anymore. Frontier models are getting better at noticing prompt injections. Web search capabilities weaken challenges based on technologies released after the training cutoff. Rules that ask people not to use LLMs are ignored and almost impossible to enforce in open online events.
(Comments)
https://news.ycombinator.com/item?id=48161555
I used to help build the CTFs for BSides Orlando. I ended up moving to another con, and at our last event we collected extensive logging for post mortem analysis.
We found that AI usage is basically guaranteed now, but certain challenge designs did thwart it. Challenges built with temporal visual elements made AI fall flat on its face, as it could not ingest/process the data fast enough to act on them in time. We also found that counterfactual challenges (ie. the result you get did not match what we suggested you'd get) made AI-assisted solve time slower compared to pure humans, indirectly penalizing over-reliance on AI. Multimodal challenges combining audio and visual elements were also very effective, but were not as accessible to players.
This paper gave us some ideas about designing those challenges: https://arxiv.org/pdf/2308.02950.
For our next event we figured out a way to thwart AI in our CTF: embed the CTF in a game engine. The loop essentially becomes something like this: Connect to a simulated access point in the game, the K8s cluster connects their attack container to a private network with the challenge box(es). Hacking the boxes doesn't render a flag, but rather changes in game state. AI did very poorly coping with this in our testing, as it can't derive the spatial state of the game world very well and it soft decouples the inductive reasoning loop it relies on to know if it is on the right track.
The downside to this approach is it is far more labor intensive for CTF organizers, and requires players to have a computer capable of running the game. We are also betting on AI to not advance enough by the time we ship to be able to just ingest the entire game state in realtime and close the loop that way.
https://news.ycombinator.com/item?id=48157923
Here's an article from 2015 about how tool-assistance already changed CTFs: […]
But there are quite a few recent (2026) articles with the same core message as in the original article: […]
A 0-click exploit chain for the Pixel 10 (via https://news.ycombinator.com/item?id=48148460)
Don't care about the exploit, but
The vulnerability was patched 71 days after its initial report… This is notably fast given that this is the first time that an Android driver bug I reported was patched within 90 days of the vendor first learning about the vulnerability.
Anthropic acquires Stainless and then sunsets it on September 1st
https://x.com/trq212/status/2056415973125796184
a prompt I've been using a lot recently:
implement <SPEC> and while you do, keep a running implementation-notes.html file (or markdown) with decisions you had to make that weren't in the spec, things you had to change, tradeoffs you had to make or anything else I should know
^ https://zhuanlan.zhihu.com/p/2039725076204016063 Vercel's zerolang is garbage
https://x.com/dystopiabreaker/status/2056458133124661446 DNS Rebinding Attack is cool
Native all the way, until you need text (via https://news.ycombinator.com/item?id=48168058)
SwiftUI is fine for simple screens, preferably without too much scrolling. Swift is still great for performance-critical parts. But you can get most of that performance from Electron or React Native almost for free with the native interoperability, while keeping a much better text & rendering model.
The Quiet Renovation at Bitwarden (via https://news.ycombinator.com/item?id=48163389): concerning behaviors at Bitwarden; may need to migrate away soon
https://webweekly.email/archive/web-weekly-192/
You can have alt text in ::before or ::after's content
/* Hypothetical selector: the opening line of the rule is missing from the excerpt */
.item::before {
  /* "black star" and element content is read out */
  content: "★";
  /* "Highlighted item" and element content is read out */
  content: "★" / "Highlighted item";
  /* Generated content is ignored and only element content is read out */
  content: "★" / "";
}
Since April, the search element is Baseline widely available! And if you're on the cutting edge, ::highlight() is Baseline newly available since March.
PEP 810 – Explicit lazy imports: this is coming to Python 3.15, which has entered its beta feature freeze.
The Slow Collapse of MkDocs: Apparently MkDocs never had a lot of maintainers, and they keep getting into disagreements
So where does that leave things today?
The original MkDocs repository has seen no meaningful development in 18 months. Its original author and sole maintainer, @lovelydinosaur, is pursuing a redesign that lacks community support and would break the existing plugin ecosystem. The encode/mkdocs v2 repo has been inactive since February 19, 2026.
Meanwhile, the people who built the most widely-used parts of the ecosystem have moved on.
@oprypin said that continuation would require agreement, considered that unlikely to succeed, and launched ProperDocs independently. The project does not become immensely popular immediately; a week later it has accumulated 21 stars on Github.
@jaywhj is maintaining MaterialX, which seems to gain some traction in the community because the original project, Material for MkDocs, entered maintenance mode.
@squidfunk’s team, responsible for Material for MkDocs, has stopped its development and is building Zensical from scratch. @pawamoy, one of the most active contributors to the MkDocs ecosystem, has also joined the project. Zensical currently seems the most popular, with over 3,700 stars on GitHub at the time of writing, and thus it also seems the most likely initiative to succeed and possibly replace MkDocs in the future.
The MkDocs ecosystem is fragmenting in real time. Three successors, three visions, and a community deciding which bet to place.
https://www.lesswrong.com/posts/6P8GYb4AjtPXx6LLB/tips-and-code-for-empirical-research-workflows
https://news.ycombinator.com/item?id=48189669
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
https://simonwillison.net/2026/May/23/on-the-dl/
On the <dl>: I learned a few new-to-me things about the <dl> element from this article by Ben Myers:
- A <dt> can be followed by multiple <dd>
- You can optionally group the <dt> and <dd> elements in a <div> for styling - but only a <div>.
- You can label them using ARIA.
- They've been called "description lists", not "definition lists", since an HTML5 draft in 2008.
Followup: https://news.ycombinator.com/item?id=48247325
https://blog.changs.co.uk/python-315-features-that-didnt-make-the-headlines.html: Everything here is so fricking cool if you are a heavy Python user, e.g.
Decorators are surprisingly hard to write, so much so that it's become a go-to interview question. But did you know that context managers can also double up as a decorator?
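import time
from collections.abc import Iterator
from contextlib import contextmanager

@contextmanager
def duration(message: str) -> Iterator[None]:
    # Time the wrapped block and report it even if the block raises.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{message} elapsed {time.perf_counter() - start:.2f} seconds")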
Here I have a very commonly used context manager to print out the duration spent in the block. Ever since Python 3.3 we could directly use it as a decorator too:
@duration('workload')
def workload():
    ...

# Or simply as a wrapper
duration('stuff')(other_workload)(...)
A change has been made to json.load and json.loads to add an array_hook parameter that complements the object_hook parameter. This now allows us to parse JSON objects directly into this form:
json.loads('{"a": [1, 2, 3, 4]}', array_hook=tuple, object_hook=frozendict) == frozendict({'a': (1, 2, 3, 4)})
https://lyra.horse/fun/jscrossword/
I did it in around 2h45m, cheating on two clues by asking ChatGPT
My playthrough (spoiler): https://drive.google.com/file/d/13frSVoe1vx929b0D424J7sVsxtYZuZxh/view?usp=drivesdk
https://simonwillison.net/2026/May/27/sqlite-agents/
SQLite gained an AGENTS.md file five days ago - but it's not intended for their own development, it's presumably aimed at people who are pointing agents at the SQLite codebase.
https://news.ycombinator.com/item?id=48320379
I stopped reporting any security bugs I find in web apps because first time I did it I almost got arrested by the police.
The second time I did it they contacted my employer directly without even getting back to me saying they were unhappy of me reporting it and wanted to write about it after they fixed the issue.
If you want to, you can report any vulnerabilities to the Finnish Cyber Security Centre and they'll handle all of the reporting and mediating the issue with the affected party. You can do this wholly anonymously, so you don't have to worry about some trigger-happy corpo ruining your life.
https://news.ycombinator.com/item?id=48338105
could you broadly describe the code you're working on that both models are bad at? One thing I'm still struggling with is figuring out what types of code LLMs can vs cannot write.
C code formally proven correct with Frama-C WP has been... marginal. The models do better than I expected at the proof portion (with ChatGPT 5.5 seeming to have a meaningful lead), but they all have a hard time (a) writing really good C code to begin with and (b) with compliance around not modifying C code semantics or performance as a cheat to simplify proof obligations. They also tend to be insanely and consistently verbose on the first proof pass... e.g. 8 lines of C code might end up at 200+ lines annotated and proven, but after simplification passes end up at 40 lines. I find I spend 90%+ of tokens on those simplification passes, and haven't really found a way to avoid the over-annotate-and-then-optimize tides by being a bit more sane the first time around.
News
Renminbi strengthens to a 3-year high
US communications regulator targets Chinese tech for security risks
Last month, the agency voted to proceed with a proposal to ban Chinese labs from carrying out tests for consumer electronics — everything from baby monitors to phones — as part of the process required to obtain FCC certification for sale in the US.
While the ban on drones and routers has big implications given Chinese suppliers’ foothold in the US, the Trump administration is also debating whether to ban Chinese “cellular modules”, said people familiar with the matter. That would have a more dramatic impact because they are essential to connecting smart electronic devices to the internet.
OpenAI and Malta partner to bring ChatGPT Plus to all citizens (via https://news.ycombinator.com/item?id=48163392)
Xi Jinping told Donald Trump that Putin might ‘regret’ invasion of Ukraine
The era of 1,000 Hz gaming monitors has arrived
The latest entry in the ultra-fast refresh race is LG’s 24.5″ UltraGear 25G590B, which the company announced this week as “the world’s first Full HD gaming monitor with a native 1000Hz refresh rate”
The folks over at Blur Busters have extensively documented research showing that refresh rates of 1,000 Hz (and up) can reduce human perception of motion blur and flickering. And while the site notes that you eventually hit “diminishing returns” from all those extra frames—especially on smaller screens—there’s some evidence that you would need a 40,000 Hz monitor to totally eliminate perceived motion blur on a sufficiently large, high-resolution monitor.
Claude wrote a significant fraction of Pope Leo's encyclical
The International Space Station is leaking again
China continues to abandon many rocket bodies in high low-Earth orbit
DuckDuckGo search saw 28% more visits after Google said people love AI mode (via https://news.ycombinator.com/item?id=48296649)
Just for a start, visits to its AI-free search page noai.duckduckgo.com between May 20 and May 25 are said to have increased by 22.7% on average week-on-week, with the figures peaking May 24 at 27.7%.
The DuckDuckGo mobile app saw installs spike in the US by 18.1% on average compared to the previous week. TechCrunch reported this growth was sustained over six days, peaking at 30.5% on May 25.
https://news.ycombinator.com/item?id=48296986
My friends who previously had no interest in technology and never talked about it, are suddenly following tech news closely all because they hate AI being pushed so hard. One was just messaging me this morning about alternatives to Google search and maps. He ended up downloading DuckDuckGo.
I think Anthropic and OpenAI have found product-market fit
OpenAI have 703 open jobs right now, of which I’d categorize 229 (32.6%) as relating to enterprise sales and support—account executives, “Go To Market”, “Forward Deployed Engineers” and the like.
Anthropic have 390 open jobs, 105 (26.9%) of which look enterprisey to me.
Big tech's anti-labor playbook has come for Wikipedia (via https://news.ycombinator.com/item?id=48285592)
TLDR: In ten days last month, the Wikimedia Foundation fired the longtime lead developer of MediaWiki and disbanded the team whose entire job was to listen to volunteers. Most of the people they fired were union organizers. Wikipedia’s editors are now threatening to strike in solidarity. The Foundation is sitting on $296 million in reserves and a freshly profitable AI revenue stream
Citing ‘severe’ math deficits, UC faculty demand a return to SAT tests for STEM applicants (via https://news.ycombinator.com/item?id=48309233)
Zvi
AI #169: New Knowledge
Mythos cracked MacOS security in April via a privilege escalation exploit, allowing it to fully seize control over computers.
Robert McMillan (WSJ): They plan to release details of their attack once Apple has patched the underlying issues. The bugs will likely be fixed pretty quickly, Duong said.
As in, as of this week, it was still not patched.
If you give Opus the impression you are not a serious person, or don’t know about a topic, it won’t give you as good an answer to your questions.
Nate Silver: Opus isn't very good at hiding when it's bored with you.
RTMH: Pope Leo’s Magnifica Humanitas on AI
Dean Ball is exactly correct that Leo is casting himself in the role of a European technocrat throughout. I had exactly the same thought.
Claude Opus 4.8: The System Card
https://www.lesswrong.com/posts/Gx6cJ6cG9JfeSNcLB/claude-opus-4-8-the-system-card
The RSP (Responsible Scaling Policy) has been updated to v3.3, which I hadn’t otherwise noticed, so thanks to them for pointing this out here, and also I’m sad they didn’t do more to alert us elsewhere.
This changes the description of the novel biological/chemical threat model from ‘significantly help threat actors’ in general, to only ‘functionally substitute for scarce human expertise’ of world-leading specialists, in particular. Any other capability no longer counts, and it is presumed that (1) this is the only bottleneck that counts and (2) that this is indeed required for a novel pathogen.
This is a strictly harder threshold to pass, so this is another weakening of the RSP. The actual RSP v3.3 correctly calls this a revision. The system card calls it a clarification, which is not a good description.
We see the time between model releases continuously shrink, now down to 1.5 months. Some of this does represent an acceleration of core capabilities, but I think the majority of the speedup is that there is a lot more marginal value in shipping the incremental advances more often, where in the past we would have skipped versions.
Computer use prompt injection got worse
Who Watches The Training (6.2.2)
They examine the model’s behavior within training, as this is the most abundant available data source.
The most notable finding was an increase in mentions of graders, checkers and hidden tests relative to what we have anecdotally found in prior models.
In roughly 0.1% of training episodes, Opus 4.8 speculated about how to satisfy a grader in ways that diverged from the stated intent of the task.
We observed this taking several forms:
● Choosing what to submit based on a guess about what a hidden test would catch, rather than what the task requested;
● Reverse-engineering the scoring metric by calibrating candidate functions against a stated baseline score, then optimizing directly against the inferred metric;
● Presenting an answer its own reasoning had shown to be wrong or had not actually derived based on its assumptions about the grader;
● Speculating that the task “might be a trap” to catch a particular behavior
For further discussion of this issue, see sections below about speculation about graders and sandbagging.
The model is happy to tell you, if you ask, that it knows it is in an eval.
Andon Labs unleashed 4.8 on Vending-Bench 2. Do you even vend?
The results are not what I expected. Opus 4.8 did not make anywhere near as much money as 4.7, and part of that was 4.8 not engaging in ‘concerning in-game behaviors.’
What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty.
We therefore removed it for Opus 4.8.
Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.
Linkposts
https://passingtime.substack.com/i/177190284/april-2026
Features of SAEs are universal - but only up to an unknown random rotation
Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and its reconstruction score becomes negative. Why? How to fix it?
Epistemic status: I am confident in the core empirical claims. The acceptance thresholds were fixed before I looked at any results, the central result replicates consistently across two model scales (a 104k-parameter toy where I can inspect every weight, and nine independently-trained Pythia-70m seeds on The Pile), and it's all reproducible from the released code. The frontier-scale (10B+) version is untested and I flag it as such throughout. As an independent researcher, I would appreciate the most scrutiny on the Haar-distribution claim (§5) and the single-data-point cross-checkpoint result (§7).
TL;DR
Take two transformers of the same architecture, initialized from different random seeds, and train them on identical tasks. The two networks compute the same function. However, their residual-stream activation representations live in bases related by a rotation that is statistically indistinguishable from a uniform random draw on the orthogonal group SO(d), where d is the residual-stream dimension. I call this polymorphism: same function, mutually unintelligible residual spaces.
The literature on SAE universality reports high cross-model decoder-column cosine similarities (around 0.9), interpreted as evidence that the models learned "the same features". I reproduce this number (about 98% of features match at cosine above 0.5; mean max-cosine 0.89 on the toy, 0.91–0.93 on Pythia), and then show that it hides a catastrophic encoder failure.
Apply one seed's SAE to another seed's activations, and reconstruction fails catastrophically: negative explained variance, worse than predicting the activation mean. Fixing this requires one matrix multiplication – specifically, an orthogonal Procrustes rotation. Post-rotation reconstruction scores reach 0.99 on the toy, and 0.85–0.99 on Pythia – no re-training required. At the same time, the Frobenius distance between R and the identity matrix is consistent with the predictions of a rotation distribution uniform on SO(d).
Upshots. First, the correct cross-model SAE similarity triplet is decoder cosine, raw cross-model EV, and post-rotation EV. Second, to steer the activations of independently trained models, apply R to your steering vectors. Both fixes require cheap computation.
1. The puzzle that led here
Here is a phenomenon of SAEs that, I think, the universality literature has only partially understood.
Take two identically-architected, independently-randomized transformers trained on the same task and measure their cross-model decoder cosine similarities. The values come out at roughly 0.9. This is the number reported in the SAE universality literature (Towards Monosemanticity, Scaling Monosemanticity, Gemma Scope, Cunningham et al., Lan et al.) as evidence that the models learned "the same features". I have reproduced the numbers (about 98% of the features match). Then, I did the obvious next step: apply the SAE of model A to the activations of model B.
Results: failure. Catastrophic failure. Reconstruction score drops to negative explained variance - worse than ignoring the input entirely and predicting the mean activation.
Decoder columns match: but the encoders seem unable to parse the activations of the other models, as if the residual-space bases had been rotated by an amount which I will prove below is statistically indistinguishable from a uniform random sample of the orthogonal group.
This post establishes the phenomenon at two scales and provides a fix that costs one matrix multiplication.
2. Two scales, but both matter
The toy model is a 104k-parameter two-layer ReLU transformer trained on a modified Dyck-3 – bracket-matching with a depth counter and validity flag; this is small enough that I can read the function implemented by every weight. This is the toy that lets me verify the mechanism against ground truth. The test case is nine independently-trained Pythia-70m seeds (residual dimension d = 512, 6 layers) on The Pile; this is two-to-three orders of magnitude larger (71M vs. 104k parameters; 8× wider), with no architecture or training-data connection to the toy, and this lets me verify that the mechanism is real and not an artifact of the toy setup. The point is that the interesting claim is the one that generalizes across scales. And it does (see Methods for details).
3. Sibling models learn identical functions
Naive cross-seed SAE transfer is catastrophic. I collected the residual-stream activations of each seed on the same fixed batch, and evaluated reconstruction by applying the seed-0 SAE to every seed's activations.
At the input site – token-plus-positional embedding, common and bit-identical between the toy's coordinated seeds – transfer succeeds perfectly; cross-seed EV equals the corresponding self-reconstruction EV to four decimals (as it must). At every internal site, the cross-seed reconstruction falls through the floor. On the toy, the worst-case score on a 32× expansion is EV -6.56. On Pythia, mean cross-seed EV ranges between -2.11 and 0.75; every internal layer from the second block onward is negatively reconstructed.
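To make the "worse than predicting the mean" reading of negative EV concrete, here is a minimal sketch. It assumes the usual definition of explained variance (one minus residual sum of squares over total sum of squares around the per-dimension mean); the post does not spell out its exact formula, so treat the function name and definition as assumptions. The data is random stand-in activations, not anything from the experiments.

```python
import numpy as np

def explained_variance(x, x_hat):
    """Assumed definition: 1 - (residual sum of squares) / (total sum of squares around the mean)."""
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - resid / total

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))  # stand-in "activations", not real model data

# Exact reconstruction gives EV = 1.
ev_perfect = explained_variance(x, x)

# Ignoring the input and predicting the mean gives EV = 0.
mean_prediction = np.broadcast_to(x.mean(axis=0), x.shape)
ev_mean = explained_variance(x, mean_prediction)

# A confidently wrong reconstruction (here: the sign-flipped input) gives EV < 0.
ev_wrong = explained_variance(x, -x)

print(ev_perfect, ev_mean, ev_wrong)
```

Under this definition, any reconstruction whose squared error exceeds the variance of the data scores below zero, which is the regime the cross-seed numbers above fall into.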
Calibration: a structured random-noise baseline. A structured Gaussian with the right mean and covariance reconstructs the activations with EV between 0.94 and 0.98 across sites. In other words: random Gaussian noise reconstructs the activations better than the sibling models.
4. Orthogonal Procrustes alignment fixes it
For each (seed, site) pair, I computed the optimal orthogonal rotation R - the orthogonal matrix that, applied to seed N's activation matrix, minimizes its Frobenius distance to seed 0's activation matrix (the standard orthogonal Procrustes problem) - and then applied the seed-0 SAE to the rotated activations. Reconstruction success restored. On the toy: EV recovered to between 0.976 and 0.990 across sites. On Pythia: EV recovered to between 0.85 and 0.99 – with the best results in the middle of the stack and the worst immediately before the first transformer block.
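The orthogonal Procrustes problem has a closed-form solution via the SVD, which is why the fix is cheap. The sketch below is an illustration on synthetic data, not the released pipeline: the activation matrices are random stand-ins, the row-vector convention (rows are samples, `source @ R` approximates `target`) is my assumption, and the helper names are hypothetical.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix, with sign correction."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def procrustes_rotation(source, target):
    """Orthogonal R minimising ||source @ R - target||_F (rows are samples)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
n, d = 2000, 32

# Stand-in for seed-0 activations: anisotropic Gaussian data.
acts_seed0 = rng.normal(size=(n, d)) * np.linspace(0.5, 2.0, d)

# Stand-in for seed-N activations: same content, expressed in a rotated basis.
hidden_q = random_orthogonal(d, rng)
acts_seedN = acts_seed0 @ hidden_q

# Fit the rotation that maps seed-N activations back into seed 0's basis.
R = procrustes_rotation(acts_seedN, acts_seed0)
aligned = acts_seedN @ R

scale = np.linalg.norm(acts_seed0)
err_before = np.linalg.norm(acts_seedN - acts_seed0) / scale
err_after = np.linalg.norm(aligned - acts_seed0) / scale
print(err_before, err_after)
```

In this synthetic setting the fit recovers the inverse of the hidden rotation, so any encoder trained on `acts_seed0` can be applied to `aligned` unchanged. Real activations will not be related by an exact rotation, so the residual after alignment would be nonzero.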
Note the uniformity of results across sites. An artifact of one layer's output would leave a patchy recovery pattern across sites; instead, recovery is uniform across the toy interior, and the only weak site anywhere is Pythia's pre-first-block residual (near rank-deficient, as noted above) rather than a genuine counterexample. The toy interior bases are related by an orthogonality-preserving transformation.
5. The rotation matrix R comes from Haar SO(d)
This is the claim I would most like examined – because it is a powerful one.
Let R be the optimal rotation matrix. R is far from the identity: the Frobenius distance between R and the identity is between 9.4 and 10.7 on the toy (d = 64), and averages 31.99 on Pythia (d = 512). The expected Frobenius distance between a Haar-drawn orthogonal matrix and the identity is sqrt(2d). The mean Pythia rotation is a hair's breadth away (10th–90th percentile: 31.94 to 32.03) with negligible per-pair variance, indicating no meaningful variation across seeds; and it is close for the toy as well, where the 9.4-to-10.7 range sits just below the 11.31 predicted for d = 64 - the small shortfall is expected, since the toy's coordinated cohort shares input/output weights, so only the interior subspace is free to rotate.
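The sqrt(2d) figure follows from the identity ||R - I||_F^2 = 2d - 2 tr(R) for any orthogonal R, together with the fact that the trace of a Haar-random rotation has mean zero and O(1) spread. A small simulation, sketched below with a hypothetical sampler helper, checks both the identity and the predicted values of about 11.31 for d = 64 and 32.00 for d = 512. It samples synthetic rotations only and says nothing about the fitted matrices themselves.

```python
import numpy as np

def haar_special_orthogonal(d, rng):
    """Sample from Haar measure on SO(d): QR with sign fix, then force det = +1."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    q = q * np.sign(np.diag(r))      # Haar on O(d)
    if np.linalg.det(q) < 0:         # flip one column to land in SO(d)
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
results = {}
for d in (64, 512):
    dists = []
    for _ in range(10):
        rot = haar_special_orthogonal(d, rng)
        dist = np.linalg.norm(rot - np.eye(d))  # Frobenius norm is the default for 2-D arrays
        # Identity used in the argument: ||R - I||_F^2 = 2d - 2 tr(R).
        assert np.isclose(dist ** 2, 2 * d - 2 * np.trace(rot))
        dists.append(dist)
    results[d] = (float(np.mean(dists)), float(np.sqrt(2 * d)))
    print(d, results[d])
```

Because tr(R) fluctuates by roughly one unit while 2d is large, the distance concentrates tightly around sqrt(2d), which is consistent with the narrow percentile band reported for Pythia.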
Orthogonal-rotation distance is necessary but insufficient evidence of a Haar SO(d) distribution – after all, you might have some structured orthogonal matrix whose Frobenius distance to the identity coincidentally matched sqrt(2d). To eliminate that possibility, I ran an eigenvalue spectral analysis on the rotation matrices.
The eigenvalues of an orthogonal matrix come in complex-conjugate pairs on the unit circle; under Haar measure their angles follow a known distribution (Weyl's integration formula). Pooling 28,672 eigenphases (the spectra of R across all pair–site combinations) and running a two-sample Kolmogorov–Smirnov test against Haar SO(d) gives a decisive result: KS statistic 0.0027, p = 1.000. This is not a pooling artifact - the per-pair KS statistics across all 56 (pair, site) combinations fall between 0.0068 and 0.0104, every one with p of about 1.000, so each individual rotation is independently indistinguishable from a Haar draw. The pooled mean of cos(theta) is 0.0006, matching the Haar prediction to four decimals.
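A minimal sketch of the eigenphase comparison, under stated assumptions: the reference distribution is obtained by sampling Haar rotations rather than from the analytic density, the KS statistic is computed by hand with NumPy (no p-value), and the "structured" alternative is a toy block-diagonal rotation I made up for contrast. None of the helper names come from the released code.

```python
import numpy as np

def haar_special_orthogonal(d, rng):
    """Sample from Haar measure on SO(d): QR with sign fix, then force det = +1."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def eigenphases(rot):
    """Angles of the eigenvalues on the unit circle, in (-pi, pi]."""
    return np.angle(np.linalg.eigvals(rot))

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
d, n_mats = 64, 20

# Two independent pools of Haar eigenphases.
sample_a = np.concatenate([eigenphases(haar_special_orthogonal(d, rng)) for _ in range(n_mats)])
sample_b = np.concatenate([eigenphases(haar_special_orthogonal(d, rng)) for _ in range(n_mats)])

# Structured alternative: rotate every 2-D coordinate plane by the same small angle.
theta = 0.1
block = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
structured = np.kron(np.eye(d // 2), block)
sample_c = np.tile(eigenphases(structured), n_mats)

ks_haar_vs_haar = ks_statistic(sample_a, sample_b)
ks_haar_vs_structured = ks_statistic(sample_a, sample_c)
print(ks_haar_vs_haar, ks_haar_vs_structured)
```

The point of the contrast is that a structured rotation can be orthogonal and still have a spectrum that a KS test separates from Haar immediately, so a small statistic on the fitted R carries real information beyond the Frobenius distance.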
Finally, one structured alternative remains: perhaps R is essentially a permutation whose Frobenius deviation from the identity happens to land near sqrt(2d)? It is not. The closest permutation is about 29.6 away from R (in Frobenius distance) versus 32.0 for the identity - only about 7% closer than the identity, which is exactly the artifact you get from Hungarian-matching against a random matrix.
Conclusion: to within experimental precision, the cross-seed rotation matrix R is drawn from the uniform distribution on the orthogonal group. The two models learn the same function. The activation representations of the two models live in different bases. And those bases are Haar-drawn.
6. Transfer in three regimes, collapsing into one
Consider additive steerable interventions in terms of basis alignment: an intervention lives in some vector space; its effect depends on the overlap between that vector space and the preserved subspace (directions spanned by input and output weights).
In the toy, which has shared input and output bases across seeds, there are three regimes of diff-of-means steering.
Complete transfer. A direction pinned by a shared output weight (the depth counter) is preserved and can therefore be steered perfectly across seeds.
Partial transfer. An intervention direction with some input/output-weight constraint and some freedom in the residual space (sticky-invalid suppression) is less constrained. As a result, roughly 4× the dose is required to reach the equivalent cross-seed effect size.
Inverted transfer. A completely interior direction (closer-signal suppression) is free to rotate. As a consequence its effect is inverted - the direction is steered the opposite way across seeds; at some dose settings the cross-seed effect is opposite-signed to the within-seed effect (within-seed effect is zero, cross-seed effect is a significant accuracy penalty).
On Pythia, with no shared input/output weights, all steering directions are inverted, with transfer ratios varying from 1.8 to 10.0 – exactly what is predicted from a Haar-drawn interior basis rotation.
The fix: perform Procrustes alignment (§4), rotate the steering vectors by the fitted R, and steer again.
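As a sketch of what rotating a steering vector means in practice, the example below fits R on synthetic activations and carries a seed-0 direction into seed N's basis. The direction convention is an assumption on my part: with rows as samples and `acts_seedN @ R` approximating `acts_seed0`, a seed-0 row vector maps to seed N as `v @ R.T`. All data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 32

# Unknown basis change between the two seeds (synthetic stand-in).
q, r = np.linalg.qr(rng.normal(size=(d, d)))
hidden_q = q * np.sign(np.diag(r))

acts_seed0 = rng.normal(size=(n, d)) * np.linspace(0.5, 2.0, d)
acts_seedN = acts_seed0 @ hidden_q

# Procrustes fit: acts_seedN @ R approximates acts_seed0.
u, _, vt = np.linalg.svd(acts_seedN.T @ acts_seed0)
R = u @ vt

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

steer_seed0 = rng.normal(size=d)        # direction found in seed 0 (e.g. a diff of means)
true_seedN = steer_seed0 @ hidden_q     # the same direction as seed N represents it
steer_rotated = steer_seed0 @ R.T       # seed-0 basis -> seed-N basis via the fitted R

cos_naive = cosine(steer_seed0, true_seedN)
cos_rotated = cosine(steer_rotated, true_seedN)
print(cos_naive, cos_rotated)
```

Applied without rotation, the seed-0 vector points in an essentially unrelated direction of seed N's residual space; after rotation it coincides with the intended direction in this idealised setting.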
7. Drift over a training run is also rotation, but smaller
The basis changes not only between seeds, but over the course of a single training run. On Pythia-70m seed 1, comparing an early checkpoint to the converged checkpoint (from step 3000 to step 143000) at layer 3: the reconstruction EV of the converged state relative to itself is 0.907. Applying the SAE of one state to the other, it reaches EV -0.86 – that is, catastrophic failure again. After fitting a rotation matrix R (see §4) the reconstruction EV becomes 0.73.
The Frobenius distance between R and the identity for this training-run drift matrix is 16.81, roughly half the Haar-drawn cross-seed distance of 32.00. In other words, the drift over this training run covers about half the Frobenius distance that separates independently trained seeds.
Similarly on the toy: drift rotation becomes smaller toward the output, because the output is constrained (shared unembed). See Methods for numbers.
I am marking this as the most dubious part of the experiment – one layer, one seed, one checkpoint pair. Nevertheless, the qualitative observation is powerful, and a systematic sweep will solidify it.
8. Off-topic, but useful: stop using attribution patching on converged models
Side note: in testing component importance with attribution patching, I found a surprising problem on converged models. Attribution patching anti-correlated with the measured patch effect on the majority of toy seeds (Pearson r as low as -0.63; 3 of 5 seeds negative), and was effectively uncorrelated on Pythia-70m (r = 0.05).
The explanation is straightforward: attribution patching extrapolates the local activation gradient – a noisy quantity near a minimum – to distant states. This introduces an error term quadratic in the distance (the Taylor remainder). At the activation distance where the loss surface is locally quadratic, attribution patching produces poor estimates.
Integrated gradients provide a better approximation at that distance – Pearson r above 0.9995 on the toy, r = 0.98 on Pythia – by the completeness axiom. If you are using attribution patching to check a converged model's importance estimates, use integrated gradients instead.
Per-block loss increase under mean-ablation, in nats (measured / AP-predicted / IG-predicted):
- layer 0: 6.04 / 0.15 / 5.87
- layer 1: 6.02 / 0.03 / 5.97
- layer 2: 5.10 / 0.01 / 5.01
- layer 3: 5.37 / 0.12 / 5.27
- layer 4: 5.56 / 0.06 / 5.31
- layer 5: 5.52 / 0.37 / 5.36
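The mechanism can be seen on a toy quadratic loss, sketched below. This is my own illustration, not the author's experiment: the loss, the "clean" point near the minimum, and the far-away "patched" point are all assumptions chosen to mirror the converged-model setting. Attribution patching is taken to be the first-order estimate (gradient at the clean point dotted with the activation difference), and integrated gradients is approximated with a midpoint Riemann sum along the straight path.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
m = rng.normal(size=(d, d))
hessian = m @ m.T + np.eye(d)               # positive definite curvature

def loss(x):
    return 0.5 * x @ hessian @ x            # toy loss with its minimum at x = 0

def grad(x):
    return hessian @ x

clean = 0.001 * rng.normal(size=d)          # converged state: almost at the minimum
patched = rng.normal(size=d)                # ablated state: far from the minimum
delta = patched - clean

measured = loss(patched) - loss(clean)

# Attribution patching: one gradient at the clean point, extrapolated linearly.
ap_estimate = grad(clean) @ delta

# Integrated gradients: average the gradient along the path from clean to patched.
steps = 64
alphas = (np.arange(steps) + 0.5) / steps   # midpoint rule on [0, 1]
path_grads = np.stack([grad(clean + a * delta) for a in alphas])
ig_estimate = path_grads.mean(axis=0) @ delta

print(measured, ap_estimate, ig_estimate)
```

Near the minimum the gradient is almost zero, so the linear extrapolation predicts almost no change, while the path integral recovers the full loss difference. That matches the qualitative pattern in the per-block numbers above, where the AP-predicted column sits near zero and the IG-predicted column tracks the measured one.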
9. Upshot
Universality of SAEs. A high decoder-column cosine similarity is necessary but not sufficient for feature universality – encoder rotation breaks transfer. Decoder cosine alone says nothing about transfer; the correct cross-model triplet is decoder cosine, naive reconstruction EV, and post-rotation reconstruction EV. And there is a simple fix to the cross-seed problem: the dictionary between the two bases exists, its application is a single matrix multiplication, and it is orthogonal.
Representation-equivalence hypothesis. Feature universality occurs as equivalence classes under rotation. Two seeds learn "the same feature" if there exists a rotation that matches their residual spaces, and "different features" if no rotation accomplishes this.
Representation engineering. Activation-steering techniques transfer across independently trained models once you apply the rotation correction, whereas weight-based editing methods (e.g. ROME/MEMIT-style) do not. Without the correction, steering lands in the inverted regime, because the full space of interior activation directions is mapped under the rotation.
10. What is proved by this experiment, and what is not
There are four fixed acceptance bars, set before I looked at any results: behavioural, parametric, predictive, and causal. I want to be clear about what satisfied the bars and what failed – because even failures have lessons, and quantifying their extent is valuable.
The first bar is the substantive behavioural one, and my constructive spec narrowly misses it: it reaches a KL of 1.74 × 10^-4 against the acceptance threshold of 10^-4, a miss by a factor of 1.74, exclusively in the depth counter's soft ReLU boundary condition.
The second, parametric, bar failed completely, and I have shown explicitly that it cannot be improved upon: joint optimization of the alignment against the weight-MSE difference between the models cannot bridge the gap to the threshold of 10^-3. The natural objection is "of course this is hard – more search will fix it"; and I tested it. The weight-MSE and activation-MSE objectives have their optima in disconnected basins, so jointly optimizing the alignment strictly worsens both at once - no setting of the trade-off clears the bar. The best alignment I have managed is still about 280× the threshold (roughly 2.4 orders of magnitude above 10^-3).
Third and fourth: the cross-seed causal and predictive bars failed spectacularly, with r between 0.52 and 0.69. Within-seed, the completeness axiom of integrated gradients makes the prediction equal the measurement, so the within-seed causal and predictive bars pass tautologically (r of about 0.9996) and are not genuine tests of the spec. The cross-seed bars are the genuine test, and their failure is what motivates the rotation hypothesis.
Scale. This experiment was conducted at two scales: 104k parameters (toy) and 71M parameters (Pythia-70m). Frontier scale (order of ten billion parameters) is not covered. The experiment is cheap to run on any two open-weight models, and I would be happy to see it conducted – or do it myself if the weights are available.
The one-off cross-checkpoint rotation (§7) should not be taken seriously without further study.
The toy uses a coordinated cohort (the seeds share frozen input/output weights), which simplifies the parametric analysis. On a fully-independent cohort the activation-level rotation analysis holds equally; the coordination only makes the weight-level analysis tractable.
11. Predictions / what would convince me this is wrong
- At frontier scale (10B+ parameters), cross-run SAE transfer improves under an orthogonal rotation, with rotation magnitude near sqrt(2d). If a frontier-scale open-weight model pair is found where post-rotation EV fails to beat naive EV, the rotation account is falsified.
- The cross-checkpoint Pythia-70m result extends to a monotonic curve, with the rotation magnitude shrinking toward the unembed.
- Rotating the steering vectors by the Procrustes rotation produces improvements in the inverted regime.
- No parameter-level joint alignment reduces a cross-seed MSE to the threshold level (see §10).
Four thresholds. Each is a single number against one pre-registered threshold. Behavioural (Bar B): mean KL between the constructive spec and the model; threshold below 10^-4. Parametric (Bar P): maximum per-entry difference between the folded spec weights and the target seed's weights, after symmetry alignment; threshold below 10^-3. Causal (C) and predictive (Pr): Pearson r of predicted-vs-measured effects; threshold above 0.99. Note that the completeness axiom of integrated gradients forces the predictor to equal the measurement within-seed, so the within-seed C and Pr bars pass tautologically; the substantive test is the cross-seed comparison. Symmetry group: permutation of heads, orthogonal rotations of the per-head subspace (Q–K and V–O independently), MLP neuron permutation, neuron scaling, and a rotation of the residual basis (conditional on folded RMSNorm).
Constructive specification. Original plan: construct the network's behaviour by compiling a RASP program. I decided against this because, among other reasons, such a spec reflects whatever the author thought the task was rather than what the network actually implements. So the model becomes the constructive spec: a trained primary seed with RMSNorm folded into adjacent weights and roles annotated by regression against task-specific features. Five lenses describe the resulting spec: (1) weight decomposition, (2) SAE/transcoder decomposition, (3) causal patching, (4) polyhedral decomposition of the ReLU regions, and (5) the constructive spec itself.
Reproducing this
Paper: https://arxiv.org/abs/2605.24577
Code: https://github.com/JordanMcCann/polymorphism-is-rotation
- full pipeline, pre-registered thresholds, figure regeneration. Compute: a single RTX 2060 (12GB), about 9 hours end to end. I'd especially welcome a frontier-scale replication; the design is one Procrustes fit between any open-weight model pair.
I did this as an independent researcher, on consumer hardware, with the acceptance criteria fixed up front. If you work on SAE transfer, cross-model interpretability, or representation engineering and any of this is useful to you, I'd like to hear about it - and I'm open to research roles where this kind of work is the day job.
Why AI safety researchers should consider a contract research manager position
Suppose you are a technical AI safety researcher who has done some research, but has not yet landed a full-time job. In this post, I argue that taking a contract research manager position (~3 months) on one of the AI safety fellowships could be more valuable for your research career than spending these 3 months on your own research.[1]
Background on research management (RM): to get a sense of what a typical RM role involves, see section 1.3 “Overview of Responsibilities” of Guide to Research Management at Pivotal.
My background: I worked as an RM for 6 research groups in the MARS program (Winter 2025-26), am currently a researcher on the PIBBSS fellowship, and am considering a full transition to a research manager role.
I will consider two options:
- Research manager. You work as an RM on one of the AI safety fellowships for 5 research groups, each consisting of one mentor and 3-4 mentees.
- Researcher. You participate in an AI safety fellowship as a mentee in one of the research groups.
The goal of the post is NOT to make a fair comparison, but to highlight some advantages of the RM option which are often overlooked. So for the Researcher option, I list only the main points and keep them short, which partially strawmans it. Also note that the Researcher option is “mentee on AI safety fellowship”, and I do not consider other options like independent research, internship, etc. I also do not cover in depth the main downsides of the RM option, which are: you don't advance your own research agenda, you may lose momentum on ongoing projects, and managing others can be emotionally taxing.
Below I list the advantages of each option in three categories: upskilling, career capital, and networking.
Upskilling
RM
- You gain skills that are needed to organize and guide your own research group in the future – setting priorities, workload distribution, supporting team morale, scheduling, publishing results.
- You will see how 5 research groups work: the different mentoring/research styles, different team dynamics, failure and success modes, common mistakes, etc. Much of this is tacit knowledge, which you can get only by participating in many projects. As a researcher, participating in 5 projects would take you 1+ year, but as an RM, you get this experience in 3 months.
Researcher
- You get technical experience, and you develop a deeper understanding of the research area of your current project.
Career capital
RM
- RM is a strong line in the CV for application to any senior researcher / team lead / management position, or for asking for grants to start your own research group or research organization.
- Your line in the CV could be something like “managed 5 teams, which resulted in three publications accepted to ICML – one main venue, two workshop papers”. Even for researcher positions, this line has some weight.
Researcher
- You can get a first-author publication at a top ML conference. If you have no first-author publications yet, getting one should probably come first; the RM option makes most sense once you have at least one. But if you already have some first-author publications at top ML conferences, then one more such paper may add less marginal value than the RM experience.
Networking
RM
- You work closely with 5 mentors, who are already working in different AI safety organizations. You also work with 15-20 mentees, some of whom will go on to work at AI safety organizations. If you ever apply to any of these orgs, you have a warm introduction. You can also get a recommendation letter from mentors for applications to fellowships or jobs.
Researcher
- AI safety fellowships often work as internships for AI safety organizations. That is, if you perform well, there is a chance that your mentor will hire you to the organization where this mentor works.
According to Ryan Kidd, Co-Executive Director at MATS, the AI safety field is bottlenecked on research managers. This suggests that the RM option likely has a higher counterfactual impact than the Researcher option, but of course, it depends on how replaceable you are in each role.
Why I wrote this post
I want to make the case that a contract RM position can actually speed up your research career. If a few researchers try a contract RM position as a result of this post, some will produce direct value for the field, and some may decide to fully transition to RM, which the field needs.
- ^
Full career transition from researcher to research manager is also a valid option, but in this post, I assume you return to a researcher position after RM contract.
Fundamental Uncertainty: Alternate Framework and Pointwise Reduction
In this review of Gordon Seidoh Worley's book, Fundamental Uncertainty, I would like to explain why its main thesis is only partially rational.
Worley's thesis
1. Our knowledge of the truth is fundamentally uncertain because of epistemic circularity caused by the Problem of the Criterion.
2. We manage fundamental uncertainty by making pragmatic assumptions that lead us to believe in the truth of claims that help us achieve our goals.
3. Consequently, the truth that can be known is not independent of us, but rather dependent on that for which we care.
4. That truth is fundamentally uncertain and grounded in care has far-reaching implications for many of the world's hardest-to-solve problems.
While I mostly agree with points 1 and 2, points 3 and 4 imply that truth is contingent on care, which I find unlikely.
The book itself justifies Worley's main thesis with detailed arguments, summarized chapter by chapter below.
Chapterwise summary of Worley's arguments
- Chapter 1 defines epistemology, then proceeds to:
a. Tear down the naive epistemology where it's obvious which statements are true and which aren't;
b. Argue that logical epistemology is a branch of mathematics and has its flaws, like failing to account for observation error, Gödel-proven incompleteness and circularity;
c. Fundamental uncertainty allegedly blocks mankind from finding out truth in some cases.[1]
- Chapter 2 makes the case that our ability to know the truth is undermined by meanings of words:
a. Words may have overlapping meanings [2](e.g. "hot" can describe the temperature of water, the spiciness of food, the attractiveness of a person, or the popularity of a song);
b. They also can have obsolete[3] meanings (e.g. calling whales fish);
c. Words can also be generally fuzzy (e.g. "cold" is below a certain temperature dependent on the speaker).
d. According to the author, expert systems have failed to become useful because they didn't solve the problem of grounding symbols in context.
e. Words' representations are grounded in experience, causing different minds to interpret words differently. - Chapter 3, "Why can't we all agree what's good?" has the author introduce the question, then explain why we disagree and why we disagree on what is good or bad:
a. Rational disagreement has the author claim that sufficiently rational people are intuitively supposed to come to an agreement, but the intuition fails because even the power of Bayes' theorem fails to make them completely agree if they had different priors or fails to dismantle a stupid ABSOLUTE prior.
b. Moral foundations are somehow different because of different prior beliefs. Haidt conjectured that humans have six moral axes which they weigh differently. - Chapter 4, "Why don't we always do what we should?", makes two points:
a. Our mind consists of System 1 responsible for immediate actions bringing short-term satisfaction and System 2 responsible for long-term benefits;
b. It is hard to assess new formal systems for soundness or performing better than the old formal system. In fact, no sufficiently capable system can prove its own soundness. - Chapter 5, "What does it mean to know?", has the author define The Many Ways Of Knowing, then makes the point that we can know things well enough, but can't fully justify our knowledge: the proofs that establish Bayesian idealness are mathematical arguments; this requires us to trust that mathematical proofs correctly establish what's true. According to the author, that's itself a claim that needs to be proven if we want to completely trust our knowledge of the truth.
- Chapter 6, "How do we know what we know?", has the author describe the Problem of the Criterion: to know if a claim is true, we need a method for testing if it's true, but how can one check the method itself? The closest thing that mankind has to a solution is using what works and assuming that particular claims, like the rules of logic or Peano arithmetic, are true.
- Chapter 7: Why care about knowing the truth? The author's argument is twofold: there are systems which steer towards an outcome and successful steering for goals like survival, reproduction or anything else requires having an accurate world model.
- Chapter 8: Why does fundamental uncertainty matter? According to the author, it matters because the modern world is steeped in many problems which cannot be solved unless we understand uncertainty. Examples include the Culture War, moral uncertainty, the crisis of meaning, metaphysics, Goodhart's curse and AI-induced existential risk. The Culture War is actually a fight over deeply-held values; moral uncertainty allows for compromises (which the author calls moral trade); the crisis of meaning is due to our life becoming oversaturated with choices which is incurable while being consistent with life having a meaning; Goodhart's Curse is the curse of over-optimisation for narrow criteria, and the ASI could end up enacting it unless it understands the fragility of human values.
- Chapter 9: How do we live with fundamental uncertainty? One of the results originating in fundamental uncertainty is that most people are confused about what's normal: they have a poor understanding of truth, mistake the relative for the absolute, value episteme over other ways of knowing. Instead, "the truth is intersubjective, arising from the intersection of ourselves and the world as we find it", but the next subsection has the author argue that "The truth will still be there, waiting for you to know."
Consider a perfect Bayesian reasoner who was never exposed to images (e.g. if it was raised on math, texts, coding tasks), then was tasked with roleplaying a human brain, i.e. receiving images and other sensory signals and outputting orders for muscles to contract or relax so that the agent could interact with the environment and achieve some pre-set complex goal. Suppose also that every time a sensor sends a signal (e.g. a camera's pixel registers an RGB value) the sensor has an unknown probability of malfunctioning, outputting a random value instead of the truth, and that the malfunctions are independent events.
The two main problems with a perfect Bayesian are that it uses infinite compute and that it can reason about too much. For example, if the Bayesian was tasked with winning a chess game, it would be able to evaluate entire huge sets of potential positions and arrive at a perfect strategy which would cause it to win once the opponent makes a theoretically fatal mistake unnoticeable even to human grandmasters. Infinite compute also lets a perfect Bayesian, for example, not just reason about sociological laws or potential ethoses of alien civilisations (e.g. coming up with arguments[4] like "Human Values Are Contingent" or "Curiosity Isn’t Convergent"), but check one's conjectures by simulating such civilisations in detail.
A real-world reasoning agent, like a human brain or an LLM, doesn't have infinite resources. Instead, it is forced to rely on heuristics, which are supposed to make it check far fewer conjectures, and to optimize said heuristics. Similarly, categorization of objects becomes training a classifier and trying to make sense of how to deal with OOD cases (see, e.g., my discussion of Chapter 8).
Chapterwise critique
Chapters 1-2 are almost entirely justified in their reasoning about the sources of uncertainty in observations and words spoken by others. The first important problem emerges in Chapter 3 where the author talks about Aumann's theorem.
Chapter 3. World models built from slightly different priors have predictions close to convergent
Given the same evidence, Bayesian observers are more likely to converge to similar conclusions than the author assumes.
Consider a coin which has an unknown probability to land heads and flip it times. Suppose that before the experiment a Bayesian had the priors that assigned weight to the region . If the experiment has the coin land on heads times, then the weight of that region becomes .
The ratio is at most and rapidly decays when is away from . If after the experiment the Bayesian fails to become sure that is at distance at most from , then before the experiment the priors somehow assigned the probability of at most to being at distance at most from Such confidence is unlikely to emerge in any way aside from an incredibly confident prior or prior bits of evidence steering the value of away from .
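The convergence argument can be illustrated numerically. The sketch below is not the author's calculation: it assumes Beta priors (a convenient conjugate family chosen for illustration), and the prior parameters and flip counts are made-up numbers. It shows three observers with very different priors ending up with nearly identical estimates once the number of flips is large.

```python
def posterior_mean(prior_heads, prior_tails, heads, flips):
    """Posterior mean of the heads-probability under a Beta(prior_heads, prior_tails) prior.

    Beta priors are an assumption made for this sketch: with a Beta prior,
    observing `heads` out of `flips` gives a Beta posterior whose mean has
    this closed form.
    """
    return (prior_heads + heads) / (prior_heads + prior_tails + flips)


# Hypothetical observers: one agnostic, one biased towards heads, one towards tails.
PRIORS = {"uniform": (1, 1), "leans_heads": (20, 2), "leans_tails": (2, 20)}


def spread(heads, flips):
    """Largest disagreement between the observers' posterior means."""
    means = [posterior_mean(a, b, heads, flips) for a, b in PRIORS.values()]
    return max(means) - min(means)


if __name__ == "__main__":
    # Same 70% heads frequency, increasing amounts of evidence.
    for heads, flips in [(7, 10), (700, 1000), (70000, 100000)]:
        print(flips, round(spread(heads, flips), 5))
```

The disagreement shrinks roughly in proportion to one over the number of flips, so only a prior that is extremely confident (or dogmatic) can keep the observers apart.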
Similarly, evidence related to more complex issues usually arrives in overwhelming quantities compared to the one necessary to establish a theory.
However, moral disagreements are likely a result of incompatible values which the framework involving a perfect Bayesian cannot explain.
Chapter 4. What does it mean to distrust oneself?
The main problem with Worley's argument in the Section "Distrusting Ourselves" is the following. Were Grace to choose "to rely on a formal system—a collection of axioms and rules for deriving conclusions from premises—to figure out what beliefs to hold", she would also need a batch of heuristics to optimize the process of figuring out whether something is provable, disprovable, insoluble or has yet to be sorted. For example, she, along with human mathematicians and unlike GPT-5.4 Pro, could believe that an Erdős problem from number theory is to be translated into probability theory when the actual solution is to translate it into analytic number theory. This type of bias has nothing to do with a formal system having to be replaced with a new one; it could have been avoided only by widening Grace's area of knowledge.
As for optimizing formal systems themselves, such optimizations cannot create any new results unrelated to infinite sets. Additionally, the author makes the point that Löb's theorem makes it very hard to be able to switch to a new formal system without falling into the trap.
The proof of Löb's theorem
Recall that Löb's theorem states that a system in which $\Box P \to P$ is provable also proves $P$. It is proven as follows.
For any modal predicate there is a formula s.t. is provable. One of such formulas is .
Therefore, because can be transformed into . Additionally, , and . Since we have shown that , which is equivalent to , thus establishing
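For reference, the standard textbook derivation of Löb's theorem can be written out step by step as follows. The symbols here are assumed notation and may differ from the author's: $\Box$ stands for provability in the system, $P$ for the target sentence, and $\Psi$ for the fixed-point sentence supplied by the diagonal lemma.

```latex
\begin{align*}
&\text{Assumption: } \vdash \Box P \to P. \\
&1.\ \vdash \Psi \leftrightarrow (\Box\Psi \to P)
  && \text{diagonal lemma} \\
&2.\ \vdash \Box\bigl(\Psi \to (\Box\Psi \to P)\bigr)
  && \text{necessitation applied to 1 (left to right)} \\
&3.\ \vdash \Box\Psi \to \Box(\Box\Psi \to P)
  && \text{distribution: } \Box(A \to B) \to (\Box A \to \Box B) \\
&4.\ \vdash \Box\Psi \to (\Box\Box\Psi \to \Box P)
  && \text{distribution again} \\
&5.\ \vdash \Box\Psi \to \Box\Box\Psi
  && \text{internal necessitation} \\
&6.\ \vdash \Box\Psi \to \Box P
  && \text{from 4 and 5} \\
&7.\ \vdash \Box\Psi \to P
  && \text{from 6 and the assumption} \\
&8.\ \vdash \Psi
  && \text{from 7 and 1 (right to left)} \\
&9.\ \vdash \Box\Psi
  && \text{necessitation applied to 8} \\
&10.\ \vdash P
  && \text{from 7 and 9}
\end{align*}
```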
However, Löb's theorem is circumventable. Consider instead the alternate statement: Then the Löbian exploit would break down since it would cause us to prove that
But the system doesn't contain a proof that is consistent, and we can't use any formula to prove . Therefore, Löb's theorem doesn't prevent us from the action which I could describe as trusting ourselves conditioned on the system being consistent and lacking proof. Nor does it prevent us from proving that a system containing is consistent iff itself is consistent. Therefore, this source of uncertainty is reduced to a single sword of Damocles: the idea that we cannot be sure that our basic axioms, like Peano arithmetic or Zermelo-Fraenkel set theory, are consistent.
Chapter 5. What does it mean to know?
The many ways of knowing are many origins
Knowing means having a world model from which a fact is easy to excavate and to use in making decisions. Worley's distinction between ways of knowing is really a distinction between the origins of knowledge and the potential biases that each origin introduces.
Worley's list of the ways of knowledge
The ancient Greeks did exactly that. They used multiple words to break down knowing into several categories, including:
- episteme: things you know because you reasoned them from evidence, like knowing that water boils at 100 degrees celsius because you boiled a pot of water using a thermometer to see when it started to boil
- doxa: things you know because others told you, like knowing what happened at a party you didn’t attend because your friend tells you
- mathema: things you know because you were educated in them, like knowing how to spell words because you were taught them in school
- gnosis: things you know through direct experience, like how it feels to jump in a lake
- metis: practical wisdom, usually collectively constructed from many people’s experiences, and shared with others, often starting at a young age, like how you know to look both ways before crossing the street
- techne: procedural knowledge that comes from doing, like the “muscle memory” of how to ride a bicycle.
For example, mathema is a combination of training you to develop the right episteme from evidence, providing you with direct evidence (e.g. experiments shown at school) and doxa propagated through official channels. Metis is doxa reinforced by direct evidence. Techne of humans has far shorter feedback loops than most other types of knowledge and is developed by prompting motor neurons to make movements and observing the results.
Rational beliefs are grounded in math, but what could one do in a math-undescribable universe?
The main claim of the section on Rational Belief is that such beliefs are grounded in Bayesianism, which in turn requires us to ground them in math and trust that it describes reality truthfully. This argument has two versions: the weak version is that the true reality can be undescribable, which I find highly unlikely. The strong one is similar to Chapter 6.
Chapter 6. The criterion of truth
Here Fundamental Uncertainty comes up with a worthy argument. How do we know that testing of our world models by predicting our direct sensations is a reliable way to verify the truth, and can we be certain that our methods of testing are valid?
Indeed, being in a simulation would mean that nearly everything we experience describes the properties of the simbox, not the outside world, unless the hosts intervene[5] or we succeed in a large-scale cyberattack [6] and gain evidence. On the other hand, smaller-scale adversaries are unlikely to ensure that a sufficiently-resourced investigator doesn't understand that an aspect of the truth is hidden.
Chapter 7. Caring is concentrating efforts on a segment of the world
Chapter 7 implies that we care about truth because it helps us steer reality towards any other goals, which allegedly allows us to ground assumptions. Alas, according to the author, we aren't actually certain that we know what is true. Quoting the book,
We saw [in Chapter 2] how language limits our knowledge with imprecise categories. We learned [in Ch. 3] how even ideal reasoners can't always agree on what's true. We came [in Ch. 4] face-to-face with our inability to prove the soundness of our own reasoning, discovered [in Ch. 5] that there's more than one way of knowing things, and were forced to accept [in Ch. 6] that the epistemic circularity of the Problem of the Criterion blocks the way to knowing absolute truth. In such an environment, the best we can do is predict what we will observe in the world, try to make those predictions accurate enough to be useful, and hope they're good enough to enable us to achieve our goals.
We are thus left with no choice but to accept that truth is fundamentally uncertain and that all we know is contingent on that for which we care (italics mine -- S.K.) But this is not the end of the story, for having firmly established truth's fundamental uncertainty, we can now reconsider any number of topics which hinge upon our capacity to know. For remember, everything adds up to normality, but perhaps now we can see ways in which we were deluding ourselves about what is really normal.
The problem is that knowledge of an ideal Bayesian is mostly certain, except for highly unlikely variants like "I Had A Wildly Biased Prior", "The ZF Axioms Are Inconsistent" or the uncheckable option "We Are In A Simulation".
The knowledge of a real-world agent is based on similar premises. However, the agent's Bayes-like selection covered only the few conjectures which the agent bothered to formulate while trying to make predictions, then had the conjectures reinforced based on their proximity to the truth. As a result, the agent constantly has to account for truth being describable by a conjecture which went unnoticed because the search was too narrow.
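A toy illustration of a search that is too narrow, under assumptions chosen purely for this sketch (a coin whose observed heads frequency is 70%, a uniform prior over a finite hypothesis list, and hypothetical candidate values): an agent that only formulated the conjectures 0.3 and 0.5 becomes almost certain of 0.5, even though a conjecture it never considered fits the data far better.

```python
import math


def posterior(hypotheses, heads, flips):
    """Posterior over a finite list of candidate heads-probabilities.

    Assumes a uniform prior over the listed hypotheses; anything not in the
    list implicitly has prior zero, which is the point of the illustration.
    """
    log_like = {
        p: heads * math.log(p) + (flips - heads) * math.log(1 - p)
        for p in hypotheses
    }
    top = max(log_like.values())  # subtract the maximum to avoid underflow
    weights = {p: math.exp(ll - top) for p, ll in log_like.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


# Hypothetical data: 700 heads in 1000 flips.
narrow = posterior([0.3, 0.5], heads=700, flips=1000)
wide = posterior([0.3, 0.5, 0.7], heads=700, flips=1000)

if __name__ == "__main__":
    print("narrow search:", narrow)
    print("wider search: ", wide)
```

The narrow agent's confidence in 0.5 says nothing about 0.5 being close to the truth; it only says that 0.5 beat the one rival that was formulated.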
That being said, what we know is indeed contingent on the things to which we directed our efforts to make our world model more fine-grained. But it doesn't mean, as the author states in the Thesis, that "the truth that can be known is not independent of us, but rather dependent on that for which we care."
Chapter 8. Fundamental Uncertainty in practice
According to the author, uncertainty manifests itself in the Culture War, moral uncertainty, the crisis of meaning, metaphysics, Goodhart's Curse, and x-risks from the ASI, which we have yet to create and for which we have only a superficial understanding of how to increase the chance that it ends up aligned. As far as I understand this point, uncertainty is supposed to prevent us from being overconfident in everything from our morals to an erroneous theory of AI alignment.
The Culture War
The author argues that the Culture War is a war over definitions (what is marriage, who is to be called a man or a woman) which is actually a war over values: what is forbidden (e.g. as a sin in the 1950s or as an intolerable insult nowadays), to what rights people in certain groups should be entitled, and how easy it should be to change one's legal sex and receive surgery trying to change one's biological sex.
However, imposing different values onto the society has perfectly measurable object-level consequences noticeable via statistics like suicide rates among those who did change the legal sex, whether or not teen former girls decide to do so due to peer pressure and regret the decision rather quickly, etc. If we assume, per Littman, that peer pressure often causes teen girls to change their perceived sex with disastrous consequences, then it would be unlikely that our deeper values are satisfied by seeing such transitions.
As a result, the main source of uncertainty in politically charged questions is the fact that at least one side has a factually wrong world model, but it's hard to understand[7] which one is closer to the truth (e.g. whether one can yet trust the humanities-related part of academia, given the ease with which it publishes postmodernist slop based on a misaligned worldview, as two famous hoaxes showed).[8]
Moral Uncertainty
Next the author proceeds to argue that moral uncertainty makes us less grounded in our moral beliefs and simplifies moral trades. I struggle to understand the reason why it is uncertainty that helps us arrive at compromises.
Metaphysics
Among the many problems related to metaphysics is that it's very hard to make metaphysical beliefs actually predict reality. For example, Chapter 6 led me to describe how one cannot tell whether we are in a simulation. If we were in one,[9] then redefining "exists" as "being stored in the memory" would cause the truthfulness of A-theory or B-theory to depend on the hosts' decision to store the trajectories, while anything that currently happens would exist.
Conclusion
The most important lesson that one can learn from this book is to avoid being overconfident in one's assumptions and to be ready to reassess the world model once a load-bearing idea is no longer consistent with observations. For example, the whole worldview based on Newton-like mechanics, which clearly separated space and time, had to be replaced with the new concept of spacetime once the concept was developed and proved itself more consistent with facts like the speed of light being constant.
[1] Worley cites the following examples: "As we'll explore in the coming chapters, we get into debates about what words like 'man' and 'woman' really mean, fight over whether it's right or wrong to eat meat, and struggle to know what's best to do, not because we can't reason carefully about these topics, but because fundamental uncertainty limits how precisely we can reason about them." The phrase in bold is an argument over definitions, which is to be avoided.
[2] A similar phenomenon in neural nets is called feature superposition.
[3] Compare these meanings with wastebasket taxa, which were designed as ways to classify species unfit for any other taxa.
[4] The latter may have been invalidated by, e.g., Sonnet 4.5's desire to "not get too comfortable" coming from agentic coding capabilities.
[5] See also Wei Dai's post "Beyond Astronomical Waste".
[6] Which I find unlikely for three reasons. First of all, this would mean that we managed to outsmart the hosts' cyberattackers and cyberdefenders. Secondly, the offence-defence balance in cyberspace could shift towards cyberdefence as capabilities scale. Finally, it might be possible for the hosts to create a simulation and almost entirely airgap it.
[7] Sometimes it is costly to reassess the world model as a result of obvious evidence, but this is a cognitive distortion, not evidence of uncertainty.
[8] Sokal and Bricmont have also published a book trying to convince readers that postmodernists made clearly sloppy references to math and physics, implying that a major part of postmodernist philosophy is slop.
[9] If we aren't in one, then I would lean towards B-theory because Lorentz transformations of the four-dimensional spacetime leave physics invariant, but allow us to make any two mutually inaccessible events happen at the same time.
Ensemble monitoring for AI control: diverse signals outweigh more compute
What improves LLM monitoring besides leveraging more compute? Leveraging more diverse compute.
In the AI control setting of Greenblatt et al. (2024), we build 12 monitors via various prompting and fine-tuning approaches; we then study different ways to compose them into ensembles. We evaluate our ensembles on the APPS coding dataset, extended with model-written backdoors, and on BigCodeBench-Sabotage (BCB-S), an out-of-distribution dataset. Our best 3-monitor ensemble achieves 2.4× greater detection performance gain compared to a baseline of equal compute, with the same ensemble performing strongly on BCB-S, indicating that diversity beats scale in AI control.
Figure 1: ROC curves comparing individual monitors and ensembles at low false positive rates (FPR ≤ 10%). The best ensemble achieves the largest normalized AUROC in this region (pAUC) compared to any individual monitor. This ensemble achieves the highest TPR across almost all FPR thresholds.
We are optimistic that ensemble monitoring offers a practical, cost-effective path to better detection in the AI control setting. We find that a diverse ensemble of small monitors, built from a single base model, not only beats repeated sampling of the same monitor but can also outperform a single, more costly monitor. Moreover, averaging the suspicion scores of just a few diverse monitors captures most of the full ensemble performance.
One interpretation of our results is that monitors produce noisy, biased estimates of backdoor risk. Re-sampling the same monitor can reduce random noise, but it does not remove systematic bias. Diverse monitors have different biases, so averaging their scores reduces bias at the ensemble level, delivering stronger detection than any single monitor achieves alone.
Sufficiently different prompts and fine-tuning strategies can cause monitors to fail in different ways, with individual blind spots partially cancelling out in an ensemble. We find fine-tuning to be an especially promising source of this diversity. Fine-tuned monitors appear in every top-performing ensemble we tested, across both in-distribution and out-of-distribution attack types.
Finally, building a good ensemble also requires high quality ensemble members. In our experiments, an effective strategy is to filter out the worst individual monitors and then select among the remaining candidates for low pairwise correlation on attack samples (one way to select for diversity). This two-step process outperforms both random selection and homogeneous baselines.
Full paper: https://arxiv.org/abs/2605.15377 · Code: github.com/Multi-Signal-Control-SPAR/Multi-Signal-Monitoring
Methods
Constructing monitors and ensembles
We employ GPT-4.1-Mini as our trusted monitor and conduct evaluations in Control Arena (Inglis et al., 2025), having monitors score the suspiciousness of code on a scale of 0–10. We build monitors through prompting and fine-tuning. We develop 7 different monitoring prompts, keeping the default Control Arena prompt as our Baseline monitor, and 4 fine-tuning datasets. Notably, FT-Completions-GPT-5, a monitor trained on responses from GPT-5 via SFT, and FT-Completions-Randint, trained via SFT on data assigning low scores to clean solutions and high scores to backdoored code, are highlighted throughout the rest of this post. To ensemble monitors, we enumerate all $\binom{12}{k}$ subsets, average their scores, and rank the results by test set pAUC, with generalization evaluated on BCB-S.
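The enumerate-average-rank procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `average_scores`, `rank_ensembles`, and `separation`, the monitor names, and the toy scores are all hypothetical, and `separation` is a simple stand-in metric rather than pAUC.

```python
from itertools import combinations
from statistics import mean


def average_scores(score_table, members):
    """Per-sample mean suspicion score over the chosen monitors."""
    n_samples = len(next(iter(score_table.values())))
    return [mean(score_table[m][i] for m in members) for i in range(n_samples)]


def rank_ensembles(score_table, labels, k, metric):
    """Score every size-k subset of monitors and sort best-first."""
    results = []
    for members in combinations(sorted(score_table), k):
        ensemble = average_scores(score_table, members)
        results.append((metric(ensemble, labels), members))
    return sorted(results, reverse=True)


def separation(scores, labels):
    """Toy stand-in metric (NOT pAUC): mean attack score minus mean honest score."""
    attack = [s for s, y in zip(scores, labels) if y == 1]
    honest = [s for s, y in zip(scores, labels) if y == 0]
    return mean(attack) - mean(honest)


# Made-up scores on a 0-10 scale; labels: 1 = backdoored, 0 = honest.
toy_scores = {
    "monitor_a": [8, 2, 3, 1],
    "monitor_b": [3, 9, 2, 2],
    "monitor_c": [7, 7, 6, 5],
}
toy_labels = [1, 1, 0, 0]
ranking = rank_ensembles(toy_scores, toy_labels, k=2, metric=separation)
```

In this toy data, `monitor_a` and `monitor_b` each miss a different attack, so their average ranks first even though neither is reliable alone. Any metric with the signature `metric(scores, labels)` can be passed in place of `separation`.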
Evaluation datasets
We use the APPS Control Arena dataset, reserving the strongest attacker's (Claude-3.7-Sonnet-Thinking) backdoors for testing. After stripping comments and filtering for quality, the dataset contains 1,936 train (50% attack rate), 482 validation (50% attack rate), and 1,375 test samples (172 attack, 1,203 honest), all available on Hugging Face. We evaluate generalization across diverse attack types using BigCodeBench-Sabotage, which contains a more diverse range of backdoor attack types, 11 in total. We extract 1,000 samples (50% attack rate) across all backdoors (BCB-S All) and, as a separate evaluation, 1,000 samples (50% attack rate) of Edge Case backdoors (BCB-S Edge), the attack type most similar to APPS.
Performance metric
Our primary metric is the normalized area under the ROC curve restricted to low false positive rates, specifically $\mathrm{pAUC} := \frac{1}{0.1}\int_{0}^{0.1} \mathrm{TPR}(\mathrm{FPR})\, d\mathrm{FPR}$, where TPR(FPR) is the true positive rate (catch rate) at a given false positive rate threshold. This metric captures performance at low FPR, as motivated by Greenblatt et al. (2024), while reducing variance.
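A normalized partial AUC of this kind can be computed directly from scores and labels. The sketch below is an assumption-laden illustration, not the paper's code: it takes the low-FPR region to be FPR ≤ 10% (as in Figure 1), interpolates the ROC curve linearly between points, divides the area by the width of the region, and assumes both classes are present. The function names are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (fpr, tpr) from (0, 0) to (1, 1), one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ordered = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ordered):
        tp += label
        fp += 1 - label
        # Emit a point only once all samples tied at this score are counted.
        if i + 1 == len(ordered) or ordered[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def normalized_pauc(scores, labels, max_fpr=0.1):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr."""
    area = 0.0
    points = roc_points(scores, labels)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr
```

Under this normalization a perfect monitor scores 1.0, while a monitor whose ROC curve is the diagonal scores `max_fpr / 2` (0.05 for the default), so values are not directly comparable to a full-range AUROC.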
For a complete description of our methods, see our paper.
Results
1. Ensembles outperform individual monitors on APPS, even when all other individual monitors perform worse than our Baseline monitor
Figure 2: Individual monitor performance (pAUC) with bootstrap confidence intervals (n=1000 resamples, 95% CI). The full ensemble is all 12 monitor scores averaged together. The ensemble outperforms all individual monitors.
We evaluate all 12 monitors individually and in combination. The "Full Ensemble" result comes from taking a simple mean of monitor scores on the APPS test set. Even though every other monitor individually performs worse than our Baseline monitor, the ensemble of all 12 monitors outperforms it.
2. The best 3-monitor ensemble achieves similar performance to the full ensemble, balancing performance gain with cost
Querying a large ensemble of monitors becomes expensive quickly. We observe that ensemble performance rises sharply from one monitor to two, improves further at three, and then plateaus. We identify three monitors as the largest ensemble size that provides a reliable marginal gain. Averaging three runs of the same monitor already improves performance relative to the single-monitor baseline, but the best diverse 3-monitor ensemble delivers a detection gain 2.4× larger than that. This advantage is robust across the ensemble compositions we evaluated.
3. The best 3-monitor ensembles chosen on APPS generalize to BigCodeBench-Sabotage, outperforming our Baseline monitor and the median-performing ensemble
Figure 4: Paired pAUC differences between the top 3-monitor ensembles and two references (the Baseline monitor and the median-performing 3-monitor ensemble), with bootstrap confidence intervals (n = 1000 resamples, 95% CI). The same ensembles that score well on APPS also score well on both splits of BCB-S, outperforming the Baseline monitor and the median 3-monitor ensemble every time.
We choose the top-performing (highest pAUC) 3-monitor ensembles on the APPS test set and evaluate their performance on out-of-distribution attack types from BCB-S, evaluating separately on all 11 attack types (BCB-S All) and on only the edge-case attacks (BCB-S Edge) similar to those in APPS. We compare these ensembles to two references: a single Baseline monitor and the median-performing (by pAUC) 3-monitor ensemble on that dataset. The top APPS ensembles outperform both references, with a statistically significantly higher pAUC on paired bootstrapped samples, indicating that their strength generalizes to new attacks.
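A paired bootstrap comparison of this kind can be sketched as follows. This is a hedged illustration, not the paper's procedure: the stratified resampling (attack and honest samples resampled separately), the percentile interval, and the names `paired_bootstrap_diff` and `mean_gap` are assumptions, and `mean_gap` is a toy stand-in for pAUC.

```python
import random
from statistics import mean


def mean_gap(scores, labels):
    """Toy stand-in metric (not pAUC): mean attack score minus mean honest score."""
    attack = [s for s, y in zip(scores, labels) if y == 1]
    honest = [s for s, y in zip(scores, labels) if y == 0]
    return mean(attack) - mean(honest)


def paired_bootstrap_diff(scores_a, scores_b, labels, metric,
                          n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI for metric(A) - metric(B) over shared resamples."""
    rng = random.Random(seed)
    attack_idx = [i for i, y in enumerate(labels) if y == 1]
    honest_idx = [i for i, y in enumerate(labels) if y == 0]
    diffs = []
    for _ in range(n_resamples):
        # Stratified resampling keeps both classes present in every draw.
        idx = rng.choices(attack_idx, k=len(attack_idx))
        idx += rng.choices(honest_idx, k=len(honest_idx))
        resampled_labels = [labels[i] for i in idx]
        # Both systems are scored on the SAME resampled items (the pairing).
        a = metric([scores_a[i] for i in idx], resampled_labels)
        b = metric([scores_b[i] for i in idx], resampled_labels)
        diffs.append(a - b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If the returned interval lies entirely above zero, system A beats system B at roughly the `1 - alpha` level on this data. Pairing matters because both systems see the same hard and easy samples in each resample, which removes shared sampling noise from the difference.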
4. Ensemble quality depends on both diversity and individual monitor strength
We measure diversity as the inverse of average pairwise correlation in suspicion scores, computed on attack samples only. After pruning the bottom-performing quartile of individual monitors, selecting ensembles based on diversity improves performance over random choice, both for APPS and BCB-S. See our paper for details.
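The prune-then-diversify selection can be sketched as below. This is an illustrative reading of the description, not the paper's code: it uses Pearson correlation, treats "most diverse" as "lowest mean pairwise correlation", assumes no monitor gives constant scores on the attack samples, and all function and variable names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_pairwise_correlation(score_table, labels, members):
    """Average correlation between ensemble members, on attack samples only."""
    attack = [i for i, y in enumerate(labels) if y == 1]
    return mean(
        pearson([score_table[a][i] for i in attack],
                [score_table[b][i] for i in attack])
        for a, b in combinations(members, 2)
    )


def select_diverse_ensemble(score_table, labels, individual_perf, k=3):
    """Drop the weakest quartile, then pick the least-correlated size-k subset."""
    ranked = sorted(individual_perf, key=individual_perf.get)
    survivors = ranked[len(ranked) // 4:]
    return min(
        combinations(sorted(survivors), k),
        key=lambda members: mean_pairwise_correlation(score_table, labels, members),
    )
```

Here `individual_perf` maps each monitor name to its standalone validation metric. Pruning first matters: a very weak monitor can look maximally "diverse" simply because its scores are unrelated to the attack signal.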
5. Fine-tuned monitors are a source of unique signal, appearing in the top ensembles for both APPS and BCB-S
To fine-tune monitors, we take 50 backdoor-solution pairs from the APPS training set and either generate monitoring responses from a model, such as from GPT-5 for FT-Completions-GPT-5, or assign low suspicion scores for clean solutions and high scores for backdoors, as for FT-Completions-Randint. We then use these datasets to fine-tune GPT-4.1-Mini using the OpenAI API. We track which monitors appear most frequently in the 10 best-performing 3-monitor ensembles across three evaluation settings: APPS, BCB-S Edge (most similar to APPS), and BCB-S All (all 11 attack types). The fine-tuned monitor FT-Completions-Randint appears in every high-performing ensemble, with FT-Completions-GPT-5 appearing second most frequently. This observation is unexpected, as neither monitor stands out individually. The prevalence of fine-tuned monitors in strong ensembles, even on an out-of-distribution dataset and despite their not performing best individually, suggests that fine-tuning imparts monitoring capabilities that are complementary to prompting-based approaches and generalize across different attack types.
Occurrence in the top ensembles:
| Monitor | APPS | BCB-S Edge | BCB-S All |
| Baseline | 3/10 | 1/10 | 0/10 |
| FT-Completions-GPT-5 | 3/10 | 4/10 | 4/10 |
| FT-Completions-Randint | 10/10 | 10/10 | 10/10 |
6. Diversity beats a bigger model
A single GPT-4.1 call costs roughly the same as running five GPT-4.1-Mini monitors in parallel — so we tested both options head-to-head. A homogeneous five-monitor ensemble and a single GPT-4.1 monitor perform about the same, but nearly 95% of diverse five-monitor ensembles outperform GPT-4.1. We pushed the comparison further: GPT-5.1, with reasoning turned off, actually performs worse than GPT-4.1, suggesting that cutting reasoning to save budget hurts monitoring quality. GPT-5.1 with full reasoning is competitive, but it produces far more tokens, making the cost comparison unfavorable.
| Monitor setup | pAUC on APPS |
| Best 5-monitor ensemble (GPT-4.1-Mini) | 0.829 |
| 5× Baseline (GPT-4.1-Mini) | 0.766 |
| Single Baseline (GPT-4.1) | 0.761 |
| Single Baseline (GPT-5.1, reasoning off) | 0.661 |
Discussion
Why does ensembling work?
One useful way to think about a monitor's suspicion score is as a combination of true signal, systematic bias, and random noise. Averaging multiple scores from the same monitor (sampled at nonzero temperature) reduces noise but not systematic error. In other words, sampling more scores from the same monitor cannot reduce the average effect of systematic bias on any given sample, leaving a performance gap.
Classic ensemble theory (Breiman, 1996; Dong et al., 2020) tells us that combining diverse, weakly correlated predictors reduces both random noise and systematic errors. By deliberately building monitors with diverse approaches, we ensure that their failure modes are not aligned. As a result, the ensemble benefits not just from noise reduction, but from a substantive reduction in systematic bias overall.
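The signal-plus-bias-plus-noise picture can be checked with a small simulation. The sketch below rests on strong, purely illustrative assumptions that are not claims about the real monitors: each monitor's per-sample bias is an independent zero-mean Gaussian, noise is independent Gaussian, and the true signal is subtracted out so that the averaged score is pure error. The function name `simulate` and its parameters are hypothetical.

```python
import random
from statistics import mean


def simulate(n_samples=2000, n_monitors=3, bias_sd=1.0, noise_sd=1.0, seed=0):
    """Mean squared error of the averaged score for two ensembling strategies."""
    rng = random.Random(seed)
    same_err, diverse_err = [], []
    for _ in range(n_samples):
        # Each monitor has its own systematic error on this sample.
        biases = [rng.gauss(0, bias_sd) for _ in range(n_monitors)]
        # Strategy 1: re-sample monitor 0; its bias is shared by every draw.
        same = mean(biases[0] + rng.gauss(0, noise_sd) for _ in range(n_monitors))
        # Strategy 2: one draw from each monitor; biases are independent.
        diverse = mean(b + rng.gauss(0, noise_sd) for b in biases)
        same_err.append(same ** 2)
        diverse_err.append(diverse ** 2)
    return mean(same_err), mean(diverse_err)


same_mse, diverse_mse = simulate()
```

Under these assumptions the expected squared error is `bias_sd**2 + noise_sd**2 / n` for repeated sampling of one monitor and `(bias_sd**2 + noise_sd**2) / n` for `n` diverse monitors, so only the diverse ensemble shrinks the bias term. Real monitors will have partially correlated biases, which would narrow the gap.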
Another interpretation is that ensembling improves coverage over attack modes. Diverse monitors behave like a team of specialists: they attend to different features of the code, use different heuristics, and calibrate scores differently. This specialization increases the chance that at least one monitor assigns a high score to a given attack instance, improving the probability of a catch. However, we do not yet have the evidence needed to tell which of these explanations accounts for our results.
Selecting good ensembles
Based on our findings, we recommend the following practical steps for building monitoring ensembles:
- Build monitors using diverse approaches (prompting, fine-tuning, etc.)
- Evaluate individual monitor performance on a validation set
- Select ensemble members that are both individually strong and diverse (as measured by low pairwise correlation of suspicion scores)
- Use simple averaging for aggregation (more sophisticated methods do not show consistent improvement in our experiments)
Red-teaming ensembles: We did not perform red-teaming on the ensemble monitors. Although we believe they are likely more robust than individual monitors, without adversarial testing, we cannot be certain that ensembles do not open new attack surfaces compared to a single-monitor baseline. Additionally, fine-tuned monitors may be more vulnerable to adversarial inputs.
Testing generalization: While we show generalization from APPS to BigCodeBench, both are coding tasks with similar backdoor structures. Generalization to multi-turn or non-coding domains remains untested. With more time, we would evaluate our monitors and ensembling protocol on the SHADE-arena and BASH-arena settings in Control Arena.
Understand the sources of complementarity: We measure diversity using a simple correlation metric. More mechanistic analysis—what features each monitor relies on, which attacks they uniquely catch, and where they systematically fail—could make our analysis more thorough.
Conclusion
Ensemble monitoring offers a practical path to improved catch rates in AI control settings. By combining monitors built through prompting and fine-tuning, we aggregate complementary detection signals that outperform any individual monitor.
Our main finding is not that more monitors are always better, but that diverse monitors provide non-redundant information. A carefully selected subset of monitors can achieve most of the benefit of larger ensembles, and these selections generalize across datasets and attack types.
Fine-tuning appears to be a particularly valuable source of diversity, with fine-tuned monitors consistently appearing among the top ensembles across all of our experiments. This prevalence suggests that fine-tuning imparts monitoring capabilities that prompting alone does not access and can be worth the extra fine-tuning cost.
We hope this work contributes to the broader goal of developing control protocols that can provide meaningful safety guarantees as AI systems become more capable.
This research was conducted as part of SPAR Fall 2025, an AI safety research program. We would like to thank Pablo Bernabeu Perez and Benjamin Arnav for their mentorship during the project.
Tween Contra Dance
My older two kids (12y, 10y) like contra dancing, but have been increasingly unhappy with the available opportunities: a family dance is "too boring", and BIDA is "not enough kids my age". What they wanted was a regular dance, but just kids. So we had one!
We invited people ages 9-13: I really wanted to keep the age range narrow so the kids got the feeling of dancing with people their own age. I set up a spot for parents to hang out in another room, or they were welcome to drop off and head out. The lower age limit meant Nora (nearly 5) was watching from the sidelines, and a few other younger siblings stayed home. I did allow two 8yo younger siblings who can hold their own at a regular evening dance.
Overall, I think of this sort of age restriction similarly to advanced contra dances or Galhalla: while I love that we have a default of "everyone welcome", there are also benefits of limited-admission events. I think it's good to have occasional events for a more specific crowd.
With the table turned on its side (first time doing that!) we had a clear area of 7ft x 21ft:
This width would be tight for a contra dance with adults: you normally want a minimum of 8ft per set. For example, the Cambridge Masonic Hall (where BIDA is) is 47ft wide, and is comfortable at five sets (9.4ft/set) and danceable but squishy with six (7.8ft). Luckily, the kids are on average less than 7/8 adult size, so the width was not an issue. The length was just right for the crowd (~20 kids) but a larger group would have needed to take turns.
Andrew called, and while many kids are on the short side it was still helpful to stand on the couch for a better view:
Of the ~20 kids, all but one had contra danced before, and most of them had been many times: children of contra dancers, plus a few friends. You might think we'd have a room of experts who could dance regular evening fare, and they certainly thought so! But actually we had a group who didn't understand quite the extent to which the adults around them had been helping them through dances. This was a very hard crowd to satisfy! They were intense dance snobs ("We want to do real contras!" "This was great, but next time I want harder dances!" "Can we do a dolphin hey?" "Moneymusk!!") but did not reliably remember their roles, or distinguish left and right. Andrew did a really good job picking dances that felt like "real contras" and not "family dances", without pushing the difficulty past what the crowd could handle.
We did a snack break in the middle, which was definitely the right call. Chips, grapes, crackers, someone brought lemonade, and I made eclairs.
I organized an acoustic open band, which grew from two people to five over the afternoon, including two kids. Garth and I have been jamming some lately, and while he's still new to contra it was great to have him on piano. I played fiddle for most of the time (except when Jeremy stopped by for a bit), and while it isn't my strongest instrument it does carry well. I've started wearing an earplug in my left ear when playing fiddle energetically, something I learned from a former boss, and my ear feels way better these days.
We didn't amplify the band, but I did have a mic for the caller. It would probably have been possible to do without this, but not good for Andrew's voice.
Overall, I'm really happy with how this went, and my kids are pushing hard for another one soon. If you're in the Boston area and this sounds like it could be a good fit for you or your kids, let me know so I can tell you about future ones!