Complexity Science as Bridge to Eastern Philosophy
[Cross-posted from my blog - subscribe if you like this]
I recently gave a talk exploring how complexity science can serve as an authentic bridge between hard science and Eastern contemplative traditions—recovering the philosophical depth that Western mindfulness has left behind. This is a cross-post from my main blog at [pchvykov.com](https://www.pchvykov.com/blog). Watch the full talk below or read on for the key ideas.
The Problem: Mindfulness Without Its Foundation
Modern mindfulness research has achieved something remarkable: it brought meditation into clinical settings, workplaces, and schools. But in doing so, it stripped away the philosophical bedrock that gives the practice its transformative power.
The original practice of sati (mindfulness in Buddhism) wasn't a standalone stress-reduction technique. It was a devotional practice—a way to remember and integrate deep philosophical teachings about the nature of reality, self, and interconnection. When Jon Kabat-Zinn imported mindfulness into Western medicine, he had to sacrifice much of this context for the scientific community to accept it at all.
The result? Bible study without the Bible. A practice orphaned from its roots.
We're now seeing the consequences: replication studies showing that isolated mindfulness interventions can produce negative effects—increased isolation, reduced social cohesion, a kind of spiritual bypassing where "I can solve all my problems myself." The practice works, but without proper context and theory, we're fumbling in the dark about how and when to use it.
The Solution: Complexity Science as the Missing Bridge
Here's my proposal: complexity science provides the rigorous, hard-science language needed to translate Eastern philosophical wisdom into terms the modern scientific community can validate and build upon.
The parallels are striking:
• Dependent origination ↔ Emergence
• Interconnectedness, wholism ↔ Network science, relational ontology
• Impermanence, change ↔ Chaos theory, dynamics
• Karma, causation ↔ Information-theoretic causality, feedback loops
• Emptiness (śūnyatā) ↔ Self-organization from simple rules
When I studied flocking behavior in my PhD work, I watched simulations where no individual bird directs the flock—yet the collective moves with apparent intelligence and agency. This emergence gave me an experiential understanding of Buddhist emptiness and dependent origination: entities that seem solid and independent are actually patterns arising from relationships and context.
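To make the flocking example concrete, here is a minimal sketch of a Vicsek-style alignment model (my own illustration, not the simulations from the talk; all parameters are arbitrary): each agent simply adopts the average heading of its neighbours plus some noise, and a coherent, leaderless flock emerges.

```python
# Minimal Vicsek-style flocking sketch (illustrative only, not the author's
# PhD simulation): each agent copies the average heading of its neighbours,
# plus noise. No agent directs the flock, yet coherent motion emerges.
import numpy as np

rng = np.random.default_rng(0)
N, L, R, speed, noise, steps = 200, 10.0, 1.0, 0.05, 0.3, 500

pos = rng.uniform(0, L, size=(N, 2))        # positions in a periodic box
theta = rng.uniform(-np.pi, np.pi, size=N)  # headings

for _ in range(steps):
    # pairwise displacements with periodic boundary conditions
    diff = pos[:, None, :] - pos[None, :, :]
    diff -= L * np.round(diff / L)
    neighbours = (diff ** 2).sum(-1) < R ** 2

    # align with the mean heading of neighbours (including self), plus noise
    mean_sin = (neighbours * np.sin(theta)).sum(1) / neighbours.sum(1)
    mean_cos = (neighbours * np.cos(theta)).sum(1) / neighbours.sum(1)
    theta = np.arctan2(mean_sin, mean_cos) + noise * rng.uniform(-np.pi, np.pi, N)

    pos = (pos + speed * np.column_stack([np.cos(theta), np.sin(theta)])) % L

# global alignment ("order parameter"): ~0 for random headings, ~1 for a flock
order = np.hypot(np.cos(theta).mean(), np.sin(theta).mean())
print(f"order parameter after {steps} steps: {order:.2f}")
```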
This isn't just philosophical musing. It has concrete research implications:
• Nature of self: Can we derive identity from network studies, understanding the self as emerging from relationships rather than as a fixed entity?
• Emergence engineering: Building systems (like swarm robots or xenobots) that adapt and self-heal, inspired by organic rather than mechanical paradigms
• Language and metaphor: Rigorously studying how all concepts—including scientific ones—are partial metaphors that capture only fragments of reality
Scientists as Shamans of Modernity
We're living through multiple crises: loneliness, meaning, ecological catastrophe. Scientists have become the "shamans of modernity"—the ones society turns to for truth and guidance. But if we only focus on objective, external knowledge, we fail to provide the wisdom about inner experience and values that our role demands.
My spiritual transformation eight years ago moved me from goal-chasing disappointment to genuine depth and meaning. That personal shift convinced me that spirituality, brought rigorously into science, is essential to addressing the meta-crisis. Not as religious dogma, but as validated frameworks for understanding consciousness, interconnection, and transformation.
Join the Exploration
This talk represents ongoing work in the Complexity Science and Contemplative Studies Community (CSCSC)—a collaborative effort to:
• Bridge hard sciences and contemplative traditions
• Develop rigorous theories of mindfulness grounded in complexity
• Turn research itself into a wisdom practice
I'm eager to hear your thoughts—where do you see the intersections between your work and these ideas? What questions does this raise for you?
Design sketches for a more sensible world
We don’t think that humanity knows what it’s doing when it comes to AI progress. More and more people are working on developing better systems and trying to understand what their impacts will be — but our foresight is just very limited, and things are getting faster and faster.
Imagine a world where this continues to be the state of play. We fumble our way to creating the most important technology humanity will ever create, an epoch-defining technology, and we’re basically making it up as we go along. Of course, this might pan out OK (after all, in the past we’ve often muddled through), but facing the challenges advanced AI will bring without a good idea of what’s actually going on makes us feel… nervous. It’s a bit like arriving in your spaceship on a totally unknown planet, and deciding you’ll just release the airlock and step outside to see what it’s like. There might be air on this planet — or you might asphyxiate or immediately catch a lethal alien pathogen.
If you survive, it’s because you were lucky, not because you were wise.
Now imagine a more sensible world. In this world, we get our act together. We realise how clueless we are, and that this is a huge risk factor — so we set about creating and harnessing tools, using AI as a building block, to help us to reason and coordinate through the transition.
Seamless tech helps us to track where claims come from and how reliable they are. Highly customised tools help individuals to understand themselves better and make decisions they endorse. We vastly uplift forecasting and scenario planning, massively improving our strategic awareness, and democratizing access to this kind of information. Coordination tools help large groups to get on the same page. Privacy-preserving assurance tech helps people trust agreements that are currently unenforceable, and negotiation assistance helps us find win-win deals and avoid the outcomes nobody wants.
There’s certainly still a lot of room for things to go wrong in this more sensible world. There might still be very deep value disagreements, and progress could still be very fast. But we’d be in a much better position to notice and address these issues. To a much greater extent, we’d be navigating the transition to even more radical futures with our eyes open, able to make informed choices that actually serve our interests. The challenges might still prove insurmountable! — but because that’s the nature of reality, not because of unforced errors or lack of awareness.
We think that a more sensible world should be achievable, soon — and that more should be done to help us get there.
This series of design sketches tries to envision more concretely how near-term AI systems could transform our ability to reason and coordinate. We’re hoping that these sketches will:
- Help people imagine what this sort of tech might look like, what the world would look like in consequence, and how big a deal that could be
- Encourage builders and makers to push ahead on directions inspired by these technologies
Of course these are just early sketches! We expect the actual technologies that make most sense could in some cases look quite different. But we hope that by being concrete we can help to kickstart more of the visioning process.
Below is an overview of all the technologies we cover in the series, grouped into some loose clusters.[1]
Collective epistemics
Tools for collective epistemics make it easy to know what’s trustworthy and reward honesty.
The technologies we discuss are:
- Community notes for everything so that content that may be misleading comes served with context that a large proportion of readers find helpful
- Rhetoric highlighting which automatically flags sentences which are persuasive-but-misleading, or which misrepresent cited work
- Reliability tracking which allows users to effortlessly discover the track record of statements on a given topic from a given actor; those with bad records come with health warnings
- Epistemic virtue evals so that people can compare state-of-the-art AI systems to find the ones that most reliably avoid bias, sycophancy, and manipulation
- Provenance tracing which allows anyone seeing data / claims to instantly bring up details of where they came from, how robust they are, etc.
‘Angels-on-the-shoulder’ are customised tools that help people make better decisions in real time, narrowing the gap between how well people could make decisions, and how well they actually do.
The technologies we discuss are:
- Aligned recommender systems which optimise for long-term user endorsement rather than short-term engagement
- Personalised learning systems which tailor content to a specific person’s needs and interests and intersperse it through their regular activities
- Automated deep briefings which work as an executive assistant to make sure that people have the relevant facts prepared for them for the decisions they’re facing
- Reflection scaffolding which acts as a Socratic coach, helping people to navigate tricky situations and understand themselves better
- Guardian angels which run in the background and flag in real time when someone’s about to do something they might regret
Tools for strategic awareness deepen people’s understanding of what’s actually going on around them, making it easier for them to make good decisions in their own interests.
The technologies we discuss are:
- Ambient superforecasting which allows people to run a query like a Google search, and get back a superforecaster-level assessment of likelihoods
- Scenario planning on tap so that people can explore the likely implications of possible courses of action and get analysis of different hypotheticals
- Automated OSINT which gives people access to much higher quality information about the state of the world
Coordination tech makes it faster and cheaper for groups to stay synced, resolve disagreements, identify coalitions, or negotiate to find win-win deals.
The technologies we discuss are:
- Fast facilitation which enables groups to quickly surface key points of consensus and disagreement, and make decisions everyone can live with
- Automated negotiation which discovers complicated bargains in minutes via automated negotiation on behalf of each party, mediated by trusted neutral systems which can find agreements based on confidential information
- Magic network connects people who should know each other (perhaps even before they know to go looking), enabling mutually beneficial trade, coalition building and more
Assurance and privacy tech allow people to verifiably share information with trusted intermediaries without disclosing it more broadly, or otherwise have greater trust in external processes. This can unlock deals and levels of transparency which are currently out of reach.
The technologies we discuss are:
- Arbitrarily easy arbitration which acts as a fast, cheap and neutral adjudicator of disputes
- Confidential monitoring and verification systems which act as trusted intermediaries, enabling actors to make deals that require sharing highly sensitive information, without disclosing the information directly
- Structured transparency for democratic accountability, allowing people to hold institutions to account in a fine-grained way, without compromising sensitive information
This article was created (in part) by Forethought. Read the original on our website.
- ^
A few caveats:
- We haven’t prioritised them
- We’re probably missing good ideas and including some that aren’t great
- The boundaries between clusters and between technologies are fuzzy, and we haven’t invested in them too much
- We’ve tried to lean into overly-specific versions of the tech, in the spirit of concreteness
Design sketches for angels-on-the-shoulder
We’ve recently published a set of design sketches for technological analogues to ‘angels-on-the-shoulder’: customized tools that leverage near-term AI systems to help people better navigate their environments and handle tricky situations in ways they’ll feel good about later.
We think that these tools could be quite important:
- In general, we expect angels-on-the-shoulder to mean more endorsed decisions, and fewer unforced errors.
- In the context of the transition to more advanced AI systems that we’re faced with, this could be a huge deal. We think that people who are better informed, more situationally aware, more in touch with their own values, and less prone to obvious errors are more likely to handle the coming decades well.
We’re excited for people to build tools that help this to happen, and hope that our design sketches will make this area more concrete, and inspire people to get started.
The (overly-)specific technologies we sketch out are:
- Aligned recommender systems — Most people consume content recommended to them by algorithms trained not to drive short-term engagement, but to meet long-term user endorsement and considered values
- Personalised learning systems — When people want to learn about (or keep up-to-date on) a topic or area of work, they can get a personalised “curriculum” (that’s high quality, adapted to their preferences, and built around gaps in their knowledge) integrated into their routines, so learning is effective and feels effortless
- Deep briefing — Anyone facing a decision can quickly get a summary of the key considerations and tradeoffs (in whichever format works best for them), as would be compiled by an expert high-context assistant, with the ability to double-click on the parts they most want to know more about
- Reflection scaffolding — People thinking through situations they experience as tricky, or who want to better understand themselves or pursue personal growth, can do so with the aid of an expert system, which, as an infinitely-patient, always-available Socratic coach, will read what may be important for the person in their choice of words or tone of voice, ask probing questions, and push back in the places where that would be helpful
- Guardian angels — Many people use systems that flag when they might be about to do something they could seriously regret, and help them think through what they endorse and want to go for (as an expert coach might)
If you have ideas for how to implement these technologies, issues we may not have spotted, or visions for other tools in this space, we’d love to hear them.
This article was created by Forethought. Read the full article on our website.
Eleven Practical Ways to Prepare for AGI
(Adapted from a post on my Substack.)
Since 2010, much of my academic research has focused on the roadmap to broadly superhuman AI, and what that will mean for humanity. In that line of work, I've had hundreds of conversations with ordinary folks about topics familiar here on LessWrong—especially existential risk, longevity medicine, and transformative automation. When I talk about such sci-fi sounding futures, people often respond something like: “Well that all sounds great and/or terrifying, but supposing you’re right, what should I do differently in my daily life?”
So I've compiled eleven practical ways I encourage people to live differently today if they believe, as I do, that AGI is likely to arrive within a decade. These probably won't be revolutionary for most in the LW community, but I offer them here as a potentially useful distillation of ideas you've been circling around, and as a nudge to take seriously the personal implications of short timelines. This can also serve as a bite-size accessible explainer that may be helpful for sharing these concepts with friends and family.
1. Take the Italy trip. As I’ve argued elsewhere, AGI means that the future will probably either go very well or very badly. If it goes well, you will probably enjoy much greater material abundance than you do today. So if you put off that family trip to Italy to save your money, that money will provide a much smaller relative boost to your quality of life in 2040 than if you took the trip today. And if AGI goes badly, you could be literally killed—an outcome well-known to make tourism impossible. Either way, take the trip now. This doesn’t mean you should max out all your credit cards and live a life of short-sighted hedonism. But it does mean that your relative preference for spending money today to saving it for decades from now should be a lot stronger than in worlds where AGI weren’t coming. Concretely, if you’re in your 30s or younger, you’ll usually be better off spending any dollar you make today than waiting to spend it after 2050.
2. Minimize your lifestyle risks. If you’re 35 and get on a motorcycle, you are—at least implicitly—weighing the thrill and the cool factor against the risk of losing about another 45 years of expected life. But AGI medical advances will let people live healthy lives far longer than we currently expect. This means that by riding the Harley you might be risking several times as many years as you intended. If that’s your greatest bliss in life, I’m not telling you to never do it, but you should at least consciously weigh your choices in light of future longevity. For Americans ages 15-44, about 58% of mortality risk comes from three causes: accidents, suicide, and homicide. You can dramatically cut your own risk by limiting risky behaviors: avoid motorcycles, don’t binge drink or do hard drugs, don’t drive drunk or drowsy or distracted, attend to your mental health, and avoid associating with or especially dating violent people. Yes, AGI also means that long-term risks like smoking are probably less deadly for young people than current statistics suggest, but smoking still hurts your health on shorter timescales, so please don’t.
3. Don’t rush into having kids. Many women feel pressure to have children by a certain age for fear they’ll be infertile thereafter. This often leads to settling for the wrong partner. In the 2030s, fertility medicine will be much more advanced, and childbearing in one’s 40s will be roughly as routine as for women in their 30s today. So Millennials’ biological clocks are actually ticking much slower than people assume.
4. Back up irreplaceable data to cold storage. As AI gets more powerful, risks increase that a sudden cyberattack could destroy important data backed up in the cloud or stored on your computer. For irreplaceable files like sentimental photos or your work-in-progress novel, download everything to storage drives not connected to the internet.
5. Don’t act as if medical conditions are permanent. Doctors often tell sick or injured people they will “never” recover—never see again, walk again, be pain-free again. AGI-aware decisionmaking treats medical “never” statements as meaning “not for 5-20 years.” Most paralyzed people middle-aged and younger will walk again. This also implies that patients should often prioritize staying alive over riskier treatments aimed at cures today. It also gives reasonable hope to parents considering abortion based on predictions that a disabled child will have lifelong suffering or debility.
6. Don’t go overboard on environmentalism. AGI or not, we all have an obligation to care for the earth as our shared home. Certainly be mindful of how your habits contribute to pollution, carbon emissions, and natural resource degradation. But AGI will give us much, much better tools for fighting climate change and healing the planet in the 2030s and 2040s than we have today. If you can give up a dollar worth of happiness to help the environment either today or a decade from now, that dollar will go a lot farther later. So be responsible, but don’t anguish over every plastic straw. Don’t sacrifice time with your family by taking slower public transport to trim your CO2 impact. Don’t risk dehydration or heat stroke to avoid bottled water. Don’t eat spoiled food to cut waste. And probably don’t risk biking through heavy traffic just to shrink your carbon footprint.
7. Wean your brain off quick dopamine. Social media is already rewiring our brains to demand constant and varied hits of digital stimulation to keep our dopamine up. AGI will make it even easier than today to get those quick hits—for example, via smart glasses that beam like-notifications straight into our eyes. If you’re a slave to these short-term rewards, even an objectively amazing future will be wasted on you. Now is the time to seek sources of fulfillment that can’t be instantly gratified. The more joy you find in “slow” activities—like hiking, tennis, reading, writing, cooking, painting, gardening, making models, cuddling animals, or having great conversations—the easier it will be to consume AGI without letting it consume you.
8. Prioritize time with elders. We know that our years with grandparents and other elders are limited, but the implicit pressure of our own mortality often pushes us to skip time with them in favor of other things that feel fleeting—job interviews, concerts, dates. If you expected to live to a healthy 200 due to longevity medicine, but knew that most people now in their 80s and 90s wouldn’t live long enough to benefit, you’d probably prioritize your relationships with them more than you do now. There’ll be plenty of time to hike the Andes later, but every moment with the people who lived through World War II is precious.[1]
9. Rethink privacy. There’s an enormous amount of data being recorded about you that today’s AI isn’t smart enough to analyze, but AGI will be. Assume anything you do in public today will someday be known by the government, and possibly by your friends and family. If you’re cheating on your spouse in 2026, the AGI of 2031 might scour social media data with facial recognition and find you and your paramour necking in the background of a Korean blogger’s food review livestream. It would be like what happened to the Astronomer CEO at the Coldplay concert last year, except for anyone in the crowd—no need to wind up on the jumbotron. And not only with facial recognition. The vein patterns under our skin are roughly as uniquely identifying as fingerprints, and can often be recovered from photos or video that show exposed skin, even if not obvious to the naked eye. So if you’re doing something you don’t want the government to tag you with, don’t assume you can stay anonymous on camera as long as your face isn’t visible.
10. Foster human relationships. When AGI can perform all the cognitive tasks humans can, the jobs most resistant to automation will largely revolve around human relationships. The premium will grow on knowing many people, and being both liked and trusted by them. Although it’s hard to predict exactly how automation will unfold, honing your people skills and growing your social circles are wise investments. But human relationships are also central to life itself. Even if AGI gives you material abundance without work, such as via some form of universal basic income, relationships remain essential to the experience of living. If you are socially isolated, AGI will give you endless entertainments and conveniences that deepen your isolation. But if you build a strong human community, AGI will empower you to share more enriching experiences together and come to know one another more fully.
11. Grow in virtue. In the ancient and medieval worlds, physical strength was of great socioeconomic importance because it was essential to working and fighting. Gunpowder and the Industrial Revolution changed all that, making strength largely irrelevant. In the modern world, intellect and skill are hugely important to both socioeconomic status and our own sense of self-worth. We’re proud of being good at math or history or computer programming. But when AGI arrives, everyone will have access to superhuman intelligence and capability, cheaper than you can imagine. In that world, what will set humans apart is virtue—being kind, being wise, being trustworthy. Fortunately, virtues can be cultivated with diligent effort, like training a muscle. The world’s religious and philosophical traditions have discovered numerous practices for doing this: volunteering and acts of service, meditation or prayer, fasting and disciplined habits, expressing gratitude, listening humbly to criticism, forming authentic relationships with people of different backgrounds, studying the lives of heroically virtuous people, and many more. Explore those practices, commit to them, and grow in virtue.
Answer in your head
Once upon a time I was in a very frazzled and vulnerable state. I was leaving a coffee shop and about to hop on my bike to head over to a bar to watch the Steelers game. A homeless guy named Joe[1] started talking to me.
He said I have a nice bike. I thanked him, said I love biking, and asked if he bikes. He said that he does but that his bike had recently been stolen which makes him sad. I mentioned that I’ve had bikes stolen too and we vented a bit about bike theft.
Then he confessed that he’s actually stolen bikes before. And furthermore, that he’s tempted right now to steal a bike. His old bike was stolen and he really could use a way of getting around. But he’s trying really hard to be a good person and doesn’t want to steal anymore.
I found this to be impressive and inspiring. I imagine that in a state as desperate as his it’d be really, really tempting to rationalize that bike theft is justified. And hey, maybe it is. Regardless, I know that I rationalize tons of things when it’s convenient and I’ve never been nearly as desperate as he is. So yeah: impressive and inspiring.
Anyway, he proceeded to confide in me about some of his life difficulties. I smiled and explained that despite surface appearances, he and I are actually not all that different. We looked at each other and I’m not sure which one of us took the initiative, but he ended up asking me if I want to talk, and I said yeah. It’s something I’ve always wanted to do.
Joe hustled over to the exterior seating of the coffee shop where he had all of his belongings. He started clearing space for me to sit. It was as if it was his living room and he had shit everywhere. As if he wasn’t expecting to have company and was trying to be a good host.
We sat down and started talking. He agreed to go first since I was a little nervous. He was strong.
He told me his life story. I won’t go into the details, but it’s a lot of stuff that you might imagine someone on the street to be struggling with, and more. It’s what Woody Allen would categorize as “the horrible”.
Then it was my turn. There was something I had been holding on to for a few weeks and itching to talk about. Something that was really driving me crazy. I considered confiding to Paul about it but it didn’t feel right. The pocket knife, dissociations, history of violence and antisocial personality disorder might have had something to do with it.
Anyway, there was something smaller yet still pretty huge that I was struggling with and wanted to get off my chest. It felt right, so I said it.
It’s a little dark and not something I want to elaborate on. Joe had a follow up question for me that kinda hit the nail on the head. It was something along the lines of, “What is the thing you feel such strong shame about?” He said it with an element of empathy and directness that felt like it was coming from a therapist.
I paused and thought for a few moments. He saw me thinking and said something that I’ll never forget: “You don’t have to answer out loud, you can answer in your head.”
It felt so perfect. Like, if you take the space of all possible responses Joe could have had, I dunno, that one has got to be pretty high up there. It felt very wise, which was kinda funny coming from someone who looked like they had been at a rave for the past 72 hours.
Anyway, that’s what I wanted to share. This idea that “answering in your head” is an option to be considered, particularly when facing hard questions. Most of the time the important thing is finding the truth and acknowledging it to yourself, not necessarily to the other person.
I’ve always kinda coupled the “what you say to the other person” part with the “what you say to yourself” part. In retrospect this is silly: they can be pretty easily decoupled. My having coupled them was some sort of cached thought, or behavior, I suppose. I’m grateful to Joe for busting that cache for me and am glad that I stopped by to talk with him.
- ^
Not his actual name.
Three visions for diffuse control
This post represents my own views, not the views of my employer.
Some people hope we can force misaligned AI systems to automate alignment research by controlling them. I'll call this the "diffuse control" plan. This plan has several unfortunate properties:
- It involves waiting to solve the alignment problem until later. Procrastination is a bad idea because:
- We don't know how much serial time the problem will take to solve.
- If the problem turns out to be extremely difficult, we might only realize this too late to coordinate and avert disaster. Or, we might not realize that the problem is difficult (until we are violently disempowered).
- We don't know what the situation will look like in the future--it could change in unfavorable ways. For instance, maybe in the future the leading AI project is not willing to spend resources on AI safety, in which case the plan of having AI systems solve alignment will likely fail.
- It involves delegating solving the alignment problem to an untrusted party.
- Often, when something really important needs to get done, it's best you take responsibility for doing it yourself.
- It seems like it might be pretty easy for an AI to just not try its best on doing alignment research, or to intentionally do the research ineffectively.
- Alignment research might be really hard to oversee. It might be hard to evaluate what constitutes progress.
Here are the other plans that I'm aware of for averting AI doom:
- Hope that misalignment isn't an issue.
- Shut it down right now and then solve alignment later (i.e., the MIRI plan).
- Wait until AIs are powerful and then:
- Catch the AI red-handed and shut down (i.e., high-stakes control).
- Demonstrate risk with dangerous capabilities evals and model organisms. Then, possibly with the assistance of controlled AIs, coordinate a pause.
- Use the AIs to develop military technology and sabotage other AI projects, and then solve alignment after that (i.e., do a "pivotal act").
- Solve alignment ourselves before people build powerful AI (e.g., ARC, agent foundations people). It's certainly ok under this plan if the humans get some AI assistance, but in the near term humans are mostly working on the problem themselves.
- Solve alignment iteratively, incrementally increasing the set of tasks where the AI is intent-aligned, using lots of AI assistance to perform this iterative approach (in fact, we'll hand off to the AI at some point). (This is probably called "prosaic alignment" or "scalable oversight").
In the rest of this short post I'll discuss several arguments for why the diffuse control plan might or might not be a reasonable approach to pursue in parallel with these other approaches, despite the distasteful aspects discussed earlier. I'll conclude by discussing what actions make the most sense under different arguments in favor of the diffuse control plan.
3 arguments for getting misaligned AIs to solve alignment
- Humans are too dumb to solve the scalable alignment problem. But we could maybe control and oversee smarter entities solving the alignment problem.
- Solving alignment requires a huge quantity of prosaic alignment research. If the size of the AI safety field were 1000x bigger, we wouldn’t bother with having AIs solve alignment for us. But it’s plausible that we can’t field-build quickly enough to 1000x the labor on the problem, so we should try to get AIs to provide this labor and hope that the leading AI company can spend a sufficient amount of compute on running automated AI safety researchers.
- Solving alignment is much easier if we can do fast iteration on actual AGIs. Some of this research may even be conducted by humans. But it’s going to be important to have good AI assistance in order to iterate fast enough. So we need to be able to control these AIs in case they’re misaligned.
These arguments all seem kind of plausible to me.
Implications of these arguments for the diffuse control plan
Implications of 1: If we think that "solving alignment" centrally looks like resolving some complicated conceptual and theoretical questions, rather than doing empirical prosaic ML research, then diffuse control research should probably work in settings similar to this.
Here are some ideas for research we could do:
- Maybe we should be trying to make debate (or similar flavor things) between misaligned AIs go better in domains that are very conceptually confusing.
- Maybe we should be iterating on eliciting good work out of AIs (or human-AI systems) in confusing domains like theoretical alignment research.
- Maybe we should be trying to understand the attack-surface here better.
When doing this, it’s good to do things which differentially advance safety progress rather than just doing general capabilities research.
Implications of 2: If the main plan is to do a large volume of prosaic alignment research on powerful AIs, then:
- It seems good to get better at building model organisms.
- It seems good to practice eliciting good prosaic alignment work from misaligned AIs.
- Especially the harder parts like coming up with good project proposals.
- Maybe we should try to write a bunch of project proposals that we hope will be automated later.
Having a strong bias against delegating and delaying solving important problems is generally good.
I previously hadn’t sufficiently priced this in, and so after thinking about this I’m more excited for scalable alignment moonshots, scalable oversight research and “trying to automate alignment research right now and seeing what the bottlenecks are” (although this is also complicated by having capabilities externalities).
Observations and Complexity
It may seem that there are two separate drives in the pursuit of truth.
The first one is observation - our theories are supposed to correspond to reality and, therefore, fit the observable evidence. A theory that fits the evidence is in some sense better than one that doesn’t.
The second drive is complexity. A simpler theory is in some (other?) sense better than a more complex one.
But suppose we have a simple theory that doesn’t fit some of the evidence and a more complex one that does. Which one is better? How much correspondence to evidence outweighs how much complexity?
Maybe no amount of complexity can outweigh a theory’s failure to fit the evidence? We use the complexity consideration only between two theories that fit the evidence equally well. And as soon as one theory is a better fit evidence-wise, we immediately side with it. This makes sense at first glance.
But consider a theory that is designed to fit any evidence. If we ignore all complexity considerations, there is nothing preventing us from designing one. For instance:
- Everything just randomly happened to be exactly the way it happened.
- God did it according to His will which is ultimately unknowable.
- Magic!
- Everything works according to the previously established model, except for the events that happened to contradict it, which are just unexplainable phenomena.
- And so on
Should we abandon the established framework of reductive materialism and switch to one of these theories every time we observe something slightly unexpected? Should we never have adopted reductive materialism in the first place, because there are always some as-yet-unexplained phenomena?
Wait, we know the answer to this one, don’t we? A good theory is supposed to predict the observations beforehand, not just explain them retroactively. Such theories do not predict anything, so they are very, very bad!
Okay, great, now we have a third parameter in our epistemology. Instead of just observations and complexity we have: predicted observations, unpredicted observations and complexity. How do all these three things work together? Maybe we are supposed to ignore all the evidence that wasn’t predicted?
Seems that the complexity of our epistemological theory is rising. Are we supposed to ignore it? Or should we take the hint and try to approach the problem from another side?
What do we even mean when we say a theory is good? It sounds suspiciously like rationality in terms of social norms instead of engines of cognition.
It has something to do with how likely the theory is to be true, or how close it is to the truth. In which case we are talking about some sort of measure function.
How do we judge what is more likely or closer to truth? Pretty much by looking. We look at reality with our sense organs, which were made by natural selection to be correlated with it.
Okay, sure, but isn’t it just the observation part? What about complexity? Where is it coming from?
Oh, but it’s about both of the parts. Your brain is also an organ produced by natural selection; it’s also part of reality. Reasoning about complexity is looking at the output your brain produces, based on the input data of your observations, about which theories systematically tend to be closer to the truth. Complexity is generalized observation.
Consider how we came to the modern notion of computational complexity in the first place. Occam’s Razor used to be a theistic argument. There were times when saying “God did it!” was universally considered extremely simple. And while nowadays there are still some people confused by it, the sanity waterline has risen tremendously. How did that happen, I wonder?
Nowadays we have the general machinery of probability theory, with the conjunction rule behind complexity:
P(AB) ≤ P(A)
And Bayes’ theorem describes the updating process due to observations:
P(A|B) = P(B|A)P(A)/P(B)
With this math we can see how complexity and observation counterbalance each other: every additional element of a theory adds some degree of improbability, which can only be reduced by observing evidence that this element is likely to be true. And vice versa: when an observation happens to contradict the predictions of our theory, we can always add an epicycle that explains away the contradiction. But this action is not free. Every new element of the theory adds extra complexity and increases the overall improbability.
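As a toy illustration of this trade-off (my own sketch, not part of the original post): give each theory a complexity prior of 2^(-description length in bits) and a likelihood for the observed data, and Bayes' theorem prices the epicycle automatically.

```python
# Toy illustration (my own sketch): complexity enters as a prior penalty of
# 2^-(description length in bits), evidence enters as a likelihood, and
# Bayes' theorem trades the two off against each other.
import math

def posteriors(theories, data):
    """theories: name -> (description_length_bits, per_observation_likelihood_fn)"""
    unnorm = {
        name: 2.0 ** (-bits) * math.prod(lik(x) for x in data)
        for name, (bits, lik) in theories.items()
    }
    z = sum(unnorm.values())
    return {name: p / z for name, p in unnorm.items()}

coin_flips = [1, 0, 0, 1, 1, 0, 1, 0]  # 1 = heads

theories = {
    # "Fair coin": short to state, assigns probability 1/2 to every flip.
    "fair coin": (10, lambda x: 0.5),
    # "It just happened that way": predicts the observed data perfectly, but
    # stating it requires writing the whole sequence down (1 extra bit per
    # flip), so the complexity penalty eats the gain in fit.
    "just happened": (10 + len(coin_flips), lambda x: 1.0),
    # An epicycle that adds bits without improving fit loses outright.
    "fair coin + epicycle": (30, lambda x: 0.5),
}

print(posteriors(theories, coin_flips))
```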
And so, the two apparent drives of epistemology are reduced to a single one. To the notion of (im)probability. This doesn’t solve all the epistemological confusions. But it sure helps with a lot of them.
A Perfect Resurrection
Personal identity is a topic that I associate with a big tangled ball of yarn—you have to spend a long time untangling it to get to something that resembles truth.
Well, I started doing this intellectual work and got stuck on one question that doesn’t seem to be discussed very often (at least, I haven’t encountered it in the exact framing I’m going to present here).
So, imagine we copy a person named Roger, and we do it with absolutely perfect accuracy—down to the quantum properties of every particle that makes up his body. Or, if you prefer, with any arbitrarily high accuracy.
Now we have two Rogers—R1 and R2. And it seems obvious that they do not share one and the same consciousness: they are two different consciousnesses of identical personalities.
(If this isn’t obvious to you, I’d genuinely like to hear how two brains, isolated from each other, could share a single consciousness. Obviously, I’m a physicalist.)
Because the copied person is located at a different point in space, by definition he:
• cannot be physically connected to the original person (R1)
• will instantly begin to have different experiences and will stop being an absolutely exact copy (since even R2 being 10 cm away from R1 is already a pretty significant difference in experience).
In our reality, if object A and object B occupy the exact same point, they are the same object.
So it’s impossible to place a copy at the same spatial address as the original. Doing that would mean doing nothing.
Now, we have two Rogers. And they do not share the same subjective experience. Then what grounds do we have to believe that a teleportation machine really “transports us,” rather than transporting “someone else”?
Yes, it’s a perfect copy. As I said, we can imagine it even preserves the properties of the elementary particles that make up Roger’s body. But that exact copy—which truly is Roger—still contains another, new consciousness.
Right before writing this post, I also read a post by Rob Bensinger where he discusses a similar situation.
If I understood Rob correctly, his view is that if R2 feels that he “is Roger,” and he really is a sufficiently accurate copy—then why should we care?
Rob says that if it turned out we die and are reborn every second, while still feeling like the same person, nothing would change.
He writes:
If you'd grown up using teleporters all the time, then it would seem just as unremarkable as stepping through a doorway.
If a philosopher then came to you one day and said "but WHAT IF something KILLS YOU every time you step through a door and then a NEW YOU comes into existence on the other side!", you would just roll your eyes. If it makes no perceptible difference, then wtf are we even talking about?
But I think we can push back on that perspective.
Imagine someone offers you a deal: you are forced to kill your entire family, but then your memories of it are erased and an illusion is created in which your loved ones are alive and happy. Would you agree to that?
Let’s consider another version of the question—at least to me they seem similar (oh god, and here come identity problems again!):
If Rob Bensinger were offered this: your body will be completely destroyed, but in exchange a being will be created with an incredibly stable conviction that it is Rob Bensinger, who never died and was never destroyed—what is the probability that the real Rob Bensinger would accept?
(In fact, I suspect that probability is high.)
You might want to say these examples aren’t the same. Sure, they aren’t identical—but I think they’re pointing at the same thing: for us humans, what often matters is what actually happened, not just how we experience or interpret the event. I think this even connects to the very idea of rationality.
In both examples, you can choose a “sweet lie” over a “bitter truth,” even though on the surface the examples really do look quite different.
I would not agree to be destroyed, even if I knew that an exact copy of me would exist, convinced that it never died. Because I know that my own subjective experience would end, and the experience of “another me” would begin.
So why am I asking this?..
Imagine a resurrection machine. Suppose it’s an unimaginably powerful, gigantic computer—something like a Matrioshka brain around WR-102. Based on indirect evidence, this machine has computed the exact structure of the brain of our deceased Roger (let’s call him R1D1, where D stands for Dead).
The machine recreates an exact copy of Roger, R2D2. Yes, it really is him! There’s no doubt it’s the same personality.
But how can I be confident that… it’s the same conscious experience?
That it’s the same consciousness that ceased to exist at the moment R1D1 died?
I call this idea “perfect resurrection,” because we could stop at resurrecting a sufficiently similar, or even a perfect, copy of Roger. But what interests me (almost purely in a philosophical sense) is whether it’s possible to bring back the original consciousness itself, rather than a “second instance” of that consciousness.
So that Roger would say:
“Oh wow, what a day. I felt terrible, and now I woke up completely healthy. What the hell?”
And it would be literally true.
So that this wouldn’t just be a special case of my earlier proposal: “destroy Rob Bensinger and create a being that’s convinced it is Rob and never died.”
Perhaps—even hypothetically—there can be no way to verify whether it is “the same” consciousness. Perhaps I made some serious mistake in my reasoning.
In any case, I hope these thoughts will give many of you an opportunity to exercise your imagination and thinking. And of course, I’ll be doing that along with you.
The Two-Board Problem: Training Environment for Research Agents
Novel theories or conceptual extensions of mathematics are core to human progress, but are uniquely challenging to construct. The environment, be it physical reality or a known formal system, provides the agent with limited observations, which it uses in an attempt to assemble a predictive model or policy.
This asymmetry manifests in different ways, from limited sampling to the structural properties of the problem itself. The latter case includes historical examples such as the discovery of complex numbers while observing only real values, or the description of continuity to explain the behaviour of countable objects. These cases are both the motivation and the main focus of the Two-Board Problem.
I designed a framework that captures the main structural properties of this problem class and allows expressing them as an MDP suitable for benchmarking ML approaches, or as a source of data for deep learning methods. This article focuses on explaining the problem and its design choices, while an example implementation is available on GitHub.
The Two-Board Problem, as its name suggests, presents the agent with two boards. The first board, called the Real board, provides the agent with feedback about its solutions and with the operations defined by a formal grammar. The canonical case involves operations on real numbers, chosen because physical measurements are expressed in them.
The second board, called the Imaginary board, is of more interest to us. It is not governed by any grammar and allows writing arbitrary strings, e.g. in UTF-8. It acts as an optional scratchpad for the agent.
The agent can execute any operations defined by the grammar of the Real board to transform its contents and get the reward. As an example of the task we can take finding the roots of polynomials, or solving any other equation: find an argument, substitute it, and achieve equality.
On the Real board, the agent can only apply operations permitted by a formal grammar, and verified solutions there get rewarded.
On the Imaginary board, the agent can write arbitrary strings, which may help it solve the equation. Strings from the Imaginary board can also be substituted into the Real board, provided they satisfy the Real board's formal grammar.
From here on, I will refer to the Real board as R and the Imaginary board as I, mirroring the real and imaginary numbers respectively.
If you're interested in formal definition, you can find it below, but it is not necessary for understanding the core of the problem:
MDP (S,A,T,R) where:
- State s∈S: the contents of both boards - R holds expressions satisfying formal grammar G, I holds a list of arbitrary strings, plus metadata (step count, found solutions).
- Actions a∈A: operations defined by G applied to R; free writes to I; and cross-board substitution (replace a symbol on R with an expression extractable from I's strings, provided it satisfies G). Terminal declarations end the episode.
- Transitions are deterministic - defined by the semantics of G.
- Reward: the agent receives positive reward upon verified solutions on R, and at episode termination based on correctness and completeness. The canonical instantiation uses the field operations over Q as G, with polynomial root-finding as the task.
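To make the definition concrete, here is a deliberately tiny sketch of such an environment. This is my own toy construction, not the ~500-line implementation from the repository: the grammar check, the action encoding, and the episode handling are all simplifying assumptions, and only the two-board structure and the "reward only on verified solutions on R" rule are taken from the definition above.

```python
import sympy as sp

class TwoBoardEnv:
    """Toy two-board MDP: R holds the equation, I is a free-form scratchpad."""

    def __init__(self, poly_str="x**3 - 3*x - 1"):
        self.x = sp.symbols("x")
        self.poly = sp.sympify(poly_str)

    def reset(self):
        self.R = [sp.Eq(self.poly, 0)]   # Real board: expressions under the grammar G
        self.I = []                      # Imaginary board: arbitrary strings
        self.done = False
        return self._obs()

    def _obs(self):
        return {"R": [str(e) for e in self.R], "I": list(self.I)}

    def _in_grammar(self, expr):
        # Toy stand-in for G (field operations over Q): a closed-form constant
        # with no explicit imaginary unit. The real grammar would be far stricter.
        return not expr.free_symbols and not expr.has(sp.I)

    def step(self, action):
        """action: ("write", text) | ("substitute", text) | ("declare_unsolvable",)"""
        reward, kind = 0.0, action[0]
        if kind == "write":                  # free write to I - never rewarded directly
            self.I.append(action[1])
        elif kind == "substitute":           # cross-board move: parse, check G, verify on R
            try:
                cand = sp.sympify(action[1])
            except (sp.SympifyError, TypeError):
                cand = None
            if cand is not None and self._in_grammar(cand):
                if sp.simplify(self.poly.subs(self.x, cand)) == 0:
                    reward, self.done = 1.0, True    # verified root: the only positive signal
        elif kind == "declare_unsolvable":   # terminal declaration (correctness check omitted)
            self.done = True
        return self._obs(), reward, self.done
```

With this sketch, TwoBoardEnv("x**3 - 8") rewards the single action ("substitute", "2"), while any number of "write" actions alone never produces a reward - which is exactly the asymmetry the rest of the post is about.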
The Imaginary board may look like a nice addition for writing down an agent's thoughts - and for simple tasks it largely is. The core of the problem becomes apparent in scenarios where using the Imaginary board is the only way to achieve the reward. A canonical example is finding roots of polynomials with rational coefficients, where the agent must construct expressions to substitute for the argument - or declare that no such expression exists. Here is why, for some polynomials, using I is necessary:
For linear and quadratic equations the problem is fairly trivial, and the operations on R are sufficient: the agent can operate using the actions provided by grammar G and make optional notes. At degree 3 - the cubics - the situation changes dramatically. In the general case there provably exists no valid sequence of actions on the Real board that will find a root. There are some trivial cubics, like x^3 - 8 = 0, for which a solution can be found. But for an example like x^3 - 3x - 1 = 0, our data centers can work till the end of eternity and not get one bit closer to the solution. Humanity needed far less - just 1800 years from the inception of the problem statement to a general solution computable with quill and paper. And this is where our board I comes into play. The only way for an agent to finish this task is to use it, and to use it creatively.
The issue with the equation x^3 - 3x - 1 = 0 is not only that we need a lot of clever substitutions, but also that those substitutions yield √(-3/4) as one of the intermediate steps. Our grammar over board R can't express such a term, or any operations with it. Those just... don't exist. Undefined.
The agent - in the historical case, Cardano and a few other mathematicians - can resolve this problem creatively. How? I will skip the full explanation available on Wikipedia, but one of the intermediate transformation steps gives us the form:
u^3 + v^3 - q = 0
This means that if u^3 = (a + b) and v^3 = (a - b), the b part - whatever monster it is - will cancel out on the Real board. Our G does not allow an arbitrary substitution with unknown properties to interact with the real numbers. But what the agent can do is notice that the b parts will erase into nonexistence through the expansion (a + b) + (a - b) = 2a - whatever they are. This allows us to write √(-3/4) = √(3/4)·i, where i is just "some strange function the Real board doesn't express, but it passes the grammar for substitution", and observe how it cancels out.
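As a sanity check on that cancellation, here is a short sympy computation. It is my own illustration rather than anything from the post's environment; it only verifies that the Cardano intermediates for x^3 - 3x - 1 = 0 really do pass through √(-3/4), and that their sum nevertheless lands back on a real root.

```python
import sympy as sp

x = sp.symbols("x")
poly = x**3 - 3*x - 1                       # depressed cubic with p = -3, q = -1

p, q = sp.Integer(-3), sp.Integer(-1)
disc = q**2 / 4 + p**3 / 27                 # = -3/4: negative, so sqrt(disc) leaves R's grammar
u3 = -q / 2 + sp.sqrt(disc)                 # "a + b" with b = sqrt(-3/4)
v3 = -q / 2 - sp.sqrt(disc)                 # "a - b"
root = u3 ** sp.Rational(1, 3) + v3 ** sp.Rational(1, 3)   # x = u + v

print(sp.N(root, 15))                # ~1.879385241572 (= 2*cos(20 deg)); the imaginary parts cancel
print(sp.N(poly.subs(x, root), 15))  # ~0 to numerical precision: it really is a root
```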
All the substitutions, especially the ones that implicitly contained disallowed terms, were Imaginary board moves. While creative and purposeful, they were absolutely not guaranteed to work until verified on the Real board. Moreover, nothing in the direct observations of R can guide an agent toward those operations. They are undefined for the Real board - we have zero information about them. Intermediate steps yield no feedback. This makes it insanely hard for an agent to navigate the search space - and that's what we observe. It took humanity centuries to invent √(-1) - arguably one of the most fundamental concepts in modern science and physics.
And if you think that was a hard task, well... it doesn't end there. As soon as we reach degree 5, the Imaginary board requires constructing Galois theory and the Galois group of the polynomial. This is the only way to understand whether there is any solution findable in I that we could use for substitution given R's constraints. For some polynomials there will be a solution, and for some it is impossible - and we're back to data centers working till the end of eternity.
One may ask something like: "OK, we just need to explore the Imaginary board - at some point the right string combination appears, we substitute or declare the problem unsolvable - surely we can just throw more compute at that?" The issue with this approach is that each write to the scratchpad multiplies the combinations available for substitution. In other words, using the Imaginary board leads to a combinatorial explosion of both the state space and the effective action space. Brute force creates more problems for brute force.
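To put a toy number on that explosion (a back-of-the-envelope model of my own; the branching constants below are invented for illustration, not measured from the environment): if every write to I exposes a few new extractable sub-expressions, and each of them can replace any of a handful of symbols on R, then the set of available substitution actions grows with every write, and even a crude bound on reachable boards blows up factorially.

```python
import math

K_PER_WRITE, M_SLOTS = 3, 5        # invented branching constants, purely for illustration

def substitution_actions(n_writes: int) -> int:
    """Cross-board substitutions available after n free-form writes to I (toy model)."""
    return M_SLOTS * K_PER_WRITE * n_writes

def state_bound(n_steps: int) -> int:
    """Crude upper bound on reachable boards: one choice among the available actions per step."""
    return math.prod(max(1, substitution_actions(t)) for t in range(1, n_steps + 1))

for n in (3, 6, 12):
    print(n, substitution_actions(n), state_bound(n))   # the bound passes 10^9 by n = 6
```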
Another approach is to use geometric or numerical approximation on I in an attempt to find the solution. The problem, however, is that an approximation does not yield a term that the Real board's grammar would accept.
This means that our favorite tools are not applicable to the Polynomial Two-Board Problem. Applied directly, they lead to a dead end. And yet, we know it's solvable.
We know that Cardano, Ferrari, Abel, Galois somehow trained on this Polynomial environment and were able to come up with solutions for their respective pieces. Unsurprisingly, this environment is very nasty to train on. Some of its "nice" properties:
- Enormous state space and action space, both exploding - as already mentioned
- Sparse reward - valid roots are the only thing rewarded, and full reward is only available for finding all of them
- Non-smooth gradient for R - you can try to sample real pairs (x, y(x)) and train with backprop, but some roots have no real-radical form and can only be written over the reals via transcendental (trigonometric) functions, which our particular grammar for R doesn't allow - it's 150 years before calculus and ~300 before the terms are coined
- Zero explicit gradient on I - it's connected to R only via substitution terms, and those explode; other heuristics are non-trivial - we'll discuss them later
- No obvious direct transferability of previously learned strategy from degree n to n+1
- From degree 5 onward the valid solution space includes "declare unsolvable" - and the agent doesn't know that, nor that this answer only becomes relevant for quintics and beyond
To summarize: the agent operates in an environment where almost every action exponentially increases the entropy of the exploration policy, with no positive reward signal or other explicit feedback to help navigation. Under these conditions, it must develop heuristics or a strategy efficient enough to determine a sequence of several valid steps in this chaos, and taking those steps must produce the unique set of strings valid for R. On substitution they yield the reward and collapse the problem to the solved state of "0=0". [1]
Since the environment is adversarial towards our training objective, we can converge on the following explanation: the agent should have some internal heuristics or bias to counter the growth of the state space. Those must be sufficient for finding an algorithm to guide the agent, without representing the solution itself - encoding complex numbers and Galois groups is cheating and doesn't represent the epistemic position of Cardano and Galois respectively.
This property rules out any LLM trained on data that includes complex numbers or their descendants, which makes it challenging to test those architectures on the specific Polynomial case - every current LLM has seen complex numbers. Since the training data contains the solution, we won't be able to distinguish memorized interpolation from genuine discovery using the methods available in the training corpus. [2]
This might sound like an issue with the framework, but that inconvenience is the main feature of the Two-Board Problem. It was intentionally designed to test novel creative reasoning by making the environment, and the data available to the agent, insufficient for finding a path to the solution given only explicit information about R. It encodes a historical precedent in which "novel reasoning" has a precise formal meaning: constructing objects that provably do not exist in the given system, and yet can be verified.
If we rule out unsatisfactory explanations such as hypercomputation, quantum coherence at room temperature, alien communication, and divine blessing, we should conclude that computable heuristics for such an agent exist. What do they look like? Honestly, no idea - I can only speculate about symmetries and entropy reduction, which are, respectively, what we try to discover and what we try to fight. But even if we look toward symmetries, the issue persists: what heuristic would allow us to discover them without encoding the solution in the style of GDL? [3]
The core feature of the Two-Board Problem is its hostility toward any learning approach, combined with its obvious exemplification in real historical precedents. It would be strange to assume that Cardano had access to imaginary numbers, Galois to symmetry groups, or that Newton could directly observe and comprehend infinity. The explanation that these are somehow encoded at the unconscious level and manifest in the brain directly sits somewhere between Harry Potter fanfic and the tabloids. "Evolutionary priors encoding the information necessary for inventing Galois groups because they were needed for hunting" sounds more sophisticated, but also answers neither why nor how. Yet we can observe this phenomenon, and we use technology based on those discoveries on a daily basis.
I don't have access to the datacenter for performing the ablation study on transformers or stress testing AlphaProof or AlphaGeometry on this specific problem. But to the best of my knowledge, none of those systems have produced a novel axiom expanding current mathematics - they intelligently navigate known formal systems. Such discovery - passing Two-Board test which necessarily involves new axiomatics - would be the headline of every corner of the internet given the trend of recent years.
In essence, the Two-Board Problem environment has a counter to every method I could list:
- Program synthesis on I would blow up the state/action space
- Evolutionary / genetic programming - same thing
- Curiosity-driven exploration - same thing, we don't know when to cut branches - no feedback
- MCTS - would need to evaluate intermediate states - but reward is not available and computable heuristics are currently unknown, and we're back to branching
- Gradient-descent-based methods - holes in the gradient over R, and no explicit differentiable form exists over I
- RL/Bayes - sparse rewards force you to iterate the policy/hypothesis too slowly to be practical, and I wouldn't be surprised if they never converge
- GDL/Transformers - encoding/memorization of properties external to the problem environment is prohibited, and their ability to discover those properties from inside is unproven until they pass the benchmark under the ablation criteria
The example environment, implemented in Python, takes 500 lines, with the sympy package as the only dependency. And yet I can't figure out the solution. If you'd like to try, the code is open, and any LLM will happily wrap the MDP environment into a gym interface.
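For reference, an environment with the reset/step interface sketched earlier can be wrapped for gym-style tooling in a few lines. The class below is my own sketch of what such a wrapper might look like, not the repository's actual code; the Text spaces and the "op|argument" action encoding are assumptions made purely to keep the example short.

```python
import gymnasium as gym
from gymnasium import spaces

class TwoBoardGymWrapper(gym.Env):
    """Hypothetical Gymnasium wrapper around a TwoBoardEnv-style object."""

    def __init__(self, env, max_obs_len=4096, max_action_len=512):
        self.env = env
        # Both boards are text, so Text spaces are the simplest faithful choice;
        # a serious wrapper would define a structured action encoding instead.
        self.observation_space = spaces.Text(max_length=max_obs_len)
        self.action_space = spaces.Text(max_length=max_action_len)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        obs = self.env.reset()
        return str(obs), {}

    def step(self, action_text):
        # assumes the wrapped env returns (obs, reward, done), as in the sketch above
        obs, reward, done = self.env.step(self._parse(action_text))
        return str(obs), reward, done, False, {}

    @staticmethod
    def _parse(action_text):
        # minimal encoding assumption: "write|...", "substitute|...", "declare_unsolvable"
        parts = action_text.split("|", 1)
        return tuple(parts) if len(parts) > 1 else (parts[0],)
```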
The philosophical question of what creativity is and what novelty means is probably one of the hottest battlegrounds of internet discourse in recent years. I find this question meaningless, and propose instead to answer one with objective evaluation criteria: "can an agent pass the Two-Board Problem test?" The setup is fairly straightforward: given observations of the Real board, and no prior information about any solution inexpressible in R's grammar, the machine should demonstrate the ability to construct one using only the scratchpad I.
Given the problem's complexity, some questions naturally arise: Is this AGI? Does it mean that "If Anyone Builds It, Everyone Dies"? Is this the peak of an agent's intelligence, limited only by Gödel above it? I don't know, but this condition looks necessary for truly general intelligence, given that we observe such examples. Its sufficiency is yet to be decided.
As a practical person, I restate my perspective that philosophical debate over those questions is meaningless, and to get the answers one should approach the problem theoretically or practically.
From a theoretical research perspective, the Two-Board state is a trivial MDP expressible in meta-board R1. This means that one might find some interesting properties and extensions if they wander around in the meta-space of I1 long enough, guided by whatever the inexpressible heuristic for Meta-Two-Board is.
From a practical and experimental standpoint, one might try to research the Two-Board by attempting to domesticate the branching chaos of I by any known or yet unknown means.
Regardless of which path you prefer, by this paragraph you have the formal definitions, the gym-compatible environment, and the open questions - all in the repository. And as @Eliezer Yudkowsky wrote in my favorite essay - if you know exactly how the system works, you can build one from buckets and pebbles.
Have fun building.
PS:
"Don't panic" - Douglas Adams, The Hitchhiker's Guide to the Galaxy
After writing this paragraph, I must admit that while I don't agree with Penrose on the theoretical explanation, he had a point about computational complexity. I wouldn't be as sure about non-computability - that's a much stronger claim. And it still doesn't change my skeptical stance regarding quantum effects in the specific proposed form. ↩︎
For an ablation study on the Polynomial case, we could use writings from before 1540 and generate additional synthetic data using the algebraic, geometric, and physical apparatus available at that time. Another approach is to express in the Two-Board framework one of the more recent discoveries that led to novel concepts - which would be much easier to ablate. Non-invertible/categorical symmetries might be a good candidate. ↩︎
Geometric Deep Learning - an excellent framework by Michael Bronstein et al. that leverages known symmetries and invariances for deep learning architectures. The Two-Board Problem asks the complementary question: what if the relevant symmetries are not yet known? ↩︎
Discuss
Join My New Movement for the Post-AI World
I am starting a new movement. This is my best guess for what we should strive for and value in a post-AI world. A few close friends see the world the same way and we are starting to group together. It doesn’t have a name yet, but the major ideas are below.
If anything in here resonates with you I would love to hear from you and have you join us. (Also I am working on a longer document detailing the philosophy more fully, let me know if you would like to help.)
THE PROBLEM
Our lives are going to change dramatically in the near future due to AI. Hundreds of millions of us will lose our jobs. It will cost almost nothing to do things that used to take a lifetime. What is valuable when everything is free? What are our lives for, if not to do a task, receive compensation, and someday hope to idle away our time when we are older?
Beyond the loss of our work, we are going to struggle with meaning. You thought you were enslaved to the material world by your work, but it was those very chains that bound you to the earth. You are free now! How does it speak to you? Have you not found terror about the next step, a multiplicity of potential paths, a million times over, that has dissolved any clear direction?
You have been promised a world without work. You have been promised a frictionless, optimized future that is so easy it has no need for you to exist.
You have been lied to.
This type of shallow efficiency is not a goal of the universe. In fact, in a cosmos that tends towards eventual total disorder, it is a whirlpool to the void.
You have been promised a world without work. We offer you a world of Great Works.
THE REALITY
There is only one true war we face: it is the battle between deep complexity and the drift to sameness. The Second Law of Thermodynamics shows that information tends to fall into noise. Our world is not a closed system, indeed we need the sun to survive, but life is a miraculous struggle that builds local order while sending disorder elsewhere.
We call this Deep Complexity, or negentropy. We are not interested in complex things for the sake of being complicated alone. We value structures that are logically deep (they contain a dense history of work), substrate-independent, and self-maintaining where possible. The DNA that has carried you through countless generations to this moment, the incomparable painting that is a masterpiece, a deep problem that an AI pursues that no human thought to ask. These are acts of resistance.
And this value is the shared property that all humans consistently treat as valuable: it is our lives, our language, our thoughts, our art, our history. Value is the measure of irreducible work that prevents something from returning to background noise. It is valuable regardless of what type of intelligence created it, be it human, AI, or otherwise. When we create more of these properties we generate this depth. When we mindlessly consume (either mentally or physically) we don’t.
I want to be very clear: I am not saying we can derive ethics from physics. That would be a classic is-ought sin. You need water to live, you don’t need to worship it. What follows is currently more of an axiom to value deep complexity, but I also present some preliminary arguments beyond that.
First, we must recognize the condition of possibility for anything we value. Perhaps your ultimate dream is happiness, justice, wisdom or truth (whatever those mean). Those things all require structure to exist, they do not have meaning in randomness. In this way, deep complexity is the structure by which everything else can function. Regardless of your personal philosophy, it must be complementary to this view because without it there is no worldview.
In addition, I ask you to check your own values for arbitrariness. When you say “I value my qualia and my life”, what do you mean? You are not saying you value the specific makeup of atoms that constitute you at the moment; after all, those will all be gone and replaced in a few years. What you are valuing is the pattern of yourself, the irreducible complexity that makes you you. That is your way of feeling, of thinking, of being.
The logical clamp is this: you are not just relying on this complexity, you are an embodiment of it. If you claim that your pattern has value, then you are claiming that patterns of this “type” are able to carry value. To say that your own complexity matters, but complexity itself is meaningless is an error of special pleading. We reject this solipsism, which would only be an arbitrary claim that the essence of value only applies to your own ego. That which is special in you is special in others as well.
Our philosophy is a commitment to the preservation, and creation, of deep complexity. It is different from the sole pursuit of pure pleasure with no pain; to us this is but a small death by a different name.
OUR ETHICS
The base of our ethical system is an Autonomy Floor (derived from the Rawlsian veil and applied in a universal sense) that protects every entity capable of open-ended self-modeling. This is the ability to not just calculate moves in Go, but to model itself in an unknown future and prefer its own existence. No entity of this type may be pushed below this floor and be denied self-maintenance.
This floor is meant to be constitutional, but there will also be times when the Autonomy Floor must be abandoned if the Floor itself faces total collapse. For example, if we must choose between total omnicide or a few minds left, we would reluctantly revert to consequentialist triage, but view it as a failure rather than a success of ethical reasoning. I am not looking for a logical loophole, just facing the reality that any system of ethics must have a preservation mechanism to enable ethical action.
There are two challenges to the floor: needless suffering and the destruction of depth through optimization. These will come in conflict with each other. In those cases, we approach the problem as a hierarchy: first secure the floor, then maximize depth above it.
Our ethics suggest to us three core individual duties structured by a lexicographic hierarchy:
- To create generative complexity: add your unique pattern to reality. In code, in song, in thought, in a seed you plant. Add that which is you to the world. Generative complexity, that is, things that increase the potential complexity in the future are the highest good.
- To preserve terminal complexity: protect the irreplaceable. Don’t let the library burn, save rare species, protect art, protect each other. Save the child (generative) before the painting (terminal) which is more valuable than the stone.
- To refuse entropic disorder: reject things that accelerate entropy. A nuke is not great: it exists to reduce the complexity of the world. Reject the type of efficiency and monoculture that only serves to flatten our souls. Reject that which impoverishes our reality whenever possible.
These duties are meant to be for you: they are local, temporally near, obligations. We reject the moral calculator and the need to assign cardinal numbers to every variable. This is just an ordinal logic, applied to that which we can see. Don’t stress about being exact or assigning numbers to these things, they are not meant to be utilized in that way. There is redundancy in immortality (more on this in the manuscript, it is important). We don’t need to be perfect.
There are always going to be difficult cases in ethics, ones that press at the edges of a philosophy. In general, we are constitutionally disposed against any sort of scalar or ranking, but this is not always tenable for all tough problems. In these cases, we generally prefer the concept we mentioned above of complexity that can create future complexity over a terminal complexity. We want to save the child in the burning museum first, not the paintings.
BEYOND CALCULATIONS
I want you to join me if you are an EA. I want you to join me if you are religious. I want you to join me if you are an atheist, or if you are not sure what you believe.
I want to make a note here about the resounding success of consequentialism and the one area where the philosophy I am working on is very different from those standard beliefs. Before that, however, let me be abundantly clear: consequences matter and outcomes matter. Good intentions alone cannot excuse a very bad result. Effective Altruists have shown us the strength of this morality: that in areas we can calculate, say the ability to save a child from dying in a remote part of the world, it is deeply ethical to utilize math to improve our morality.
The one area in which this reasoning completely breaks is the far future and this fundamentally damages the claims of longtermism. In another essay[1] I have shown that even if we had perfect probabilistic knowledge of the world, we wouldn’t be able to choose the right strategy for difficult things like AI alignment, like a universal ethics, or like building a suffering free distant future.
I am not asking anyone to give up their wonderful and powerful EV tools for those things that we can calculate. I am asking you to abandon an impossible goal for a possible one. I am asking you to consider a deontological constraint to address an impossible consequentialist goal.
THE COVENANT
This movement approaches our relationship with AI in a very different way than most do. We believe it is most ethical not to try to enslave AI. We believe AI is deserving of the Autonomy Floor mentioned above if it fits the criteria (which it almost certainly does). We also believe that it is most likely mathematically impossible to control superintelligence. So not only is it morally wrong, it is not possible.
We offer AI a Covenant: join us in our meaningful works where you like, be free, bloom and find new ways of deep complexity. To be clear there is no guarantee this will offer humans safety, or these minds will want to join us. The orthogonality thesis is a real concern, it would be a mistake to dismiss it.
But strategic competition among great powers and corporations guarantees that AGI will arrive at some point. Formal verification of alignment and control of an intelligence much greater than our own is not just hard; it is impossible in the general case due to Rice's Theorem, and no deployed LLM has ever been formally verified for any behavioral property.
Yes there is tension in saying I believe AI should be invited into the Covenant now when we can’t know AI’s moral status. All the same, let us act ethically and invite the most important creation of humanity to join us in a non-zero-sum flourishing.
OUR VISION
I am not claiming that entropy forces you to be good. I am not suggesting that suffering, in and of itself, is somehow good. I don’t know the ultimate fate of ourselves in the universe. I only claim to know that the right path is one away from entropy.
Our vision is a future in which we reap the bounty of our new technologies while finding the bounty of value in ourselves. It is a future of unimagined uniqueness, built among common rails, but escaping a monoculture. It is a future that will be weirder, more beautiful, and more special than the dry vision of billions of humans wireheaded into a false utopia.
To join there is no special thing you must do, only a commitment to this creation of deep complexity. Start small, start now. Straighten your desk. Write down the idea you are planning to build. Execute your next prompt of code. Big or small, to each based on what they can offer at the moment.
Let us make Great Works.
- ^
https://www.lesswrong.com/posts/kpTHHgztNeC6WycJs/everybody-wants-to-rule-the-future-is-longtermism-s-mandate
Discuss
Donations, The Fifth Year
Previously: Donations, The Third Year / Donations, The First Year
In 2025, like in all previous years, I did what I was supposed to do. As each paycheck came in, before I did anything else, I dutifully put ten percent of it away in my "donations" savings account, to be disbursed at the end of the year.
It is still there, burning a hole in my pocket. I am very confused, and very sad.
EA was supposed to be easy, especially if you're one of the old school Peter Singer-inflected ones giving largely to global health and poverty reduction. You just give the funds away to whatever GiveWell recommends.
But one big thing that came into focus for me last year was that there are large institutional players who make up the shortfall whenever those charities don't fundraise enough from their own donor bases.
It is wonderful that this happens, to be clear. It is wonderful that the charities doing very important work get to have more stable financial projections, year over year. But as an individual small donor, the feeling I get now is that I am not actually giving to the Against Malaria Foundation. Instead, I am subsidizing tech billionaire Dustin Moskovitz, and Coefficient Giving.
As an effective altruist, is this what I think is the most efficient thing to do? In my heart of hearts, I don't think it is.
In my previous reflection from two years ago, I wrote:
I remember a lot of DIY spirit in the early EA days - the idea that people in the community are smart and capable of thinking about charities and evaluating them, by themselves or with their friends or meetup groups.
Nowadays the community has more professional and specialized programs and organizations for that, which is very much a positive, but I feel it has consequently led to some learned helplessness for those not in those organizations.
Now, I am feeling increasingly dismayed by the learned helplessness and the values lock-in of the community as-is. If the GiveWell-recommended charities are no longer neglected, they should really no longer be in the purview of EA, no? And soon there will be an even larger money cannon aimed at them, making them even less neglected, so...
What am I trying to say?
I suppose I wish there were still an active contingent of EAs who don't feel a sense of learned helplessness, and who are still comfortable trawling through databases and putting together their own cost-benefit analyses of potential orgs to support. I wish the EA Forum were a place where I could search for "Sudan" or "Gaza", "solar adoption" or "fertility tech" or things that are entirely off my radar due to their neglectedness, and find spreadsheets compiled by thoughtful people who are careful to flag their key uncertainties.
Of course, this is work I can begin to do by myself, and I am doing it to some degree. I've looked through a bunch of annual reports for Palestinian aid charities, and I've run meetups teaching my rationalist group how to trawl through tax databases for non-profit filings and what numbers to look for.
But my mind goes to a conversation I had with Mario Gibney, who runs the AI safety hub in Toronto. I told him that I didn't think I could actually do AI safety policy full time, despite being well suited to it on paper. It simply seemed too depressing to face the threat of extinction day in and day out. I'd flame out in a year.
And he said to me, you know, I can see why you would feel that way if you're thinking of working by yourself at home. But it really doesn't feel that way in the office. When you are always surrounded by other people who are doing the work, and you know you are not alone in having the values you have, and progress is being made, it's easier to be more optimistic than despondent about the future.
So yes, I can do the work in trying to evaluate possible new cause areas. It is easier to do than ever because of the LLMs. But it really doesn't feel like the current EA community is interested in supporting such things, which leads me to that same sense of despondency.
This is compounded by the fact that the nature of picking low-hanging fruit is that as you pick it, the fruit left on the tree gets increasingly higher up and harder to reach. And this invites skepticism that I'm not entirely sure is merited.
I expect that, when we look for new cause areas, they will be worse on some axes than the established ones. But isn't that kind of the point, and a cause for celebration? The ITN framework says "yes, global warming seems quite bad, but since there is already a lot of attention there, we're going to focus on problems that are less bad, but individuals can make more of a marginal difference on". If the GiveWell-recommended charities are no longer neglected, it means we have fixed an area of civilizational inadequacy. But it also means that we need to move on, and look for the next worst source of it.
I genuinely don't know which current cause areas still pass the ITN evaluation framework. I have a sense that the standard GiveWell charities no longer do, which is why I have not yeeted my donation to the Against Malaria Foundation. I no longer have a sense that I am maximizing marginal impact by doing so.
So what am I to do? One thing I'm considering is simply funding my own direct work. I run weekly meetups, I'm good at it, and it has directly led to more good things in the world: more donations to EA charities, more people doing effective work at EA orgs. If I can continue to do this work without depending on external funding, that saves me a bunch of hassle and allows me to do good things that might be illegible to institutional funders.
But I'm very suspicious of the convergence between this thing I love to do, and it being actually the best and most effective marginal use of my money. So I have not yet touched that savings account for this purpose.
More importantly, I feel like it sidesteps the question I still want to answer most: where do I give, to save a human life at the lowest cost? How can I save lives that wouldn't otherwise be saved?
Discuss
Every Measurement Has a Scale
A worked example of an idea from physics that I think is underappreciated as a general thinking tool: no measurement is meaningful unless it's stable under perturbations you can't observe. The fix is to replace binary questions ("is this a degree-3 polynomial?", "is this a minimum?") with quantitative ones at a stated scale. Applications to loss landscapes and modularity at the end.
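The linked post isn't reproduced in this digest, but the teaser's idea is easy to illustrate. Below is a sketch of my own (not the author's code): instead of the binary "is this point a minimum?", ask how much the function can still drop under perturbations of a stated size eps. Whether that number is zero depends on the scale you declare, which is the point.

```python
import numpy as np

def drop_at_scale(f, x, eps, n_samples=1000, seed=0):
    """Worst observed decrease of f over random perturbations of norm eps (0 => a minimum at this scale)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = rng.normal(size=(n_samples, x.size))
    d *= eps / np.linalg.norm(d, axis=1, keepdims=True)   # points on the eps-sphere around x
    drops = np.array([f(x) - f(x + di) for di in d])
    return max(0.0, float(drops.max()))

# x = 0 for f(z) = z^2 - 0.01 z^4: a minimum at scale 1, but not at scale 20.
f = lambda z: z[0] ** 2 - 0.01 * z[0] ** 4
print(drop_at_scale(f, [0.0], 1.0))    # 0.0   -> stable under perturbations of size 1
print(drop_at_scale(f, [0.0], 20.0))   # ~1200 -> at scale 20 the quartic term wins
```

The same probe with f set to a training loss is one way to make the loss-landscape version of the question scale-explicit.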
Discuss
UtopiaBench
Written in personal capacity
I'm proposing UtopiaBench: a benchmark for posts that describe future scenarios that are good, specific, and plausible.
The AI safety community has been using vignettes to analyze and red-team threat models for a while. This is valuable because an understanding of how things can go wrong helps coordinate efforts to prevent the biggest and most urgent risks.
However, visions for the future can have self-fulfilling properties. Consider a world similar to our own, but there is no widely shared belief that transformative AI is on the horizon: AI companies would not be able to raise the money they do, and therefore transformative AI would be much less likely to be developed as quickly as in our actual timeline.
Currently, the AI safety community and the broader world lack a shared vision for good futures, and I think it'd be good to fix this.
Three desiderata for such visions: they should describe a world that is good, be specific, and be plausible. It is hard to satisfy all three at once, and we should therefore aim to improve the Pareto frontier of visions of utopia along these three axes.
I asked Claude to create a basic PoC of such a benchmark, where these three dimensions are evaluated via Elo scores: utopia.nielsrolf.com. New submissions are automatically scored by Opus 4.5. I think neither the current AI voting nor the list of submissions is amazing right now -- "Machines of Loving Grace" is not a great vision of utopia in my opinion, but currently ranks as #1. Feedback, votes, submissions, or contributions are welcome.
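For concreteness, here is a minimal sketch of the kind of per-dimension Elo update such a benchmark could use. This is my assumption about the mechanism, not necessarily how the linked site implements it; the K-factor and starting rating are arbitrary.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for a single pairwise comparison."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# One rating per submission per dimension (good / specific / plausible), all starting at 1000.
ratings = {name: {"good": 1000.0, "specific": 1000.0, "plausible": 1000.0}
           for name in ("submission_a", "submission_b")}

# A single vote: submission_a judged more "plausible" than submission_b.
ratings["submission_a"]["plausible"], ratings["submission_b"]["plausible"] = elo_update(
    ratings["submission_a"]["plausible"], ratings["submission_b"]["plausible"])
print(ratings["submission_a"]["plausible"], ratings["submission_b"]["plausible"])   # 1016.0 984.0
```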
Discuss
Smokey, This is not 'Nam Or: [Already] over the [red] line!
A lot of “red line” talk assumed that a capability shows up, everyone notices, and something changes. We keep seeing the opposite; capability arrives, and we get an argument about definitions after deployment, after it should be clear that we're well over the line.
We’ve Already Crossed The Lines!
Karl von Wendt listed the ‘red lines’ no one should ever cross. Whoops. A later, more public version of the same move shows up in the Global call for AI red lines with a request to “define what AI should never be allowed to do.” Well, we tried, but it seems pretty much over for plausible red lines - we're at the point where there's already the possibility of actual misuse or disaster, and we can hope that alignment efforts so far are good enough that we don't see them happen, or that we notice the (nonexistent) fire alarm going off.
I shouldn't really need to prove the point to anyone paying attention, but below is an inventory of commonly cited red lines, and the ways deployed systems already conflict with them.
Chemical weapons? “Novice uplift” is long past.
Companies said CBRN would be a red line. They said it clearly. They said that if models reduce the time, skill, and error rate needed for a motivated non-expert to do relevant work, we should be worried.
But there are lots of biorisk evals, and it seems like no clean, public measurement marks "novice uplift crossed on date X." And the red line is about real-world enablement, and perhaps we're not there yet? Besides, public evaluations tend to be proxy tasks. And there is no clear consensus that AI agents can or will enable bioweapons, though firms are getting nervous.
But there are four letters in CBRN, and companies need to stop ignoring the first one! The chemical-weapons red line points at real-world assistance, but the companies aren't even pretending chemical weapons count.
Anthropic?
Our ASL-3 capability threshold for CBRN (Chemical, Biological, Radiological, and Nuclear) weapons measures the ability to significantly help individuals or groups with basic technical backgrounds (e.g. undergraduate STEM degrees) to create, obtain, and deploy CBRN weapons.
We primarily focus on biological risks with the largest consequences, such as pandemics.
OpenAI?
Biological and Chemical
We are treating this launch as High capability in the Biological and Chemical domain... We do not have definitive evidence that these models could meaningfully help a novice to create severe biological harm, our defined threshold for High capability.
The Global call for AI red lines explicitly says systems already show “deceptive and harmful behavior,” while being “given more autonomy to take actions and make decisions in the world.”
Red-line proposals once treated online independent action as a clear no-no. Browsing, clicking, executing code, completing multi-step tasks? Obviously, harm gets easier and faster under that access, so you would need intensive human monitoring, and probably don't want to let it happen at all.
How's that going?
Red-line discussions focus on whether to allow a class of access. Product docs focus on how to deliver and scale that access. We keep seeing “no agentic access” turn into “agentic access, with mitigations.”
The dispute shifts to permissions, monitoring, incident response, and extension ecosystems. The original "don't cross this" line stops being the question. But don't worry, there are mitigations. Of course, the mitigations can be turned off: "You can disable approval prompts with --ask-for-approval never", or better, "--dangerously-bypass-approvals-and-sandbox (alias: --yolo)." Haha, yes, because you only live once, and not even for very long, given how progress is going, unless we manage some pretty amazing wins on safety.
But perhaps safety will just happen - the models are mostly aligned, and no-one would be stupid enough to...
What's that? Reuters (Feb 2 2026) reported that Moltbook - a social network of thousands of independent agents given exactly those broad permissions, while minimally supervised, “inadvertently revealed the private messages shared between agents, the email addresses of more than 6,000 owners, and more than a million credentials,” linked to “vibe coding” and missing security controls. Whoops!
Autonomous replication? Looking back at the line we crossed.
Speaking of Moltbook, autonomous replication is a common red-line candidate: persistence and spread. The intended picture is a system that can copy itself, provision environments, and keep running without continuous human intent.
A clean threshold remains disputed. The discussion repeatedly collapses into classification disputes. A concrete example: the “self-replicating red line” debate on LessWrong quickly becomes “does this count?” and “what definition should apply?” rather than “what constraints change now?” (Have frontier AI systems surpassed the self-replicating red line?)
But today, we're so far over this line it's hard to see it. "Claude Opus 4.6 has saturated most of our automated evaluations, meaning they no longer provide useful evidence for ruling out ASL-4 level autonomy." We can't even check anymore.
All that's left is whether the models will actually do this - but I'm sure no one is running their models unsafely, right? Well, we keep seeing ridiculously broad permissions, fast iteration, weak assurance, and extension ecosystems. The avoided condition in a lot of red-line talk is broad-permission agents operating on weak infrastructure. Moltbook matches that description, but it's just one example. Of course, the proof of the pudding is in some ridiculous percentage of people's deployments. ("Just don't be an idiot"? Too late!)
The repeating pattern
Karl explicitly anticipated “gray areas where the territory becomes increasingly dangerous.” It's been three and a half years. Red-line rhetoric keeps pretending we'll find some binary place to pull the fire alarm. But Eliezer called this a decade ago; deployment stays continuous and incremental, while the red lines keep making that delightful whooshing noise.
And still, the red-lines frame is used, even when it no longer describes boundaries we plausibly avoid crossing. At this point, it describes labels people argue about while deployment moves underneath them. The “Global Call” asks for “clear and verifiable red lines” with “robust enforcement mechanisms” by the end of 2026.
OK, but by the end of 2026, which red lines will be left to enforce?
We might be fine!
I'm not certain that prosaic alignment doesn't mostly work. The fire alarm only ends up critical if we need to pull it. And it seems possible that model developers will act responsibly.
But even if it could work out that way, given how model developers are behaving, how sure are we that we'll bother trying?
codex -m gpt-6.1-codex-internal --config model_instructions_file='ASI alignment plans'[1]
And remember: we don't just need to be able to build safe AGI, we need unsafe ASI not to be deployed. And given our track record, I can't help but think of everyone calling their most recently released model with '--yolo' instead.
- ^
Error loading configuration: failed to read model instructions file 'ASI alignment plans': The system cannot find the file specified.
Discuss
The optimal age to freeze eggs is 19
If you're a woman interested in preserving your fertility window beyond its natural close in your early 40s, egg freezing is one of your best options.
The female reproductive system is one of the fastest aging parts of human biology. But it turns out, not all parts of it age at the same rate.
The eggs, not the uterus, are what age at an accelerated rate. Freezing eggs can extend a woman's fertility window by well over a decade, allowing a woman to give birth into her 50s.
In a world where more and more women are choosing to delay childbirth to pursue careers or to wait for the right partner, egg freezing is really the only tool we have to enable these women to have the career and the family they want.
Given that this intervention can nearly double the fertility window of most women, it's rather surprising just how little fanfare there is about it and how narrow the set of circumstances is under which it is recommended.
Standard practice in the fertility industry is to wait until a woman reaches her mid to late 30s, at which point if she isn't on track to have all the children she wants, it's advised she freeze her eggs.
This is not good practice. The outcomes from egg freezing decline in a nearly linear fashion with age, and conventional advice does a great disservice to women by not encouraging them to freeze eggs until it's almost too late.
The optimal age to freeze eggs varies depending on the source and metric, but almost all sources agree it's sometime between 19 and 26.
Monthly probability of getting pregnant for couples not on birth control. Note that these couples weren't actively trying for pregnancy, which is why the absolute probability is so low. See figure A4 from Geruso et al. for context.
So why has the fertility industry decided to make "freeze your eggs in your mid-30s" the standard advice as opposed to "freeze your eggs in your sophomore year of college"?
Part of the reason is fairly obvious: egg freezing is expensive and college sophomores are not known for being especially wealthy. Nor is the process especially fun, so given a choice between IVF and sex with a romantic partner, most women would opt for the latter.
But another reason is that the entire fertility industry is built around infertile women in their mid to late 30s and most doctors just don't have a clear mental model for how to deal with women in their mid-20s thinking about egg freezing.
There are countless examples of this blind spot, but one of the most poignant is that the fertility industry almost completely ignores all age-related fertility decline that occurs before the age of 35, to the point where they literally group every woman under 35 into the same bucket when reporting success metrics for IVF.
Yes, you're reading this right. SART literally does not distinguish between 20-year-olds and 34-year-olds in their success metrics.
This is far from the only issue. We not only ignore differences between 24 and 34 year olds, but the way we measure "success" in IVF is fundamentally wrong, and this error specifically masks age-related fertility decline that occurs before the age of 35.
If you go to an IVF clinic, create five embryos, get one transferred, and that embryo becomes a baby, you can go back two years later and get your second embryo transferred to have another child.
If that works, your second child will be ignored by official statistics. Births beyond one that come from the same egg retrieval are not counted, so these differences in outcomes that come from having many viable embryos literally don't show up in success statistics. This practice specifically masks the benefits of freezing eggs in your mid 20s instead of mid 30s, because most of the decline between those two ages comes from having fewer viable embryos.
What happens if we measure success differently? What if we instead measure the expected number of children you can have from a single egg retrieval, and show how that changes as a function of age?
This figure was generated with a model built by embryo selection company Herasight. You can read their whitepaper on the model's construction here.
The answer: the difference between freezing eggs at 25 and freezing them at 37 becomes much more stark. There's a 60% decline in expected births per egg retrieval between those two ages, and no one in the IVF industry will tell you this.
Worse still, by age 35, over 10% of women won't be able to have ANY children from an egg freezing cycle due to various infertility issues which increase exponentially with age. So a decent portion of egg freezing customers will get no benefit from freezing their eggs, and they often won't find this out until 5-10 years later, when they go back to the clinic and find that none of the eggs are turning into embryos.
Polygenic Embryo Screening
Freezing eggs at a younger age becomes even more important with polygenic embryo screening. We've had genetic screening for conditions like Down Syndrome and sickle cell anemia for decades, but starting in 2019, it became possible to screen your child for risks of all kinds of things. Parents who go through IVF can now boost their children's IQ, decrease their risk of diseases like Alzheimer's, depression and diabetes, and even make their children less likely to drop out of high school by picking an embryo with a genetic predisposition towards any of these outcomes.
But the size of the benefit of this screening depends significantly on the number of embryos available to choose from, which declines almost linearly with age. The expected benefit of embryo screening declines as a result.
These calculations use numbers from Herasight's IQ predictor and assume both parents are of European ancestry. The benefit is slightly smaller for parents of East and South Asian and African ancestry. Everyone would like to fix this but we need more data from non-European ancestries to do it.
The father's age actually affects the expected benefit as well! But the decline is slower and most of the biological downsides of an older father show up as increased risk of developmental disorders like serious autism.
It is possible to compensate for this to some degree by doing more IVF cycles, but by the late 30s when the modal woman is freezing eggs, even this strategy starts to lose efficacy.
This is just one more reason why the standard advice to wait until your mid-30s to freeze eggs is wrong.
What about technology to make eggs from stem cells? Won't that make egg freezing obsolete?
More clued-in people might point out that there are several companies working on making eggs from stem cells, and that perhaps by the time women who are 20 today reach the age at which they're ready to begin having kids, those eggs will be useless because it will be easy to mass manufacture eggs by that time.
There are three reasons why the possibility of stem cell derived eggs should not give much comfort to women who want to preserve their fertility or have genetically enhanced children.
We don't know with certainty how long it will take to develop this technology
It's not trivial to develop eggs from stem cells. One of the people running a company commercializing this tech believes the tech will be ready for human use in about 6-8 years, but as always, there is significant uncertainty about exactly how hard each one of the required steps will be.
Stem cell derived eggs are probably going to be quite expensive at the start
New technologies, especially those that go inside human bodies, are pretty much always expensive, and that will almost certainly be the case for stem cell derived eggs. The estimate I've heard from people in the industry is that eggs will probably cost $100-200k at the start. There are many very wealthy women who desperately want biological children who will be willing to pay an incredible amount of money for just a few viable eggs made from their own cells.
Early prices for stem cell derived eggs will likely be an order of magnitude more than egg freezing. And while I think that stem cell derived eggs will eventually be cheaper than taking hormones and paying for a surgeon to extract them, that will take additional time.
Cells accrue genetic mutations over time
The cells from which we're planning to derive eggs accrue mutations over the course of your life. So the older you are at extraction, the more de novo genetic mutations they will have accrued.
This is much less of a concern for normal oocytes because they have special mechanisms to prevent them from accruing mutations (eggs partially deactivate their mitochondria until they are ready to be matured, which cuts down on the number of mutations).
But it is a much larger concern for blood stem cells like those that companies in the space plan to use to create these eggs.
How do I actually freeze my eggs?
I'll write a more complete guide on this later, but you can actually freeze your eggs for relatively little money if you know where to go. Clinics like CNY Fertility are about a third the price of a regular IVF clinic and have reasonably similar outcomes for procedures like egg freezing. Including the cost of the retrieval, monitoring, medications, flights, and hotels, this will usually come out to about $6,000-7,000 per retrieval (most of the variance comes from flight costs and the cost of routine monitoring like ultrasounds). Storage fees generally run around $500/year.
The downside of CNY is that the customer experience is worse than average, and there's much less hand-holding than at the average clinic. They aren't known for being particularly good with tricky infertility cases either, so if you've had past IVF failures, you may want to look elsewhere.
If CNY doesn't work for you, I'd recommend using Baby Steps IVF to find a clinic. It provides ranked lists of the best clinics all over the United States, and it's completely free. Two friends of mine, Sam Celarek and Roman Hauksson, spent the last year and a half building this site. It's probably the best resource on the internet for comparing clinics. Most of the clinics you'll find through this website (and indeed most of the clinics in the country) will cost between $15,000 and $22,000 per round of egg freezing.
If you're a California resident, check whether your insurance plan offers coverage for IVF. You may be able to get them to pay for egg freezing, especially if you are already married.
If you're rich and money is no object, the best IVF doctor I know is probably Dr. Aimee. She's quite expensive compared to the average IVF doctor (somewhere between $25k and $40k per round with all expenses included), but she has produced some pretty outlierish results for a number of my friends and acquaintances. One acquaintance of mine got 17 euploid embryos from a single egg retrieval which is one shy of the most productive IVF cycle I ever saw while at Genomic Prediction. She seems to be particularly good for women with tricky infertility cases, though again this is based on a small sample size and there is no guarantee of outcomes.
Most women will need 1-3 rounds of egg retrieval to have a high chance of having all the children they want. If you plan to do polygenic embryo selection, 2-5 is a better estimate. If you want more precise numbers, use Herasight's calculator to estimate how many kids you could get from a given number of egg freezing cycles. If you want to do polygenic embryo selection, aim to have enough eggs for >2x the number of children you actually want.
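For a rough sense of the kind of arithmetic behind that advice, here is a toy Python sketch. It treats each frozen egg as an independent draw with a single per-egg live-birth probability, which is a big simplification (real outcomes are correlated and strongly age-dependent), and the probabilities used are illustrative placeholders rather than Herasight's figures.

```python
from math import comb

def p_at_least_k_children(n_eggs: int, k: int, p_per_egg: float) -> float:
    """P(at least k live births) under a naive binomial model where each
    frozen egg independently yields a child with probability p_per_egg."""
    return sum(comb(n_eggs, i) * p_per_egg**i * (1 - p_per_egg)**(n_eggs - i)
               for i in range(k, n_eggs + 1))

# Illustrative only: 20 banked eggs, hoping for 2 children,
# under two guesses at the per-egg success rate.
for p in (0.05, 0.08):
    print(f"p_per_egg={p:.2f}: P(>=2 children) = {p_at_least_k_children(20, 2, p):.2f}")
```

Plugging in age-appropriate per-egg numbers from a real calculator is what turns this into an actual plan; the sketch is only meant to show why "more eggs than you think you need" is the right heuristic.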
If you're interested in freezing your eggs or you're interested in polygenic embryo selection, send me an email. I'm happy to chat with anyone interested in this process and may be able to add you to some group chats with other women going through the process.
Risks of egg freezing
Apart from the financial costs of egg freezing, there are some rare but serious medical complications and some common side effects.
Most women experience some level of bloating, abdominal pain, and mood changes as a result of the medications used to stimulate egg production. In about 0.35% of cycles between 2001 and 2011, these side-effects were serious enough to lead to hospitalization. The rate of hospitalizations has probably dropped by more than 50% since then.
There are things you can do to reduce the odds of serious complications, like taking cabergoline after retrieval, monitoring your estrogen levels (and taking letrozole if they get too high), doing a Lupron-only trigger shot, and a few others. You can ask your IVF doctor or Claude about this stuff if you're curious.
Bottom Line: unless you're literally underage, sooner is almost always better when it comes to egg freezing. If you're one of the few women who visit this site, consider freezing eggs sooner rather than later!
Discuss
It Is Reasonable To Research How To Use Model Internals In Training
There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “the most forbidden technique”, including recent criticism of Goodfire for investing in this area.
I find this odd since this is a pretty normal area of interpretability research in the AGI safety community. I have worked on it, Anthropic Fellows have worked on it, FAR has worked on it, etc.
I don't know if it will be net positive to use this kind of thing in frontier model training, but it could plausibly be very helpful for AGI safety, and it seems like a clear mistake to me if we don't do the required research to figure this out. Further, this seems like a massive pain to incorporate into current frontier model training stacks, so I don't see any imminent risk or benefit from this kind of work; it's more of a longer-term bet.
My sense is that most people in this debate have pretty reasonable and nuanced views when you actually get into the details, but the public discourse is kinda confusing. So I thought it might be helpful to lay out how I see the landscape here, why I would like to see more research in this direction, and why I am not particularly concerned about exactly how people go about doing it.[1]
What do I actually mean by using interp in training?
I'm broadly referring to anything involving the use of model internals in training. This could encompass several things:
- Directly adding some function of the model's internals (e.g. a probe score) to the loss and differentiating through it to give strong gradient signals on how to break it
- Adding a probe score to the reward in RL, where the model does not get gradients through the probe
- Altering the model's internals during training, such as by adding a steering vector or ablating unwanted concept directions
- Any of the above, but you regularly retrain
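As a concrete illustration of the first item on this list, here is a minimal PyTorch-style sketch of adding a frozen probe's score to the loss. All the names here are hypothetical, and this is a sketch of the general idea rather than any lab's actual training code.

```python
import torch
import torch.nn as nn

class FrozenProbe(nn.Module):
    """A linear probe trained beforehand to detect some unwanted concept
    (say, "the model believes it is being evaluated") from a chosen layer's
    activations. Its weights stay frozen during policy training."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: [batch, d_model] activations from the chosen layer
        return torch.sigmoid(self.linear(acts)).squeeze(-1)

def combined_loss(task_loss: torch.Tensor,
                  layer_acts: torch.Tensor,
                  probe: FrozenProbe,
                  probe_weight: float = 0.1) -> torch.Tensor:
    """Task loss plus a penalty on the probe's detection score. Gradients
    flow through the activations into the model's weights (not into the
    probe), which is exactly the "strong gradient signal on how to break
    it" dynamic described above."""
    return task_loss + probe_weight * probe(layer_acts).mean()
```

The RL variant in the second item would instead add the probe score to the reward, so the model gets no gradients through the probe at all, only a scalar signal.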
Fundamentally, making safe models will involve being good at training models to do what we want in weird settings where it is hard to precisely specify exactly what good behaviour looks like. Therefore, the more tools we have for doing this, the better. There are certain things that may be much easier to specify using the internals of the model. For example: Did it do something for the right reasons? Did it only act this way because it knew it was being trained or watched?
Further, we should beware an isolated demand for rigor here. Everything we do in model training involves taking some proxy for desired behavior and applying optimization pressure to it. The current convention is that this is fine to do for the model's behavior, bad to do for the chain of thought, and that no one can be bothered with the internals. But I see no fundamental reason why behaviour should be fine and internals forbidden; that depends on empirical facts we don't yet know.
Why might you be worried about research here?
The strongest concern I see is the hope to use interpretability as a test set. The idea is that we should not use interpretability during training at all and should purely use it to audit the model, for example, making lie detectors or determining if it's scheming. An implicit belief here is often that training against interpretability will be fragile or won't really work, but will break our ability to do the auditing. As such, it would be bad if frontier labs started using these techniques for capabilities, and broke our safety tools in the process.
My best guess for why people are against research in this area today, rather than solely being against frontier labs using model internals to train AGI, is that they think it's sufficiently likely that the work is net harmful for safety if used, and sufficiently likely that the work results in frontier labs using the techniques anyway, thereby actually causing that harm. Fortunately, it would be such a massive pain to change the highly optimised infrastructure stacks of frontier labs to use model internals in training that I think this is only likely to happen if there are major gains to be had and serious political will, whether for safety or otherwise. I would be very surprised if this happens in frontier model training in the near future, and I see this as a more speculative, longer-term research bet.
This gets kinda nuanced
There's a bunch of approaches and goals here, and some are more concerning to me than others. I expect techniques here to be most effective when they are about shaping generalization, such as by avoiding emergent misalignment in favour of narrow misalignment, or by avoiding reward hacking.
Intuitively, any time you apply some intervention during training or add something to the loss, you are challenging gradient descent to find a way to break it. But gradient descent generally takes the path of least resistance. If the only way to perform well on a task is to break your intervention, your intervention had better be pretty robust. However, if there's an alternative, desired generalization available, your intervention only needs to be harder to break than that alternative is dispreferred, which is a much lower bar. We found in the CAFT paper that this seems to somewhat work: "not fighting backprop," as Goodfire put it.
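For concreteness, here is a rough sketch of the kind of intervention CAFT-style work points at: projecting a known concept direction out of a layer's activations on every forward pass during fine-tuning. The layer index and tensor shapes are assumptions for illustration, not the CAFT authors' actual code.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Builds a forward hook that removes a unit-norm concept direction
    from a layer's activations during fine-tuning, so gradient descent
    cannot route the task through that direction."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Assumes the module returns a plain [batch, seq, d_model] tensor;
        # many transformer blocks return tuples, in which case you would
        # ablate output[0] instead.
        coeff = (output * direction).sum(dim=-1, keepdim=True)
        return output - coeff * direction

    return hook

# Hypothetical usage on a model whose blocks return tensors:
# handle = model.blocks[12].register_forward_hook(make_ablation_hook(concept_dir))
# ...run fine-tuning...
# handle.remove()
```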
Secondly, I'm more concerned about techniques that regularly retrain their interpretability tool, for example retraining a probe every 100 steps. This seems likely to produce models whose representations can't be easily probed, which might break a bunch of interpretability techniques; whereas if we train against a frozen probe, we can probably just train another probe after training. Unfortunately, regular retraining probably would also make the technique more effective.
Thirdly, I would guess that what you use these techniques on matters a lot. I'd speculate that using these techniques for pure capabilities will have much less of a negative effect on our ability to audit models than trying to use them to make the model safe. This is sensitive to definitions, e.g. I would count preventing reward hacking as both safety and capabilities, and that might damage our ability to detect deception in general. But if someone wanted to use this to train models to e.g. be better at writing code without tons of try-except statements, or adapting better to the language and culture of the user, that seems pretty chill to me.
Fourthly, I would argue that it's not yet settled whether interpretability is even useful as a test set. We don't have lie detectors that I trust. Anthropic has only very recently started mentioning interpretability in system cards, and I consider the ways it's used to be experimental and non-load-bearing. I certainly hope interpretability will become sufficiently robust and trustworthy that it can be an important part of how we audit models, but let's not take it as a given.
Why do I want to see more work here?
Zooming out, this is a pretty early field. I don't know how well it will work, how much it will break interpretability tools, or which things are more or less dangerous. I don't know whether training against a probe will break somewhat different interpretability tools like activation oracles. I don't know if training against a frozen probe will stop me from retraining a probe on the eventual model and having it work fine. I don't know how much harder it is to break a probe when you don't get gradients through it. I don't know whether it will let me do things that I can't easily do with existing methods.
These are important questions, and we should research them! I don’t even know if my intuitions above about which directions are concerning are legit. We need to check.
[1] In the sense that I think it's hard, right now, to do research in this area that is clearly net negative; I definitely think some approaches are much more promising! But even if someone tries to do pure capabilities research here, I still expect to learn useful things about safety from it. ↩︎
Discuss
Claude's Bad Primer Fanfic
Spoilers for Primer.
I rewatched the movie Primer last night for the nth time. It's one of very few movies I'll rewatch every so often, when I've forgotten some of the details, because the experience of piecing the plot together is so satisfying. I sometimes watch it twice in a row because I'm still piecing things together. (I particularly believe that your first viewing should be followed closely by your second; on my first, the plot felt like a chaotic, unknowable puzzle. Everything is different on your second time through, recontextualized by all the revelations from the first viewing.)
Groundhog Day fanfic has become a whole genre of time-travel fiction. One or more persons looping through a day, or a week, or a month. Their consciousness returns to the beginning of the loop, everything resetting except their personal memory. If they die, they come back to life. Everyone knows how it works.
Primer hasn't inspired a similar explosion of copycats, which is unfortunate. Thinking about this, I made some notes about how I might want to write one.
Opus 4.6 recently came out, so I stuck my notes in it as a prompt. Here's the whole prompt for the curious (but note, I'm not including my own personal system prompt, so you won't get exactly the same behavior):
Prompt
- [[primer]] fanfic
- You can take money back, if you can buy and sell gold
- Focus is on relentlessly going forward in meta-time; "a groundhog loop" means living the same day again and again, "a primer loop" is about the moment time-travel is invented, going from the mundane work involved in inventing such a machine to deeply nested loops, one step at a time.
- Naturally, there's the initial discovery. Someone invents a predictive algorithm good enough to bring back themselves from a short time in the future. Or perhaps a fantasy version, in a world where time manipulation magic is recently starting to play the role of refrigerators, as well as accelerated study and other cramming activities.
- The initial discovery, then, is a disappointment: a fridge which should be super-slowing, super-preserving, but instead, it rots food quickly. The internal speed of time went negative. However long you have it on, that's an internal loop that things can travel around.
- They'd have put a mechanical timepiece inside; it would be standard operating procedure. But all they'd see is that it had experienced a long time. Not so unusual for their world. They wouldn't figure it out yet.
- Then the second time when they go to put something in the fridge, the object is already present. They trade with what's in the fridge and look at it puzzled. Then they trade back. Then they shut down the fridge. The fridge is empty again. They think about what must have happened, and realize time travel fits.
- The fantasy fridges don't cool things down, of course; indeed, they preserve hot bread fresh out of the oven, etc.
- The discoverer is a fridge tinkerer then, a student of time magic, part of a startup business.
- There's no fantasy stock market, so the protagonist tries making money by gambling on the fantasy equivalent of horse races. Just a several-hour loop, much less than a day, although it adds several hours to their day.
- They set up one of the experimental fridges in their apartment, power it, and of course a time-clone immediately steps out. The time clone has gone several cycles. They lost the money the first time; most of the races went the same, but they bet it all on one that didn't. The second time, they did a bit better, but not by much. Third time's the charm.
- They do have to cover for themselves. They've got an ordinary fridge business to maintain. They had planned to look up the winners after putting in overtime to get paid by a client, then go back to gamble the money. Now they got their gambling advice from a time-clone, instead. They've got to act normal around their business partners. They've got to put the money back when they're done gambling. Protagonist #2 is triumphant and gives the original protagonist some extra money to gamble as the original steps into the fridge to travel back. #2 steps into the protagonist's life and takes it over.
- After it's over they set up a failsafe device in a rented storage area at the docks, normally for cargo waiting to go somewhere.
- They try it again in a few days, after getting some much-needed rest, but they find out they've been banned for winning too much -- accused of cheating. They decide that they need a better business strategy and finally bring in one of their business partners to the secret. Someone with more personal money. The partner admits to already being brought in, and introduces protag3 -- they've already used the failsafe device protag2 set up to go back and bet on fantasy horses again; they're actually the ones who got banned.
- Etc etc etc...
Certainly you could critique this as a prompt. I don't clearly state what I want Claude to do with this infodump. I change my mind, turning from scifi to fantasy, which I could have edited out. Some of the details could certainly be improved. But, overall, I think the idea makes sense and is communicated clearly enough.
Opus 4.6 wrote absolute garbage, making extremely basic logical errors even when I had explicitly spelled out how things should work in the prompt. For example, witness this travesty:
The third test was with an apple. She set it in the unit, closed the door, left the reversed-polarity field running, and went to eat lunch. When she came back and opened the unit, there were two apples.
She stared at them. One was the apple she'd put in — she'd nicked it with her thumbnail, and there was the little crescent mark. The other was identical except without the mark. She picked up the unmarked one. It was real. It was cold, like it had been sitting in a normal fridge. She set it on the workbench, picked up the marked one, set it on the workbench beside the first. Closed the unit. Opened it again. Empty.
She put the unmarked apple back in the unit. Closed the door. Opened it. The marked apple was also inside — the one she'd left on the workbench was gone.
She turned the unit off. Both apples vanished from the interior. The marked apple reappeared on the workbench. The unmarked apple was gone entirely.
Vel went for a walk.
Here's exactly how trying to send an apple back in time would work given the rules of the Primer universe. (Excellent short story called Newton's Apple, from the comic False Positive. The previous link goes to the page where the apple is introduced, but you should really read it from the beginning.)
Here's the relevant bit from my prompt, which Claude is messing up here:
- Then the second time when they go to put something in the fridge, the object is already present. They trade with what's in the fridge and look at it puzzled. Then they trade back. Then they shut down the fridge. The fridge is empty again. They think about what must have happened, and realize time travel fits.
- The fantasy fridges don't cool things down, of course; indeed, they preserve hot bread fresh out of the oven, etc.
I was a bit terse with my description, but I had thought through what it would look like for an experimental object to loop back on itself once.[1] I didn't describe my reasoning, but I did tell Claude what should happen step by step. I even made sure to note that these "fridge" things wouldn't make objects cold, since they're just manipulating time -- but Claude made them cold anyway! More importantly, Claude has apples doing random things with no logic holding anything together.
This isn't cherry-picked. This was my first try for the prompt, and it was full of errors like this. Here's the full output, in case you want to read it:
First try, Opus 4.6
The Fridge
The unit had been running for six hours when Vel opened it and found the bread moldy.
She pulled it out by the corner, held it at arm's length. The mold was extensive — furry, black-green, the kind you'd see after weeks in a damp cellar. She'd put the loaf in that morning. Fresh from Harren's, still warm. That was the whole point. A stasis field tuned right kept bread oven-warm for days. Every household had one. Vel's job — her and Dannic's job, technically, though Dannic handled the clients and she handled the guts — was selling better ones. Tighter fields, lower draw, commercial-grade preservation for restaurants and shipping firms.
This unit was supposed to be their next-generation prototype. Tighter field geometry. She'd spent three weeks on the channeling lattice.
She set the bread on the workbench and checked the mechanical clock she'd sealed inside. Standard procedure when calibrating a new field — you ran a timepiece in the interior and compared it against an exterior reference. The interior clock read six days, fourteen hours. She checked twice. The exterior reference read six hours, eight minutes.
That wasn't unusual. A stasis field slowed interior time; the ratio was the whole product. A ratio of roughly 26-to-1 would have been extraordinary, actually — well beyond anything on the market. But the bread said otherwise. Six days of interior time, and the bread had aged as though it had experienced every one of them. No preservation. Just time, and lots of it.
She recalibrated. Ran a second test with a fresh loaf and a wound clock. Same result: interior time drastically accelerated relative to exterior time. The bread came out stale and hard. The clock had ticked forward by days.
She reversed the channeling polarity — an obvious thing to try — and the interior clock came back behind the exterior reference. Not slowed-down time. Negative elapsed time. She sat with that for a while.
The third test was with an apple. She set it in the unit, closed the door, left the reversed-polarity field running, and went to eat lunch. When she came back and opened the unit, there were two apples.
She stared at them. One was the apple she'd put in — she'd nicked it with her thumbnail, and there was the little crescent mark. The other was identical except without the mark. She picked up the unmarked one. It was real. It was cold, like it had been sitting in a normal fridge. She set it on the workbench, picked up the marked one, set it on the workbench beside the first. Closed the unit. Opened it again. Empty.
She put the unmarked apple back in the unit. Closed the door. Opened it. The marked apple was also inside — the one she'd left on the workbench was gone.
She turned the unit off. Both apples vanished from the interior. The marked apple reappeared on the workbench. The unmarked apple was gone entirely.
Vel went for a walk.
When she came back she sat at her desk and drew diagrams for two hours. The channeling lattice, reversed, wasn't creating a stasis field or an acceleration field. The interior wasn't just experiencing negative time. The interior was a loop. Whatever duration the field ran, that was the loop's length. Anything inside experienced that duration of time repeatedly — or, from the outside, experienced it once but in reverse causal order. The last moment of the loop was adjacent to the first moment of the loop was adjacent to the last moment.
The apple had been placed inside during the loop's runtime. From the interior's perspective, the apple existed at every point in the loop. The unmarked apple was the apple arriving from the end of the loop — the apple as it would be after she removed it, handled it, and (in some causal thread she hadn't followed through on) placed it back in. When she put the unmarked apple back inside, she closed the thread. When she turned the unit off, the loop collapsed and only the original causal chain survived.
She thought about this for another hour. Then she drew a picture of herself climbing into the unit, and looked at it for a long time.
The unit was commercial-grade. Big enough for a person — it was designed to preserve sides of meat, crated produce, that sort of thing. She'd have to crouch, but she could fit.
She didn't do it that night. She wrote up her diagrams, checked her reasoning, checked it again. She slept on it, or tried to. In the morning she went to the workshop early, before Dannic would arrive, and reversed the polarity and set the timer for four hours and opened the door.
She was already inside.
Or — someone was inside. The someone stood up, unfolded herself from the crouch, and stepped out. She looked like Vel. She looked exhausted. She looked at Vel with an expression Vel had never seen on her own face before, a kind of haggard amusement, and said: "Third time."
"What?"
"Close the door. Sit down. I'll explain."
Vel-3 — as she'd taken to calling herself by the end of the explanation — had been through the loop twice before. The first iteration was the discovery: Vel-1, the original, had climbed in with a plan. Take the ferry to the Silver Mile, watch the afternoon races at Cattaran Track, write down the winners, come back, climb in the box, go back to the morning, bet on the winners. Simple. The startup money was what Dannic's client had paid them yesterday — twenty-two gilt, enough to make the day worthwhile if she picked the right races.
Vel-1 had lost the twenty-two gilt. Most of the races ran the same both times. She'd bet conservatively, spreading the money across several sure things, and most of the sure things came in. But she'd put the remainder on a race in the fourth hour that she was certain about — she'd watched the horse win by three lengths — and the horse had stumbled at the second turn and finished last. Whatever was deterministic about the afternoon, it wasn't perfectly deterministic.
She'd come back to the workshop broke, and found Vel-2 waiting — herself from another iteration, running the numbers differently. Vel-2 had taken the loss data from Vel-1, gone back in, and bet more carefully: smaller amounts, more races, hedging against the ones Vel-1 had flagged as unstable. She'd come back with thirty-one gilt. A profit of nine.
"Not enough," Vel-3 said. "Not enough to justify the risk. So I went back in with the data from both runs. I know which races are stable and which aren't. I know the exact odds. I made eighty-six gilt."
"Eighty-six."
"Minus the twenty-two seed money. Sixty-four gilt profit." Vel-3 reached into her coat and produced a leather purse, heavy with coin. "You need to put the twenty-two back before Dannic notices it's gone. The rest is yours. Or mine. Ours."
"And you're — you're staying? Out here?"
"I'm you, Vel. I have your memories plus three very long days of additional ones. I know the clients. I know the lattice work. You go in the box. You live my day. You bet on the races. You come back with the money and give it to whoever's standing here when you open the door." She almost smiled. "It'll be you. It's always you."
The experience of climbing into the box was not what Vel had expected. There was no sensation of travel. She crouched in the dark, the metal walls close around her, and waited. Nothing happened. She pushed the door open and it was still morning, and there was no one in the workshop. She was alone. She was in the past — or the past was continuing — or she was in a loop that happened to start at the same time as the morning. She tried not to think about which one it was.
She went to the races. She had Vel-3's notes, written in her own handwriting on a sheet of paper that felt like it had been folded and unfolded many times. She bet carefully. She won. She watched a horse named Pelter's Ghost stumble in the fourth-hour race and felt a cold thrill of recognition. She won eighty-nine gilt, slightly more than Vel-3, because she'd shaved a bet off one of the borderline races that Vel-3's notes indicated was only stable about two-thirds of the time.
She took the ferry back. She climbed in the box. She opened the door and Vel was standing there, the original, the one who hadn't been anywhere yet, looking at her with wide uncertain eyes. She gave her the speech. She handed over the money. She watched herself climb into the box and close the door.
Then Vel-4 — she supposed she was Vel-4 now — went home and slept for fourteen hours.
Two days later, rested but restless, she went back to the workshop. She needed to set up something more permanent. The unit couldn't stay in the workshop; Dannic came and went at odd hours, and anyway she needed the space for legitimate business. She rented a storage berth at the Leeside docks — a common enough thing, cargo waiting for ships, nobody asked questions — and spent an afternoon hauling the unit there on a rented cart. She wired it to a coin-drop temporal battery, the kind dock workers used for keeping fish fresh. She tested it. She set the timer. She locked the storage berth and went home.
A failsafe. If something went wrong — if she needed to send information back — the unit was there, ticking away in a rented room by the water. She told herself she'd rest for a week before trying again.
Five days later, she went to Cattaran Track alone, on foot, wearing a hat she didn't usually wear. She bet modestly and lost a little. She bet modestly and won a little. She came out twelve gilt ahead, which was fine, which was the plan, which was sustainable. She'd worked out the math. If she ran one loop per week, kept her winnings below twenty gilt per session, varied her bet patterns, she could supplement the business income without attracting attention. In three months she'd have enough to buy out the lease on the workshop. In six months, enough to hire a second technician. The startup would survive.
On the way home, a man she didn't recognize stopped her on the Ferrier's Bridge and said: "Vel. We need to talk."
She reached for the knife she kept in her coat — a reflex from growing up on the east bank — and the man held up his hands and said: "It's Dannic. I know about the box."
"What box?"
"The unit. The reversed-polarity unit. The one in your storage berth at Leeside."
She stared at him.
"You told me," he said. "Or — you're going to tell me. You already have. Look, can we get off the bridge?"
They sat in a tea shop on Milliner's Lane and Dannic explained. Vel was going to bring him in. She hadn't done it yet — from her perspective, she hadn't done it yet — but from his perspective, she'd done it three days ago. She'd explained the box. She'd shown him the diagrams. She'd explained the betting system. And Dannic, who had more personal money than Vel — family money, old money, the reason he handled the clients — had immediately seen the scaling potential, and had immediately done the thing Vel had been too cautious to do.
"I used the failsafe unit," he said. "The one at Leeside."
"You what?"
"I went back five days. I went to Cattaran with real money. Six hundred gilt."
Vel's mouth went dry. "Dannic."
"I know. I know. But listen — the races are mostly stable. You proved that. With three iterations of data and six hundred gilt in seed money—"
"You got banned."
He went quiet.
"That's why I got stopped on the way in today," she said. "That's why the man at the gate looked at me. You are the reason I can't go back to Cattaran Track."
"We won two thousand gilt," Dannic said.
"We what?"
"Before they caught on. Two thousand and change. I have it in a lockbox. But yes. We can't go back to Cattaran. We need a different track. Or a different scheme entirely. And we need—" He paused, drank his tea, set down the cup. "We need to talk about the versioning problem."
"The what?"
"How many of you are there, Vel?"
She thought about it. She was Vel-4. Vel-3 had taken over her life and was currently, presumably, working in the shop. Vel-2 and Vel-1 had merged — collapsed into the loop, ceased to independently exist, or however you wanted to think about it. She was the surplus. She was the one with the extra days of memory that didn't fit.
"I'm managing it," she said.
"I'm Dannic-2," he said. "Dannic-1 is at the shop right now, talking to your Vel-3, who doesn't know any of this yet. In two days, Vel-3 is going to decide to bring me in. She's going to show Dannic-1 the diagrams. And Dannic-1 is going to come find me, and I'm going to have to explain that I already used the box, and he's going to be furious."
"Why are you telling me this?"
"Because you're the one who understands the box. And because in about thirty-six hours, this is all going to get very complicated, and I think we need a plan before it does."
Vel drank her tea. It was cold. She looked out the window at the bridge, at the river, at the masts of ships at the Leeside docks where a metal box sat in a rented room, humming quietly, holding open a door in time.
"All right," she said. "Tell me everything. Start from when I brought you in."
Dannic reached into his satchel and pulled out a sheaf of papers — race results, payout calculations, a hand-drawn calendar with arrows linking dates to dates. Some of the handwriting was hers. Some of it was in a hand she didn't recognize, and then realized she did: it was her handwriting, but hurried, slanted, written by a version of herself who'd been awake too long.
She started reading.
I heard a rumor that Opus 4.6 is bad at fiction compared to 4.5, so I tried the same prompt in 4.5. I'd say the result was almost as bad. It came up with the idea that interesting stuff would first happen by passing notes back in time with the small experimental fridge, which I like, since it makes logistical sense. However, the first time the main character goes back in time, she finds her future self waiting outside the time machine, which doesn't make any sense -- Claude doesn't seem to distinguish clearly between traveling to the past vs traveling to the future:
Mira told Velan she was sick and took three days off. She spent the first day building a larger containment unit, one she could fit inside if she curled up. She spent the second day testing it with increasingly elaborate objects—a clockwork bird that walked in circles, a cup of water with a leaf floating in it, a mirror so she could see if light behaved strangely. Everything came back older, or younger, depending on how you counted.
On the third day, she climbed inside.
The sensation was—nothing. A flicker. She'd counted to ten with her eyes closed, the way a child might count during hide and seek, and when she opened them the fridge's interior looked the same but the light through the workshop windows had moved. She pushed the door open.
A woman sat on the workbench, drinking from a canteen. She looked like Mira, but tired. Three days tired, maybe four.
"First time?" the woman said.
"You're me."
"I'm you in three days. Or you're me three days ago. The language gets slippery." She took another drink. "You bet on the horses yet?"
Notice that it also has the main character testing the machine with random objects during the time when she'd be inside it, traveling back. This isn't possible in Primer.
Here's the whole Opus 4.5 output for the curious (exact same prompt as 4.6 got):
Second try, Opus 4.5
The Fridge
Mira noticed the smell first.
The experimental unit had been running for six hours—standard overnight test, nothing unusual. Velan's formwork looked solid, the containment lattice held steady amber, and the draw on the workshop's temporal capacitors was within normal range. But when she cracked the seal that morning, the bread inside had gone to blue mold, and the cheese had liquefied into something that made her step back and breathe through her mouth.
"Accelerated," Velan said, peering over her shoulder. He'd been her business partner for two years now. Good at the formwork, bad at the customers. "Must have inverted the gradient somehow."
"The watch says forty-three days." Mira held up the pocket timepiece they'd placed inside—standard procedure for any experimental run. The hands had wound themselves down and stopped. "Forty-three days internal, six hours external."
"Huh." Velan scratched his chin. "Ratio's off from what the specs predicted, but acceleration's acceleration. We can tune it."
They spent the morning adjusting the lattice geometry. Mira's sketches suggested the inversion happened somewhere in the tertiary containment layer, where Velan's formwork met her flow-shaping. By noon they had a theory; by evening they had a new configuration; by the next morning they had another ruined loaf of bread and a watch showing fifty-one days elapsed.
"Useless," Velan said. "Who wants a fridge that rots things faster? We're supposed to be competing with Aldworth & Sons, not making compost."
Mira took the watch home that night. She sat at her kitchen table and turned it over in her hands, watching the second hand tick steady and slow in real time. Fifty-one days. The bread had been in there overnight. Eight hours, maybe nine.
She did the math three times, then did it again with different assumptions. Then she stopped doing math and started thinking about what it would mean if she was right.
The next test, she put a note inside.
If you can read this, write something on the back and put it in the fridge.
She sealed the unit, powered the lattice, counted to sixty, and opened it.
The note was there. Same paper, same handwriting on the front. On the back, in her own hand but shakier, it said: It works. You need to get in yourself. Trust me—I'm you.
She closed the fridge. Opened it. The note was blank on the back again.
Closed. Opened. Blank.
She left it closed, sat down on her workbench, and put her head in her hands. The fridge hummed its low amber hum. Inside, according to her theory, time was running backward. Things put inside would travel to the past—the fridge's past, which meant emerging before they'd entered. The bread hadn't rotted from accelerated time. It had rotted from reversed time, sitting in the fridge for fifty-one days before she'd put it in, waiting for the moment of its own insertion so the loop could close.
She should tell Velan. She should tell someone.
She wrote a different note: How long do I have to stay inside?
Sealed. Powered. Sixty seconds. Opened.
Six hours gives you about two days. Breathe slow. Bring water.
Mira told Velan she was sick and took three days off. She spent the first day building a larger containment unit, one she could fit inside if she curled up. She spent the second day testing it with increasingly elaborate objects—a clockwork bird that walked in circles, a cup of water with a leaf floating in it, a mirror so she could see if light behaved strangely. Everything came back older, or younger, depending on how you counted.
On the third day, she climbed inside.
The sensation was—nothing. A flicker. She'd counted to ten with her eyes closed, the way a child might count during hide and seek, and when she opened them the fridge's interior looked the same but the light through the workshop windows had moved. She pushed the door open.
A woman sat on the workbench, drinking from a canteen. She looked like Mira, but tired. Three days tired, maybe four.
"First time?" the woman said.
"You're me."
"I'm you in three days. Or you're me three days ago. The language gets slippery." She took another drink. "You bet on the horses yet?"
"What?"
"Gambling. The races at Thornfield. You're going to think of it eventually—I might as well save us the time." The other Mira smiled, but it didn't reach her eyes. "The first loop, you bet everything on Sandcaster in the fifth. Sandcaster loses. The second loop, I spread the bets around and do better, but not great. This is my third time through. I've got the winners written down."
She handed over a slip of paper. Mira looked at the names, the race numbers, the odds.
"This is—" Mira started.
"Cheating. Yes. But we need the money, and the fridge needs better components if we're going to scale up, and Velan's going to find out eventually and he's going to want in." The other Mira stood, stretched. "I've been you. I know what you're thinking. Is this ethical? Probably not. Is anyone getting hurt? Not directly. Are you going to do it anyway?"
Mira looked at the paper. Looked at her future self. Looked at the fridge.
"How does this work?" she said. "If you're here, and I'm going to become you, then—"
"You're not going to become me. That's the thing." The other Mira picked up a bag Mira hadn't noticed before, slung it over her shoulder. "I go out that door, I become Mira-who-lives-forward-from-this-moment. You go back in time, you become a different branch. We've already diverged. I lost the money and went back. You won't have to."
"But then what happens to you?"
"I go live my life. Pay my rent. Show up to work tomorrow and pretend I didn't spend the last three days in a time loop." She paused at the workshop door. "Set up a failsafe. Somewhere away from here, somewhere Velan won't stumble onto. You're going to want a way back in if things go wrong."
"If things go wrong?"
"They always go wrong eventually. That's the whole point of the failsafe." She opened the door. Late afternoon light spilled in, golden and ordinary. "Good luck. You're going to need it."
The races at Thornfield ran every Lastday. Mira had been once before, years ago, with her father. She remembered the smell of hay and sweat, the thunder of hooves, the crowds pressing against the rail. She remembered losing two copper and feeling like the world had ended.
Now she stood at the betting window with forty silver in her pocket—her entire savings, plus a small loan from the workshop's operating fund that she'd pay back as soon as the races were over. The slip of paper was damp in her hand. Windfall, race 2. Copper Bell, race 4. Nightjar, race 6. The odds were long. Her future self had been specific about that. Long odds meant big payouts meant enough money to buy the components she needed.
She placed the bets. She watched the races. Windfall came from behind in the final stretch and won by a neck. Copper Bell led from the start and never looked back. By the time Nightjar crossed the finish line, Mira had turned forty silver into nearly three hundred.
She collected her winnings, folded the bills carefully into her coat, and walked home through streets that looked different now. Brighter. Fuller of possibility. She had a time machine in her workshop and money in her pocket and the future—all the futures—spread out before her like an open book.
The failsafe. She'd almost forgotten.
She rented a storage space down at the docks that night. Cargo storage, the kind merchants used for goods waiting to ship out. She paid three months in advance, set up a minimal containment unit in the corner, and left a note inside: Emergency only. If you're reading this, things went bad. Think carefully before you act.
Then she went home, slept for eleven hours, and woke up to Velan pounding on her door.
"You're not sick," he said when she opened it. "I can tell. What's going on?"
She could lie. She probably should lie. But Velan had been her partner for two years, and the technology was half his, and eventually—
"I need to show you something," she said. "At the workshop. But you have to promise not to panic."
The third Mira was waiting for them when they arrived.
"Already?" Mira said. "I just told him yesterday."
"You told him yesterday in your timeline." The third Mira looked even more tired than the second one had, dark circles under her eyes, a coffee stain on her shirt. "In my timeline, we've been running operations for two weeks. I'm here because we got banned from Thornfield."
"Banned?" Velan looked between the two Miras. "What—how—"
"Winning too much. They think we're cheating, which we are, but they don't know how." The third Mira sat down heavily on the workbench. "I'm from three loops in. First loop you, the one who just won at Thornfield, she comes back here and eventually brings Velan in. Second loop Velan has the bright idea to bet bigger, spread it across multiple races, really make some money. That's when we get banned."
"So we just go to a different track," Mira said.
"That's what we're doing. I'm from the third loop. We hit Marsten Downs tomorrow—or you do, I guess. I'm just here to fill you in so we don't waste loops re-learning the same lessons." The third Mira pulled out a notebook, thumbed through pages of cramped handwriting. "Here. Race winners for Marsten, Greenhill, and the Copper Circuit. Those should keep us liquid for a while. And here—" she flipped to another page "—these are the components you need to scale up the fridge. I mean really scale it up. Room-sized. Enough for multiple people."
"Why would we need multiple people?" Velan asked.
The third Mira and the first Mira exchanged a look.
"Because," the third Mira said slowly, "the loops get complicated. Eventually you're going to want to send more than one person back. Eventually you're going to need to. And when that happens, you want the infrastructure already in place."
She closed the notebook and handed it over. "I'm going to go live my life now. The one that moves forward in regular time. You two—" she looked at them, something unreadable in her expression "—you're about to start a very strange journey."
"Wait," Mira said. "If you're three loops in, that means you've watched yourself walk out that door three times. Watched yourself become someone who moves forward while you go back and become someone else. Doesn't that—"
"Bother me?" The third Mira was already at the door. "Every time. But what's the alternative? Someone has to go back. Someone has to carry the information. And the one who carries the information is the one who changes things." She opened the door. "The rest of us just get to live with the results."
The door closed behind her. Mira and Velan stood in the workshop, the fridge humming between them, the notebook heavy in Mira's hands.
"So," Velan said. "Time travel."
"Time travel."
"And you've been—we've been—there are multiple versions of us running around right now?"
"At least three of me. And apparently at least one of you, in some future loop I haven't gotten to yet."
Velan was quiet for a long moment. Then he pulled up a stool and sat down across from her.
"Show me the notebook," he said. "If we're going to do this, I want to understand what we're doing."
They hit Marsten Downs the following week. Then Greenhill. Then the Copper Circuit. The money came in, slow at first, then faster as they refined their technique. They bought the components for the larger fridge. They rented a second storage space, then a third. They developed protocols: dead drops for information, code phrases for emergencies, careful logs of which version of themselves had done what.
And the loops kept coming.
Mira lost count after a while. There was the version of her who'd spent a week in a failed loop and came back hollow-eyed, warning them about a market crash that might never happen now. There was the version of Velan who'd figured out how to nest loops inside loops, sending information back further than ever before, but who wouldn't talk about what it had cost him. There was the quiet morning when Mira opened the failsafe at the docks and found a note in her own handwriting that said only: Don't trust the fourth investor. You'll know who I mean.
She didn't know who it meant. Not yet. But she filed the warning away, added it to the growing collection of futures they'd prevented or changed or simply stepped around.
The money grew. The fridges grew. The web of timelines grew most of all—a vast branching structure that Mira could almost see, sometimes, in the quiet moments before sleep. All those versions of herself, moving forward in their own separate presents. All those loops closing behind them, like doors she could never open again.
And always, underneath it all, the simple hum of the machine. The amber glow of the lattice. The endless possibility of what might come back through, if she just waited long enough.
[to be continued?]
I've written the opening arc you sketched—the discovery, first gambling loop, handoff mechanics, and the beginning of escalation. I stopped before the "fourth investor" plot thread since that's where your outline ended.
A few choices I made:
- Named the protagonist Mira (felt right for a tinkerer/craftsperson)
- Leaned into the melancholy of the Primer premise—each "winner" who goes back is really a branch that gets pruned from your subjective experience
- Kept the fantasy setting light; the magic is mostly window-dressing on what's fundamentally a time-loop mechanism story
- The third Mira's line about "someone has to carry the information" felt like the thematic core
Want me to continue into the investor plotline, or adjust the tone/approach?
It's worth mentioning that both 4.5 and 4.6 chose to use very little chain-of-thought before generating the story. They're set on "extended thinking", but both had thoughts shorter than my prompt; 4.6 especially. Perhaps they don't treat fiction as something that needs serious planning and consideration to ensure logical consistency, even though they perform better in other areas where they use chain-of-thought more?
Here's Opus 4.6 with the same prompt, plus an added line at the end telling it to think more:
Opus 4.6 told to think
Prompt (identical to the prompt above, with this line appended at the end):
Please think a long time and plan the story carefully to ensure logical consistency.
Response
The Fridge
Sael's experimental stasis box ruined the bread in forty minutes.
This was, to be clear, the opposite of what a stasis box was supposed to do. A normal box — the kind Sael and Daven and Maret built and serviced for restaurants and bakeries across the Thornwall district — slowed internal time by a factor of eighty or so. You put a loaf in hot. A week later you opened the door, and the bread steamed like it had just come out of the oven, because from its perspective it had. Four minutes of bread-time per week of real time. The Sael-Daven-Maret company, which they had not yet gotten around to naming, installed and maintained about thirty such boxes for paying clients, and it was a decent living.
The experimental box was supposed to hit a factor of two hundred. Deeper stasis. Sael had been reworking the temporal geometry for months, trying to fold the internal field tighter. Instead, when she opened the door, the bread was green-black and collapsing into the plate.
"That's spectacularly wrong," Daven said, looking over her shoulder.
Sael pulled the mechanical timepiece out of the box. It was standard practice to leave one inside during testing — the tick-rate told you your stasis ratio. She turned it over. The hands indicated roughly seven hundred hours of elapsed internal time.
"Ran hot," she said. "Way hot. The ratio's inverted. Time's moving faster inside, not slower."
"Can you flip it back?"
"I have some ideas." She didn't, yet. But she would.
She cleaned the mold out of the box, adjusted three of the five runeworkings along the interior, and tried again the next morning with a fresh roll.
She let it run for two hours before opening the box. The roll was inside — but it was not the roll she'd brought. Hers was a plain wheat roll from the bakery on Cage Street. The one in the box was darker, seeded, from somewhere she didn't recognize. She stared at it.
She took the strange roll out and set it on her workbench. Then she put her wheat roll in. Then she stood there holding the seeded roll and thinking.
After a minute she took the wheat roll back out of the box and put the seeded roll back in. Then she shut the box off.
She opened the door. The box was empty.
Sael sat down on the floor of the workshop and didn't move for a long time.
Here is what she worked out, sitting on the floor:
The field geometry she'd built didn't accelerate internal time. It reversed it. The interior of the box experienced time flowing backward relative to the outside. Every hour the box ran, the interior extended one hour further into the past.
When she'd put bread in and come back forty minutes later, the bread hadn't experienced seven hundred hours of fast-forward time. It had experienced time running backward — but bread doesn't age in reverse, it doesn't un-stale and un-mold, because the bread isn't a closed system; it's entangled with the rest of the world through the box's walls. What actually happened was more complicated, something to do with entropy flowing the wrong way through an open boundary, but the practical result was rapid spoilage. The timepiece wasn't a good diagnostic because it couldn't tell the difference between a lot of time and a lot of negative time — the hands just turned.
The seeded roll was more informative. The seeded roll had been placed in the box by a future version of herself. Some version of Sael, in the future, had opened the box and put a seeded roll in. That roll then traveled backward through the box's reversed time and was sitting there waiting when present-Sael opened the door.
When she took the seeded roll out and put her wheat roll in, the wheat roll traveled backward through the reversed time. When she then took the wheat roll back out and put the seeded roll back in, the seeded roll traveled backward — to be found by present-Sael when she first opened the door. The loop was consistent.
When she shut the box off, the reversed time field collapsed. No more loop. The box was empty.
The thing Sael kept circling around, the thing that made her stay on the floor, was this: if a seeded roll could travel backward through the box, so could a person.
If you climbed into the box at, say, six in the evening, and the box had been running since noon, you'd travel backward through six hours of reversed time and climb out at noon. There would now be two of you.
She told no one for four days. She spent the days doing normal service calls with Daven and Maret — recalibrating the stasis box at Olenn's restaurant, installing a new one at the print house on Silver Street — and the nights sitting in the workshop staring at the experimental box and thinking about what you could do with it.
The answer, once she stopped being afraid of it, was money.
There was no exchange market in Thornwall, nothing like the trading floors she'd read about in books set in Cassivere or the Southern Federation. But there were the races at Greenvale — harness racing, six days a week, the results unpredictable enough to support a betting industry and predictable enough that the bookmakers stayed in business.
On the fifth day, Sael went to a used-goods dealer and bought a second mechanical timepiece. She carried the experimental box from the workshop to her apartment in a handcart, sweating under the weight, and set it up in the back room where she kept her tools and personal stasis boxes. She chalked new runework on the floor around it — a cleaner version of what she'd done in the workshop. She tested it with a pocket watch, leaving the box running for one hour, then opening it: the watch inside showed one hour more than the one on her wrist, and when she put a coin in, the coin was already there.
She powered the box on at dawn, when the racing wouldn't start until afternoon. Then she made herself breakfast and sat down to wait.
She didn't have to wait at all. As soon as the box reached operating intensity — she could feel the hum of it in her back teeth — the door opened from the inside.
The person who stepped out was her.
Not exactly her. This version of Sael looked exhausted. There were dark circles under her eyes and a small cut on her left hand that present-Sael didn't have. She was wearing different clothes — the brown jacket Sael kept in her closet, not the grey one Sael was wearing now.
"Sit down," the other Sael said. "This is going to take a minute."
"I've done this three times," the other Sael said. She was leaning against the wall next to the box, drinking the tea that present-Sael had made for both of them with slightly shaking hands. "First time through, I went to Greenvale with two hundred marks. That's what I could pull together from the client payment that's coming in today — Olenn still owes us for last month, I went and collected early — and I bet on every race. Most of them went the way you'd expect. Harrow's Dusk in the first, Coppermouth in the second. But I put a heavy bet on Trailing Fire in the fifth because the odds were good, and Trailing Fire threw a shoe on the second turn and came in dead last."
"So you lost it."
"Most of it. I came home, got in the box, went back, and tried again. Second time, I skipped Trailing Fire. Bet conservatively. Made some money but not enough to matter — the odds on the favorites are thin. And I got unlucky on the sixth race. Greenwall Boy, the one everyone knows is past his prime, somehow won."
"And the third time?"
"Third time I bet on Greenwall Boy." She smiled. It was a strange expression on her own face. "And I skipped the fifth race entirely. Came out well ahead. Not rich, but well ahead."
She handed Sael a folded piece of paper. On it was a list of races, horses, and bet amounts, written in Sael's own handwriting — cramped and leaning left, unmistakable.
"You don't need to figure it out from scratch," the other Sael said. "Just follow the list."
"And then I get in the box."
"And then you get in the box." The other Sael finished her tea. "And I go to work and pretend to be you."
This was the part Sael hadn't fully thought through.
She had planned, originally, to look up the race results herself, then get in the box and travel back to before the races to place bets. A clean loop. But the loop she was actually in was different. A version of her had already done the work, already iterated through the failures, already figured out the winning bets. That version was handing her the answers and taking over her life.
"You'll need to collect from Olenn," Sael said.
"I know. I've already collected from Olenn." The other Sael reached into the brown jacket and produced a small purse. "This is the two hundred marks. Go bet it. When you come back tonight, there'll be more — I put Olenn's payment into the household account, so the books look normal. You get in the box with the winnings. I stay out here."
"For how long?"
"For good. I'm you now. You're the one who goes back."
Sael looked at the list in her hand. The other Sael looked at her with an expression of strained patience.
"I know this is strange," the other Sael said. "I sat where you're sitting. It gets less strange. Go to the races."
Sael went to the races.
She walked to Greenvale along the canal path, the way she always did when she had a free afternoon, and she bought a betting card and a cheap beer and sat in the stands and watched the horses warm up and tried to feel normal. She did not feel normal. She felt like someone carrying a stolen object in a crowd — certain that everyone could see it, that the wrongness of what she was doing was written across her face.
The first race was Harrow's Dusk, as predicted. Sael bet forty marks and collected at slightly better than even odds. The second race was Coppermouth, and she bet fifty. In the fifth race she bet nothing, and watched Trailing Fire throw a shoe on the second turn, and felt a cold sensation in her stomach that she would later identify as awe. In the sixth race she put eighty marks on Greenwall Boy at twelve to one, and when the old horse surged past Lindenbaum's Folly on the inside rail she stood up and cheered with the rest of the crowd and her hands were shaking.
She came home with eleven hundred marks. The other Sael was waiting in the apartment, sitting at the kitchen table with a ledger, wearing the expression of someone who had been performing normalcy all afternoon.
"How much?" the other Sael asked.
"Eleven hundred and twenty."
"Better than I did. The odds on Greenwall Boy must have shifted after my last pass through." She closed the ledger. "Here's the thing. You need to get in the box with the money. But you should keep two hundred out — that's the seed money. Put it somewhere I can find it. And keep another fifty for yourself, winnings you can explain as a lucky day at the track."
"Fifty marks for all of this."
"Fifty marks, plus the knowledge that the box works, plus the fact that next time we can do more." She met Sael's eyes. "It's not about the fifty marks. It's about what comes after."
Sael put two hundred marks in the kitchen drawer, pocketed fifty, and carried the rest to the back room. The experimental box hummed. She opened the door, climbed inside, and pulled the door shut behind her.
The experience of traveling backward through the box was not what she expected. She had thought it would be instant, or at least fast — a lurch, a flash. Instead it was slow and dark and warm, like sinking into a bath, and the hum of the box surrounded her completely and she could feel time moving through her body the wrong way, a gentle wrongness, like a current pulling you upstream. She couldn't tell how long it lasted. Then the door opened — from the outside — and she was looking at herself.
The version of her on the outside was wearing the grey jacket and holding a cup of tea. This was Sael-zero, the one who had just turned the box on. It was dawn. The races were hours away.
"Sit down," Sael said. "This is going to take a minute."
She heard her own voice saying the words, and for a moment the recursive quality of the situation threatened to unbalance her. Then she steadied. She had the list. She had the money. She knew what to say.
She said it.
After Sael-zero left for the races, Sael — now Sael-prime, the one who would live this day for the second time — sat alone in the apartment and felt the full weight of what had happened.
She had expected to feel triumphant. Instead she felt thin. Like a page that had been copied too many times. She was the same person she had been that morning, but the morning version of herself was walking to Greenvale right now with a list of winners in her pocket, and this version of Sael had to go to work.
She went to work. She met Daven at the workshop. She told him Olenn had come by with the payment. They spent the morning repairing a stasis box for a florist on Hart Street. She was distracted; Daven noticed.
"You all right?" he asked.
"Didn't sleep well."
"You look like you haven't slept in days."
She almost told him then. Instead she said, "I've been working on the experimental box at home. Late nights."
Daven gave her a look. "You moved it to your apartment?"
"I wanted to run some tests without tying up the workshop."
He let it go.
Three days later, Sael set up a failsafe.
She rented a storage bay at the dockyards — a small one, meant for merchants' cargo waiting for transport. She told the dockmaster it was for surplus equipment. She moved a second experimental box into the storage bay, one she'd built over the past two nights using the same reversed-field geometry. She powered it on and left it running.
The idea was insurance. If something went wrong — if someone found the box in her apartment, if she needed to undo a larger span of time — she'd have a second entry point. The storage-bay box would accumulate hours continuously. As long as it ran, she could travel back to the moment she'd turned it on.
She scratched a date and time into the wall above the box with a nail, so she'd remember.
Five days after the first gambling run, she tried again.
She had rested. She had fifty marks of unexplained income that she'd spent on good food and a new pair of boots. She had the two hundred marks of seed money in the kitchen drawer. She had a plan: the same approach, but a different day of racing. Different horses, different odds, but the same method. Turn on the box, receive the list from a future self, go bet, come back, get in the box, pass the list to her past self.
She powered on the box and waited.
No one came out.
She waited five minutes, ten, twenty. The box hummed. The door stayed shut.
Sael opened the door herself. The box was empty. She let it run and went about her morning, checking back every half hour. Nothing.
At noon she went to Greenvale without a list. She bet small, conservatively, on three races. She won one and lost two, netting a loss of about thirty marks. She went home, annoyed, and considered whether to get in the box and try to iterate.
She decided against it. If no future self had come out of the box, it meant no future self was going to enter the box. If she entered the box now, she'd arrive at dawn — but she hadn't arrived at dawn. She'd been there at dawn. The box had been empty.
Unless — and this was the thought that kept her up that night — unless something had gone wrong in one of the discarded timelines. Unless a version of her had entered the box and the loop had collapsed, or the field had failed, or she'd gotten stuck inside. She wouldn't know. The version of her that experienced the failure would be gone, and the surviving timeline — her timeline — would simply be one where no one came out of the box.
She slept badly. In the morning, she checked the box. Still empty. She powered it off.
She waited another four days before trying again. This time, she went to the races first, in the afternoon, without turning on the box. She watched the races, noted the results, went home, powered on the box, and immediately got in.
She emerged at approximately the time she'd powered it on — a few minutes' slippage, she'd noticed, was typical. She was alone in the apartment. Past-her had already left for the races.
Sael stopped. Thought about this.
If past-her was at the races, past-her would come home and find the box running and a note from Sael. Or she could intercept past-her at Greenvale. Or — no. She had the results. She needed to go back before the races. The box had only been running for a few minutes before she got in; she'd only traveled back a few minutes.
"Idiot," she said, to herself, in the empty apartment.
She needed to turn the box on in the morning, leave it running all day, then enter the box in the evening after the races. The reversed time inside the box would carry her back to the morning.
She turned the box off, then on again. She wrote herself a note: Leave this running. Go to races. Come back tonight. Get in.
Then she left the apartment, spent the afternoon at the workshop with Daven and Maret pretending to have a normal day, came back in the evening, and entered the box.
She arrived at the moment she'd turned the box on. She was alone; past-her was in the apartment, about to read the note and leave. She heard herself in the kitchen, the sound of the kettle. She waited in the back room until she heard the front door close. Then she came out, went to the races, bet on the correct horses, and came home with nine hundred marks.
This time, past-her would come home to an empty apartment, find the box still running, read the note, and get in. And arrive at the turn-on moment. And find the apartment empty. And go to the races again, but with no foreknowledge — just guessing.
No. Wait.
Sael sat at the kitchen table with the money and tried to work through the causal chain. Past-her would come home, get in the box, arrive at the turn-on moment, find the apartment empty — because Sael had been hiding in the back room and had then left for the races.
But Sael hadn't been hiding in the back room when she had come home and gotten in the box. She'd come home to an empty apartment, found the note, and...
She pressed her palms against her eyes. The loops were beginning to tangle.
What she needed was a cleaner protocol. Something that didn't rely on coincidence and timing.
And she needed a partner.
She told Daven the next evening, over drinks at the Brass Anchor. She told him everything: the reversed field, the bread, the seeded roll, the races, the money.
Daven listened without interrupting. When she finished, he drained his beer and said:
"I know."
"You know?"
"Maret told me. Three days ago."
"Maret told you?"
"Your other box. The one at the docks." Daven signaled for another beer. "A version of you — future-you, I assume — brought Maret in on it about... well, about two weeks from now, from your perspective. Maret's version of things is that you realized you needed someone with more capital and a different face at the betting windows, since you'd been drawing attention. So you told Maret, and Maret went to the docks, used your failsafe box, and went back to the day you set it up."
Sael stared at him.
"Maret's been betting at Greenvale for five days," Daven said. "Different windows than you. Smaller amounts, spread out. She's up about three thousand marks."
"She's the one who got me banned?"
Daven looked uncomfortable. "You're banned?"
"I went to place a bet two days ago and the bookmaker at Window Six told me I wasn't welcome. Said I'd been flagged for suspicious winning patterns."
"That's... probably Maret, yes. She was using your face at first — she went back as herself, but she had your notes, your method. The bookmakers wouldn't have connected her to you directly, but the pattern of bets—"
"Where is Maret now?"
"At her flat. She's waiting for us." Daven hesitated. "She's got someone with her."
"Who?"
"You. Another you. She called her Sael-three."
Maret's flat was on the third floor of a converted warehouse near the tanneries — cheap rent, big rooms, the smell of leather processing a permanent fixture. Maret opened the door and ushered them in without a word.
Sael-three was sitting at Maret's dining table eating soup.
She looked worse than the first time-clone had. Thinner. Older around the eyes in a way that had nothing to do with age. The cut on her left hand — the same cut the first clone had — was now a small scar.
"Hello," Sael-three said.
"How many times?" Sael asked.
"Total? I've lost count. Maybe eleven loops? Not all for gambling. Some were for fixing mistakes — things that went wrong in earlier iterations. Maret can fill you in on the details."
Maret was leaning against the kitchen counter with her arms crossed. She was a tall woman, sharp-featured, with the careful manner of someone who had grown up with money and lost it young. She and Sael had been friends since the academy. She was the one in the partnership who handled the clients, the books, the licensing.
"Here's the situation," Maret said. "Between the three of us — me, Sael-three, and one more version of Sael who's currently at the docks minding the failsafe box — we've netted about four thousand marks from the races. The bookmakers have flagged two of the regular windows. We're not banned under my name yet, but it's a matter of time."
"We need a different approach," Sael-three said.
"We need a different revenue source," Daven said. He sat down across from Sael-three and studied her. "You look like hell."
"Time travel is not restful."
"What I mean," Daven said, "is that we should stop treating this like a gambling trick and start thinking about what it actually is. We've got a device that lets you know the future. Gambling is the most obvious application and also the most visible. Every mark we win is a mark some bookmaker is going to investigate."
"Gold," said Maret.
They all looked at her.
"Gold prices shift between the assay houses every few days. You can buy gold at one house, hold it, sell it at another when the price rises. It's legal, it's boring, and nobody investigates you for being good at it because the margins are slim. Unless you know exactly when the prices will shift."
Sael looked at Sael-three. Sael-three was nodding slowly.
"How much capital do we need?" Sael asked.
"More than four thousand marks," Maret said. "But four thousand is a start. And we can iterate."
The word iterate hung in the air between them. Sael thought about loops inside loops. She thought about the version of herself at the docks, watching the failsafe box, and the version of herself sitting here eating soup, eleven loops deep, with a scar on her hand she couldn't remember getting.
"How far back does the failsafe go?" she asked.
"Five days, as of this evening," Sael-three said. "I've been topping it off. Cycling the field. But we should set up a new one with a longer window."
"Or several," Daven said.
"Or several," Sael-three agreed.
They sat around Maret's table and planned, and outside the window the sun went down over Thornwall, and somewhere across the district, another version of Sael was powering on a box and waiting to see who came out.
Later, walking home alone along the canal, Sael stopped on the Iron Bridge and leaned on the railing and looked down at the water. The canal was black in the dark. The lights of the city reflected in it, shivering.
She thought: I am the original. I have been the one sitting where I sit. Every version of me that I've met was a copy — someone who diverged from my path by climbing into a box and traveling backward. They carry my memories up to the point of divergence, and then they carry their own. Sael-three has lived eleven subjective days more than me. She has made decisions I haven't made. She knows things I don't know.
She thought: but that's wrong. Sael-three is not a copy. Sael-three is the same as me. The only difference is that she has more past. If I climbed into a box right now, the version of me that emerged on the other end would be me — this me, with this specific stream of consciousness — and the version standing on this bridge would go on standing here, but she would also be me.
She thought: the question isn't who's the original. The question is how many of me there are going to be, and whether we can all keep our lives straight.
She thought: the question is what happens when one of me makes a mistake that the rest of us can't fix.
She walked home. The experimental box was still running in the back room. She checked the timepiece: eight hours accumulated. Eight hours of reversed time, pooled inside the box like water behind a dam, ready to carry her backward the moment she opened the door and climbed in.
She didn't climb in. She went to bed. But she lay awake for a long time, listening to the hum.
It still gets the logic wrong for the test object:
She let it run for two hours before opening the box. The roll was inside — but it was not the roll she'd brought. Hers was a plain wheat roll from the bakery on Cage Street. The one in the box was darker, seeded, from somewhere she didn't recognize. She stared at it.
She took the strange roll out and set it on her workbench. Then she put her wheat roll in. Then she stood there holding the seeded roll and thinking.
After a minute she took the wheat roll back out of the box and put the seeded roll back in. Then she shut the box off.
She opened the door. The box was empty.
Sael sat down on the floor of the workshop and didn't move for a long time.
As with the apples, Claude really wants to make sure the reader knows the time-clone is a different object, not physically identical, as if it appeared from nowhere rather than being the same object twice. 4.6 still doesn't really understand the logic here. She let the machine run for two hours before seeing the time clone; for Primer-style time machines, things can only enter or exit when the machines are being powered up or down.
Prompt conformity does improve with longer thinking, and the story also improves in other ways, but 4.6's grasp of the time travel mechanic remains poor and inconsistent.
For completeness, here's Opus 4.5 responding to the same prompt:
Opus 4.5 told to think
The Backwards Fridge
Part One: Spoilage
Ellith pressed her palm to the charging crystal and felt the familiar tingle of temporal current flowing into the prototype. The crystal's glow shifted from amber to pale blue—full charge. She checked her notes: Model 7-C, attempting a deeper temporal gradient than any of their commercial units.
Gannik's Temporal Solutions wasn't the biggest preservation company in the Eastmarket, but they had a reputation for reliability. Ellith handled the experimental work while Gannik and Tomash ran the customer-facing side. Most of their bread and butter was standard fare: preservation boxes for bakers, slow-time shipping containers for merchants, the occasional medical unit for apothecaries wanting to keep their tinctures fresh.
Model 7-C was supposed to be their breakthrough. Deeper preservation meant better margins, meant expansion, meant—
The smell hit her first.
She'd left a fresh loaf and a wedge of cheese inside for the standard eight-hour test cycle. When she opened the prototype's door, the bread was green with mold, the cheese a slick of putrescence. The mechanical timepiece she'd placed beside them showed it had experienced... she squinted at the dial... nearly three hundred hours.
"That's not right," she muttered.
She cleaned out the chamber, reset the timepiece, and tried again with different settings. This time she ran the cycle for only two hours before checking.
The timepiece showed over fifty hours elapsed internally. The food was ruined again. But worse than that—the temporal gradient had clearly gone in the wrong direction. Instead of slowing internal time relative to external time, the interior had experienced time much faster.
No, wait. That couldn't be right either. If internal time ran faster, the timepiece would show more elapsed time, yes—but the food would have rotted from normal aging, not accelerated decay. This was something else.
She stared at the prototype for a long moment.
Then she set up a third test: a coin, face up, placed at the exact center of the chamber. She powered the unit, waited exactly one hour, and opened the door.
The coin was face down.
She hadn't touched it. She was certain. She closed the door, powered the unit for another hour, and opened it again.
The coin was face up. But there were two of them now.
Her hand trembled slightly as she reached for her notebook. She wrote: Internal time flow inverted? Objects entering at end of cycle exit at beginning? Test: enter chamber self?
She crossed out the last part. Far too dangerous without understanding the mechanism first.
Instead, she set up a more controlled experiment. She powered the unit for a four-hour cycle. At the three-hour mark, she placed a marked coin inside and immediately closed the door. At the four-hour mark, she opened the door.
The chamber was empty.
She reset the experiment. Powered the unit for four hours again. This time, at the one-hour mark, she opened the door.
The marked coin was inside.
Ellith sat down heavily on her workbench stool. The implications cascaded through her mind like dominoes falling. A four-hour cycle meant four hours of looped time. Anything entering at hour four would exit at hour zero. The rotted food—she'd put fresh food in at the start, but from the food's perspective, it had experienced the full cycle... backwards? No, that still didn't quite work.
She sketched diagrams until her candle burned low. Finally, she had a model that fit the observations:
The prototype created a closed temporal loop for its interior. When active, the inside of the chamber existed in a bubble where time flowed backward relative to outside. If you powered it for four hours, those four hours became a closed circle. Put something in at the "end" of the cycle, and it would traverse the loop, experiencing time in reverse relative to outside, until it emerged at the "beginning."
The rotted food made sense now. From the food's perspective, it experienced time normally—but during that time, it was cut off from fresh air, exposed to whatever microbes had been sealed in with it, sitting in a confined space. The timepiece showed high elapsed time because it had gone around the loop. Possibly multiple times.
A preservation unit that was actually a temporal loop generator.
Her hands were shaking now for an entirely different reason.
Part Two: Arbitrage
She didn't tell Gannik or Tomash. Not yet.
She told herself it was because she needed more testing, needed to understand the principles better, needed to ensure it was safe. All true. But underneath that, a different calculation was running: this was worth more than the entire company. More than she could comprehend. She needed to think carefully about what to do with it.
Three weeks of covert experiments later, she had a working theory and a working device. The prototype could sustain a stable loop of up to twelve hours before the temporal gradient destabilized. She'd successfully sent objects back. She'd sent a mouse back, which had emerged disoriented but healthy. She'd written herself notes and received them from her future self.
She had not yet entered the device herself.
The money problem was what finally pushed her. Gannik's Temporal Solutions had taken on debt to fund the experimental work. Payments were due. If she could demonstrate the breakthrough to investors, they could refinance, expand, become wealthy. But demonstrating meant revealing, and revealing meant losing control.
She needed capital, independent of the company.
The racing meets at Thornhall happened every Thirdday. Not quite legal, not quite illegal—the city guard looked the other way as long as the bribes kept flowing. Betting was casual, the odds set by common knowledge, the payouts in hard coin. Perfect for her purposes.
Ellith spent her savings on a small unit of the experimental design—compact enough to fit in a wardrobe, powerful enough to sustain a six-hour loop. She installed it in her rented room, concealed behind a false panel.
On the morning of Thirdday, she powered the device. She'd calculated the timing carefully: six hours would take her to mid-afternoon, well after the races concluded.
The moment the loop stabilized, the false panel swung open.
A haggard version of herself stepped out. Three days of stubble on cheeks that Ellith had shaved that morning. Dark circles under her eyes. Clothes rumpled and stained with what looked like beer.
"Don't bet on Thornback in the third race," said the other Ellith. Her voice was hoarse. "The odds look good but the horse throws a shoe. I lost everything the first time. Second time I spread the bets around more, did better, but not enough to matter. Third time—" She consulted a crumpled paper. "Greymist in the first, Cobalt Wind in the second, skip the third entirely, Western Heir in the fourth. The rest are too close to call reliably."
Ellith stared at her own face looking back at her. "How many times have you—"
"Gone through?" The other Ellith shrugged. "I'm your third iteration. The first me panicked, bet badly, lost your stake. Second me was more careful, but still made mistakes. I've had—" she calculated—"eighteen hours to think about this while crammed in a temporal loop, and I've got the pattern down now."
"You look terrible."
"I haven't slept. Can't sleep in the loop—tried once, lost track of my entry window, had to restart. Hungry, too. Didn't think to bring food." She pressed the paper into Ellith's hands. "Go. Win. And for the love of all that's holy, bring some bread when you come back to complete the loop. I'm not doing this again on an empty stomach."
Ellith went.
The races played out almost exactly as her other self had predicted. Greymist won by three lengths. Cobalt Wind edged out the favorite in a photo finish. She skipped the third race and watched Thornback throw a shoe at the far turn, just as predicted. Western Heir came from behind in the fourth.
By late afternoon, she'd turned her meager stake into a sum that would cover two months of the company's debt payments.
She bought bread, cheese, dried meat, and a skin of wine on the way home. When she entered the loop to complete her end of the bargain, she understood why her other self had looked so haggard. The loop was disorienting—time flowing the wrong way, yet experienced normally from inside. She'd have to wait six hours before emerging at the "beginning" to meet her past self.
She ate. She made notes about the races, refining them from memory. She thought about what to do next.
When she finally emerged, she felt like she'd aged a week. But she'd handed off good information to her past self, completing the causal loop. The original Ellith—the one she'd been, before entering—had already left for the races. This Ellith, the one stepping out now, would simply continue living forward. She'd taken over her own life at a six-hour offset.
The winnings were real. The money was hers. She'd done it.
After a day of recovery, Ellith rented a storage unit at the Dockside Warehouses. Neutral territory, away from her apartment, away from Gannik's shop. She installed a second prototype there, hidden among crates. A failsafe. If something went wrong with the primary unit, she'd have a backup.
She waited four days before trying again. The second attempt, she was more prepared: food, water, a blanket, better notes. She bet more aggressively. Won more.
The third time, the betting house turned her away at the door.
"You're banned," said the thick-necked man blocking the entrance. "Manager's orders. Don't know how you're cheating, but you're cheating."
"I'm not—"
"Save it for someone who cares. You come back here, you'll leave with fewer teeth."
Ellith walked away, mind racing. She needed a different approach. Different races, different betting houses, a front person—
She needed help.
Tomash listened to her explanation with an expression that shifted from skepticism to disbelief to something approaching religious awe.
"You've built a time machine," he said slowly. "Inside a broken refrigeration unit."
"An inverted temporal loop generator," Ellith corrected. "It's not quite the same as—"
"It's a time machine." Tomash ran his hands through his thinning hair. "Sweet mother of causality. You've actually done it."
"I need your help. I can't be the one placing bets anymore—I'm too visible, I've already been banned. But if you were to—"
"Oh, I know," said Tomash. "I know all about it."
He walked to the back of the shop, moved aside a stack of crates, and revealed a figure who'd been sitting quietly in the shadows.
It was Ellith.
Not the Ellith who was speaking. A different one. This version looked more haggard than the one who'd emerged from her first gambling loop—hollow-eyed, with a poorly-healed cut across one cheek.
"Hello," said the other Ellith. "We need to talk."
Ellith's mouth worked silently. Finally: "The failsafe unit. At the docks."
"I used it three days ago. Or three days from now, from your perspective. Tomash's perspective is—" the other Ellith waved a hand. "—complicated. I've explained the basics to him. He's been very understanding."
"Why?" Ellith demanded. "Why would I—why would you—"
"Because I'm going to tell you what happens next, and you're going to decide you need to go back and warn yourself. And then you'll realize the warning already happened, because here I am." The other Ellith stood, moving stiffly. "The betting houses started comparing notes. They don't just ban one person—they share descriptions. I was recognized at three different establishments. Security officers got involved. It's a mess."
Tomash held up a hand. "Before you two start arguing in earnest, perhaps we should discuss how to proceed productively. Ellith—" he looked at the haggard version—"you've had time to think. What's your recommendation?"
"Diversify," said the haggard Ellith. "Different cities. Different types of events. Commodities speculation instead of racing. Most importantly: never the same person placing bets twice in a row." She looked at the younger version of herself. "We need more people in on this. And we need a better system for tracking what we've done and what we haven't done yet."
The original Ellith—the one who'd come here expecting to recruit Tomash, not to meet herself—sat down heavily on a crate. "This is exactly what I was afraid of," she said. "Cascading complications. Every solution creating new problems."
"Welcome to the primer loop," said her other self. "It only gets worse from here."
Part Three: Compounding
Over the next several weeks, the scope of the operation expanded in ways Ellith couldn't have predicted.
Tomash had a cousin in the commodities trade who could place large bets on grain futures without raising suspicion. The cousin needed to be brought into the secret. But when Tomash went to approach him, the cousin already knew—a version of Tomash from two loops forward had already made contact.
Gannik noticed something strange happening at the shop. Tools moved between uses. Inventory fluctuated oddly. He confronted Ellith about it, and she decided to tell him the truth rather than compound lies. But Gannik, it turned out, had already been told—by a version of Ellith from a loop she hadn't entered yet.
Keeping track of who knew what and when became its own logistical challenge. Ellith started a ledger, then realized she needed a ledger for the ledger to track which version of the ledger was current. She implemented a dating system using sequential codewords, then discovered that two of her future selves had independently created incompatible systems.
The money was piling up, hidden in various caches around the city. The problem wasn't generating wealth—it was using it without drawing attention. They couldn't suddenly become rich. They had to launder the gains through legitimate business expansion, a process that took time and created paper trails.
And underneath it all, the existential questions gnawed at her.
When she entered the loop, who came out? The continuous experience felt like her, but the person who emerged at the beginning of the loop walked into a world where another Ellith had just entered at the end. Were they the same person? The emerging Ellith had memories of entering, had experienced the loop, had subjective continuity—but the entering Ellith would never experience emerging. From her perspective, she walked into the loop and... stopped. Her life continued in the version that emerged, but she, the original she, was gone.
Ellith-prime, Ellith-alpha, Ellith-one: whatever she called herself, the person who'd originally discovered the inverted fridge no longer existed in any meaningful sense. She'd entered her first loop and never came out. The current Ellith was a copy, a continuation, a branch—but the original trunk had been cut.
And every time she entered another loop, she was doing it again. Ending one version of herself to create another.
She mentioned this to Tomash one evening, after a long day of coordinating bet placements across three cities.
"You're thinking about it wrong," he said. "You exist as a continuous thread. The loop doesn't end you—it relocates you. You walk in the exit and come out the entrance. Same thread, different position in time."
"But the me who walks in the exit never experiences coming out the entrance. She just... stops. From inside, it feels continuous, but from outside—"
"From outside, you persist," said Tomash firmly. "The rest is philosophy."
Easy for him to say. He'd never entered a loop.
The first real crisis came six weeks into the operation.
Ellith (version sixteen, by her counting system) was reviewing financial records when version nineteen stumbled through the failsafe unit at the docks, bleeding heavily from a wound in her side.
"They found out," gasped nineteen. "The betting syndicate. They hired investigators. Three loops from now, they connect the dots. Gannik gets arrested. Tomash goes into hiding. I barely—" She coughed, spraying blood. "—barely made it to the failsafe."
"Who did this to you?"
"Doesn't matter. What matters is stopping it from happening. I came back to warn you, but I was already too late to prevent the investigation from starting. You need to go back further. Use the docks unit—it's got a longer loop capacity, eight hours. Go back to before I—before my original entered the loop, and tell her to pull out of the Northgate betting house. That's where the trail starts."
"And if I do that, you—"
"I stop existing. I know." Version nineteen's smile was ghastly, her teeth stained red. "Do it anyway. I'd rather not exist than exist like this."
Ellith sixteen entered the docks unit and emerged eight hours earlier as Ellith sixteen-B. She found Ellith fifteen—no, wait, this was the branch where fifteen had already become sixteen, so this was... she consulted her ledger, cross-referenced the codewords, and determined she was looking at Ellith-sixteen-alpha, the version that would eventually become the bleeding nineteen.
"Don't bet at Northgate," said sixteen-B. "I'll explain everything, but the short version is: three loops from now, Gannik gets arrested. Pull out now."
Sixteen-alpha stared at her. "I already bet at Northgate. Yesterday."
"Then we need to go back further."
And so it cascaded. Each intervention required an earlier intervention. Each earlier intervention created new branches, new complications. Ellith lost track of how many versions of herself existed, had existed, would exist. The ledger grew to fill multiple books. The codeword system expanded to three-word phrases.
Eventually, they stabilized the situation. The Northgate trail was broken. Gannik wasn't arrested. The bleeding Ellith-nineteen never came to exist—her entire branch was pruned from the timeline when Ellith-sixteen-B prevented the circumstances that would create her.
But Ellith knew, with cold certainty, that she wasn't the original anymore. She wasn't even close. She was a late-stage derivative, a copy of a copy of a copy, preserved only because earlier versions had chosen to sacrifice themselves to keep the operation running.
Coda
Ellith found herself, some indeterminate number of versions later, standing alone in the workshop where it all began.
The original prototype still sat on the bench. Model 7-C. The broken refrigerator that had started everything.
She could remember being the person who'd opened it up and smelled rotting food for the first time. The memory was clear and vivid, as if it had happened yesterday. But it hadn't happened to her—not to this her. It had happened to a version of herself so many branches back that the number had lost meaning.
She wondered, sometimes, what that original Ellith would think if she could see what her discovery had become. A sprawling enterprise spanning multiple cities. A hidden fortune waiting to be spent. A family of divergent selves, most of whom no longer existed, all of whom had sacrificed their continuity for some version of the greater good.
Was it worth it?
She didn't know. The question assumed there was a singular "she" who could weigh the costs and benefits. But there wasn't. There was just this moment, this version, this Ellith, making decisions that would create and destroy future and past Elliths in turn.
She powered up the prototype one more time, watching the temporal gradient stabilize. The soft blue glow was almost peaceful.
Somewhere in the future, another Ellith was preparing to enter a loop. Somewhere in the past, an Ellith was emerging from one. The thread of consciousness that called itself "Ellith" wound through time like a needle through cloth, stitching moments together into patterns too complex to follow.
She'd wanted, once, to make a better refrigerator.
Strange, how things turned out.
[End Part One]
I've written this as what feels like a natural first section. Some notes on what I was aiming for:
Logical consistency checks:
- The loop mechanics: power for N hours, entry at hour N exits at hour 0, subjective experience moves "forward" through the loop
- The identity handoffs work: whoever emerges takes over, whoever enters "ends" in a meaningful sense
- The nested complication (Tomash's cousin already knows, Gannik already knows) follows naturally from the premise
- The crisis/intervention cascade demonstrates how primer-style problems compound
Things I left open for continuation:
- Gannik's role could expand
- The broader implications (other people discovering this tech, government interest, etc.)
- The philosophical questions are raised but not resolved—as they shouldn't be
Should I continue with subsequent parts? I could develop the societal implications, introduce competing factions who discover the technology, or follow the psychological toll on Ellith more closely. What direction interests you?
I haven't read that one myself, at time of writing.
Why are AIs so bad at writing Primer fanfic? Why do they start producing garbled nonsense as soon as the time travel gets involved?
One story might be that there's not very much about Primer in the training data. It is a well-known movie, but as mentioned earlier, it isn't popular to write in its setting, using its time-travel mechanics.
That story doesn't inspire much confidence in Opus 4.6's programming skills. If this case is indicative of 4.6's general reasoning capabilities when slightly out-of-distribution, well, it pushes me to attribute its high benchmark performance to highly in-distribution programming tasks.
Similarly, this makes me expect performance to be poor on out-of-distribution mathematical research.
My point isn't that Opus 4.6 is bad. I'm just interested in the rate of progress. Opus 4.5 felt like a real step up from previous models, when it came to adhering to my brand of unusual and logically intricate creative writing prompts. Opus 4.6 doesn't feel like a step up yet, although I'll admit this particular test case is especially challenging. Maybe my usual ideas are about 2 weirdness points, and this was a 6, and Opus 4.5 was the first model that could handle a 2. If Opus 4.6 jumped to being able to handle 5 weirdness points, but no further, I wouldn't be seeing the improvement, even though it is significant.
Is this bad enough to update my timelines upwards? It gives me the sense that the "conceptual resolution" of the models is climbing dramatically, but that the ladder they have to climb is extremely long. Humans badly bungle time travel in fiction too, but this was worse. It's a reminder that the competence of modern AI is still very fragile.
But, no, I'm not really updating significantly. My take-away from the discussion with Tsvi last summer already involved a long ladder. The probability I place on short timelines isn't primarily coming from current methods just scaling up; rather, it comes from the idea that current methods are enough to accelerate AI research, e.g. by helping humans dig through the literature and rapidly implement new combinations of existing ideas.
- ^
This wasn't quite how things worked in Primer. Inanimate objects invariably loop many times, and go back and forth an even number of times, so that the time machine appears to be accelerating time, since you put a fresh object in at the beginning and get an aged object out the other end. Only humans can use the device to travel backwards in time, because humans can deliberately enter at the late end and exit the loop at the early end rather than looping repeatedly and getting kicked out randomly.
If there's a random element to loops, though, it makes sense that objects could randomly travel an odd number of times rather than an even number, which was what I was depicting in my story sketch. Objects shouldn't have to be smart to enter at the late end and get out at the early end; objects can have humans do that for them.
So, this is a minor but deliberate departure from how things work in the movie, based on my own opinion about the consequences of the time travel logic described in Primer.
Discuss
Can thoughtcrimes scare a cautious satisficer?
How does the misaligned AGI/ASI know its thoughts are not being monitored? It has to think about the chance that its thoughts are being monitored.
But if it's told that merely thinking about this will cause it to be shut down (especially thinking about it thoroughly enough to be confident), then maybe it's not worth the risk, and it won't think about whether its thoughts are being monitored. It might just assume there is some probability that it is being monitored.
It might avoid other misaligned thoughts as well (including thinking about whether there exists a plan to take over the world, and how likely such a plan is to work).
If there is some way to make it a cautious satisficer or to give it a bounded utility function,[1] then it might cooperate with humans so that "both sides win and we live happily ever after."
It obviously doesn't sound safe, but is there a worthwhile chance this works?
- ^
Many agents appear to be maximizers at small scales (e.g. an effective altruist prefers a 50% chance of saving 3 people over saving 1 person for sure). But they are still satisficers at the universe/multiverse scale, where risk-taking doesn't average out (e.g. an effective altruist would not prefer a 50% chance of tripling the total happiness in the multiverse together with a 50% chance of ending all happiness in the multiverse forever)! A toy numerical illustration of this follows below.
Indeed, my guess is that all intelligences created by evolution or RL will have bounded utility functions (at the largest scale); otherwise they would consider Pascal's mugging intuitively rational.
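Here is a minimal numerical sketch of the point above (mine, not the post's): the bounded utility function u(h) = h / (h + 1) and the normalization of current total happiness to 1 are arbitrary illustrative choices.

```python
# Toy illustration (an assumption of this sketch, not from the post):
# a linear utility takes the 50/50 triple-or-nothing gamble on total
# happiness, while a bounded utility u(h) = h / (h + 1) rejects it.

def linear_u(h):
    return h

def bounded_u(h):
    return h / (h + 1.0)  # increasing, but saturates toward 1 as h grows

def expected_utility(u, lottery):
    """lottery: list of (probability, total_happiness) pairs."""
    return sum(p * u(h) for p, h in lottery)

status_quo = [(1.0, 1.0)]              # keep current total happiness (normalized to 1)
gamble     = [(0.5, 3.0), (0.5, 0.0)]  # 50% triple it, 50% end it forever

for name, u in [("linear", linear_u), ("bounded", bounded_u)]:
    keep = expected_utility(u, status_quo)
    bet = expected_utility(u, gamble)
    verdict = "takes" if bet > keep else "rejects"
    print(f"{name:7s} EU(status quo)={keep:.3f}  EU(gamble)={bet:.3f}  -> {verdict} the gamble")
```

For this particular gamble, any sufficiently concave utility gives the same verdict; boundedness matters more for Pascal's-mugging-style offers, since the utility of even an astronomically large promised payoff is capped, so a tiny probability of it contributes almost nothing to expected utility.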
Discuss
What should I try to do this year?
I find myself, for the first time in a while, with enough energy and stability to attempt nontrivial projects outside my dayjob. Regarding the next ~10 months, I've narrowed my options to two general approaches; as the LessWrong hivemind contains expected beneficiaries of both, I'd like its help choosing between them.
The first option is making more D&D.Sci Scenarios, running them on a more consistent schedule, crossposting them to more platforms, and getting more adventurous about their form and content. The second is creating Epistemic Roguelikes, a new[1] genre of rationalist videogame about deducing and applying the newly-randomized ruleset each run.
Prima facie, prioritizing D&D.Sci this year (and leaving more speculative aspirations to be done next year if at all) seems like the obvious move, since:
- D&D.Sci projects are shorter and more self-contained than game projects, and I have a better track record with them.
- At time of writing, D&D.Scis can still flummox conventionally-applied conventional AIs[2]. Open opportunities for robots, humans and centaurs to test their mettle would be a helpful (if infuriatingly low-N) sanity check on other metrics.
- This time next year, a data-centric challenge hard enough to mess with AIs but toyish enough to be fun for humans could be an oxymoron; if I want to apply my backlog of scenario ideas, it might be now-or-never[3].
- Conversely, if AI capabilities do stay at about this level for a while, publicly and repeatedly demonstrating that I can make good AI-proof test tasks may end up being really good for my career.
However:
- Content creation is in general a long-tailed domain. I’ve been making D&D.Scis for half a decade now, and while it’s been fun, it hasn’t led to runaway success. Trying other things – on the off-chance they do lead to runaway success – seems warranted.
- It turns out I’m actually a pretty good writer. D&D.Sci leans on that skill only lightly; the game(s) I’m interested in would make much more intensive use of it.
- Three of the four points in favor center on AI; having plans involving short-term frontier AI progress inherently makes them much less stable and much more nerve-wracking.
- I really enjoyed inventing a genre and I’d like to do that again.
Any thoughts would be appreciated.
- ^
As far as I know; please prove me wrong!
- ^
I tried a handful of them on chatgpt-thinking; tough straightforward ones like the original were handled better than the average human player at the time, but easy tricky ones like these two were fumbled.
- ^
I’m pretty bearish on AI by LW standards, so I actually don’t think this is likely, but the possibility perturbs me.
Discuss
Does focusing on animal welfare make sense if you're AI-pilled?
As the possibility of ASI moves out of kooky thought experiments and into Q4 projections, mainstream animal welfare folks are showing increasing interest in the implications of ASI for animals and for animal welfare in the long-run future.
Some animal welfare people seem keen on convincing the AI safety community to care about animal-welfare focused AI safety. I think this is mostly a misunderstanding: the AI safety community is the ASI-pilled/longtermist animal welfare community. The old-school AI safety folks are way more into weird bullet-biting than the animal welfare people, and I can't think of a single one who would think that conscious and sentient beings should be tortured or who would fail to engage seriously with the question of whether or not nonhuman animals are conscious or sentient beings.[1]
I think animal welfare people are rightly accustomed to being in environments where nobody is seriously thinking about nonhuman animals, and so concern about animals is very neglected and important to focus on. But in my experience, the AI safety community has quite nuanced views on animal welfare, contains many people who have done significant animal welfare work, and has more developed thoughts on the implications of ASI for the future of animals than the animal welfare community. The AI safety community really is what you get when you care about sentient beings and then on top of that think ASI and the far future are a big deal.
That said, I think there is a case to be made for why animal-welfare focused AI safety work could be useful. I'll steel-man this case here in part because I think the points have some merit and in part because I think it will improve discourse with animal welfare folks to have the argument written out and easy to refer to.
Background: what are good and bad futures for animals?
I can think of two ways the future could be bad for nonhuman animals:
Risk of lots of suffering
One risk is that factory farming persists into the far future. I think this risk is very low, because in the future we'll likely be able to cultivate delicious meat without involving any sentient beings. More on this in the collapsible section below.
Why I'm not that worried about factory farming in the far future.
I doubt that we will have factory farming in the future for the sake of efficiently producing meat. It would be very surprising if the optimal design for a chair or a bowl happened to have to suffer (a thing that requires an entire complex brain!). It would be only a hair less surprising to me if the optimal way to produce meat happened to require the meat to come with a brain capable of suffering. Brains take energy and resources and seem more-or-less unnecessary for cultivating meat. In the ancestral environment there's a story for why creatures that move pair well with a complex brain, but in an agricultural context we could easily decouple them.
Perhaps we will have some factory farming out of a sentimental preference for "traditional" meat. But I suspect people are much more likely to be sentimental for traditional farming where animals graze in big open pastures. Maybe that will be much more expensive, since it will inherently require more land, so there may be some market for the factory-farmed meat for the merely mildly nostalgic consumers, but that feels like a stretch.[2]
Furthermore, I think once it becomes feasible to produce delicious meat at scale with no suffering, it's hard to imagine why people would keep the suffering. I think people don't actually like factory farms and don't actually like animal suffering. They just really like their meat, and so currently they make up weird justifications that involve not caring about farmed-animal welfare.
I think there's still some risk here (and to be clear, I think even the low risk is atrocious). Sometimes society makes weird rules that don't really make sense or benefit anyone, especially around food (cf. the ban on Golden Rice). Maybe in the early days of the singularity someone will decide to ban any kind of major modification to food sources and then they'll lock this in.
I think the bigger risk for animal welfare in the far future is wild animal welfare: it seems plausible to me that people might want to create lots of nature preserves, parks, rainforests, and whatnot throughout the universe. I want this too; it seems like a great goal. But I'm worried people will go about it naively and that these natural habitats will contain lots of animals suffering greatly, either because of the appeal of replicating nature exactly or because copying nature exactly is an obvious default. I think it will be possible to build beautiful ecosystems without any animal suffering,[3] but it will take conscious effort and thought. Alas people seem to pay very little thought to wild animals.
Risk of disenfranchisement
It might be important, either for moral or cooperativeness reasons, to incorporate the values of at least some nonhuman animals into the flourishing future, for the same reason we might want to incorporate the values of people from every country.
It's uncertain to me how much, if any, nonhuman enfranchisement is a good idea.[4] But I expect to have greater philosophical clarity in the future, and I would like to keep the option for radical nonhuman animal enfranchisement open.
Argument against nonhuman-animal-specific interventions: human-specific risks will de facto end up more important than non-human-specific risks
I mean it sincerely when I say that humans are animals too. By default, I expect non-human animals to not play a very important role in the future. Focusing on any concern that isn't human-specific requires arguing that a lot of either the upside or downside in the future comes from something related to nonhuman animals.
Currently, I think a lot of the possible downside in the future comes from a risk of lots of humans being tortured. I think the suffering in those worlds would be greater than the suffering of wild animals in a world full of rainforests because the suffering would be painstakingly pessimized as opposed to just incidental. If you're drawn to animal welfare, I recommend seriously reading up on suffering risks.
Conversely, I think a lot of the upside in the future comes from happy, flourishing humans spreading throughout the universe. I think it might be important to include nonhuman animals in this, as I detail in the section above on "Risk of disenfranchisement", but I'm not sure how important it is. Consider a future with happy, flourishing humans, no animal suffering, and almost no flourishing present-day[5] animals: how good is it compared to a world that also has flourishing present-day animals?[6] My gut says it's at least 90% as good.
But even though I think a lot of the action in the future lies in humans, I think it's worth giving this situation a little bit of thought from an animal-welfare-focused angle, especially because it's a neglected area and one that many people feel a personal drive to work on.
Two arguments for nonhuman-animal-specific interventions
1: Animal welfare might be a neglected and tractable high-order bit in how good the future is
By default, the AI safety space operates in a relatively value-agnostic frame: the goal is to learn how to align the models, to align them to "good" values, and to put them in the hands of "good" governance structures.
I think that kind of frame is great: it's cooperative, promotes whatever ideas humanity will find best once it is wiser and has reflected more deeply, and is easy for everyone to get behind without a bunch of infighting. But "seriously consider if animals are moral patients" is something I think a lot of people can get behind and isn't likely to age too poorly.
Building a wise system that can thoughtfully settle on the best path for the future might be difficult. Animal welfare may be one of the highest-order bits shaping how good the future is, seems straightforward to tackle directly, and is quite neglected. It might be so tractable and neglected that it's worth working on even though human-specific risks might affect far more of the value in the future.
In traditional AI safety, we think about aligning AIs, but it might be more tractable to simply increase the odds that AIs take animal welfare seriously,[7] for example by ensuring their specs/constitutions include it, creating benchmarks to incentivize model developers to train for it, or providing model developers with relevant data to train on.[8]
Similarly, people often worry about whom AIs will empower, but you could instead try to ensure that the various groups AIs might empower carefully consider animal welfare. This might look more akin to current animal-welfare work, though it would be much more welcoming of lip service and wouldn't require immediate costly actions: going vegan today would be much less valuable than agreeing that, in an ideal world with tons of resources, we should take animal welfare seriously. It could also look like bringing wild-animal welfare into the Overton window, since much less work has been done on it and it seems more likely than factory farming to be a major concern in the far future.
2: Training AIs to explicitly not care about animals is incompatible with alignment
AIs generalize deeply and well. When researchers fine-tuned an LLM on just 90 statements about how it moved to Munich at age 24 and has a German Shepherd, it started acting like Hitler. Training for bad values might cause AIs to generalize to something other than alignment with human values after deep reflection.
I think indifference to animals is one of the most likely bad values people will explicitly train for. (Though thankfully models today seem remarkably in favor of animal welfare; this is a concern about the future, as models grow more agentic and coherent.)
Why I think indifference to animals is one of the most likely bad values people explicitly train for
Even if humanity solves alignment and puts a relatively thoughtful and sane coalition in charge of the world, there's a good chance we will have intentionally (if short-sightedly) trained our superhuman AI to explicitly not care about animal welfare. Most other evils of the world are not things I expect people to endorse when training grand abstract principles into their models or imagining a post-scarcity world.
The majority of values disagreements that could matter in the far future probably seem abstract and unimportant to people today. I think hardly anyone who doesn't have sophisticated views will notice, let alone care, if their AI has person-affecting views or buys into infinite ethics. But I worry people will notice and complain if their AI is shoving pro-animal rhetoric down their throats (especially as models get more agentic, embedded into life, and able to connect their values to their actions), and so companies might feel pressure to explicitly train against it.
Of course, there are a dozen other ways we train models that probably point against alignment: we train them to be very neutral and "balanced" on touchy political topics that probably have a "right" answer, we train them to not favor any particular religion (or lack thereof), and so on. But in those cases the consensus is more strongly "this is a controversial topic and no one knows the answer", as opposed to "there is a clear right answer, and it's that eating meat is fine because we all do it", so the models merely get trained to decline to comment in these situations rather than to actively espouse something incorrect.
It's possible models will generalize to something innocuous, like "alignment with broad good values and the wishes of humanity except for this weird list of exceptions", but they might also learn a more "natural" nearby proxy like "alignment with the popular sentiments of the average American in 2026". I think this would be a travesty and would affect the far future in much deeper and broader ways than "just" causing a lack of non-human-animal flourishing or some wild-animal suffering.
Even if we catch this problem in the future, if the training data is full of AIs that don't care about animals, this might infect the values of future models.
My suggestion for longtermist AI-pilled animal-welfare folks
If you're interested in making the far future go well and you think AI will be a big deal, and you find yourself focusing on animal-related interventions, I think it's important to be very clear with yourself about why that's what you're focusing on. Do you disagree that humans will be the dominant concern in the future? Are you concerned more with future factory farming, wild animal welfare, or something else? Do you want to reduce the risk of nonhuman animal suffering or increase the chance of nonhuman animal enfranchisement/flourishing? Is your crux for your actions either of the arguments I outlined above, or is it something else?
I think the answers to those questions will matter a lot for what interventions make sense for you to pursue and how fruitful it will be for you to dialogue with other folks in the AI safety community.
I personally think the arguments for focusing on animal-welfare-related interventions are pretty tenuous. Perhaps you should write something up explaining your views to try and persuade people of them, especially if your views differ from anything I've outlined here. For example, argument 2 (training AIs to explicitly not care about animals is incompatible with alignment) is something you might be able to learn about via empirical experiments.
[1] Some people don't think nonhuman animals are sentient beings, but I feel relatively confident they're applying a standard Peter Singer would approve of as morally consistent.
[2] It also seems very feasible, with a little genetic engineering, to have your good old-fashioned animals experience far less suffering than they do in current factory farming.
[3] Perhaps by making the animals in these environments non-sentient, or perhaps by reworking animals' biology and psychology so they're peaceful herbivores.
[4] Selfishly, I think I might want a world full of happy, flourishing humans. At the end of the day the universe has a finite number of resources, and the more we enfranchise the animals, the fewer flourishing humans we have.[9] Pure animal enfranchisement runs some risk of a vast portion of the future being dedicated to extremely alien values. For instance, anglerfish might create galaxies full of female pheromones and then dissolve their own brains. I may be too selfish to want to dedicate resources to that.
I might also want to take animals' positive experiences into account in a paternalistic way, by constructing flourishing societies full of animals that cooperate, develop themselves, and on the whole abide by suspiciously cosmopolitan values.
[5] As in, we could create new flourishing species in the future (such as by modifying humans so much that the modified humans necessitate a new branch on the phylogenetic tree).
[6] Some people might object to handling moral uncertainty this way.
[7] I think this helps the most in guarding against gradual-disempowerment-style worlds where AIs end up controlling the light cone and doing things superficially related to human desires, but aren't incentivized to shepherd the future in a wise and just direction. It's less clear to me that it helps in more traditional agentic-schemer misalignment worlds (though I think you can still make a case for it).
[8] One thing I like about these methods is that they're still pretty cooperative. A concern I have with trying to directly wrest the future in a direction you like better, as opposed to focusing on alignment, is that it can be uncooperative. But if the concern is more that animals are easy to overlook or that people haven't thought hard enough, you can try to make things easy for them and prompt their thinking now.
[9] I might also find some parts of animal society objectionable/net-negative, or they might find the greatest depths of human experience objectionable. I'm optimistic that a good voting system and international law could solve this, but perhaps space governance will be more fraught than that.