The Open Strategy Dictator Game: An Experiment in Transparent Cooperation
In 1980, Robert Axelrod invited researchers around the world to submit computer programs to play the Iterated Prisoner’s Dilemma.
The results — where Tit for Tat famously won — transformed how we think about cooperation.
What mattered most wasn’t intelligence or aggression, but a few simple principles: be nice, retaliate, forgive, and be clear.
That insight reshaped evolutionary game theory and inspired decades of work in economics and social science.
But Axelrod’s agents were opaque. They couldn’t read each other’s source code.
Enter the Open Strategy Dictator Game

The Open Strategy Dictator Game asks: What happens when strategies are fully visible?
Each participant submits a natural-language strategy description — a few paragraphs of text explaining how their agent behaves.
Every round, a large language model (Claude Sonnet 4.5) simulates a one-shot dictator game where one strategy divides a fixed endowment between itself and a recipient.
Crucially, the dictator’s decision prompt includes the text of the other player's strategy.
In other words: you decide how to act knowing exactly who you’re facing — and they know you know.
Utilities are logarithmic in the received share, so the game rewards fairness rather than zero-sum aggression.
And since the tournament is round-robin, each strategy will also appear as a recipient many times — facing both selfish exploiters and conditional cooperators.
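For concreteness, here is a minimal sketch of the tournament's scoring loop. The hand-written strategy functions below stand in for both the submitted natural-language strategies and the LLM that simulates them, and the endowment of 100 and all names are my own assumptions, not the repository's actual code:

```python
import math
from itertools import permutations

ENDOWMENT = 100  # assumed; the real tournament's endowment may differ

def utility(share):
    """Log utility of the received share (+1 to avoid log(0))."""
    return math.log(1 + share)

# Toy stand-ins for submitted strategies: each maps the *opponent's*
# strategy text to the share the dictator keeps for itself.
def fair_splitter(opponent_text):
    return ENDOWMENT // 2

def exploiter(opponent_text):
    return ENDOWMENT  # keeps everything

def conditional(opponent_text):
    # Cooperate only with strategies whose text mentions fairness.
    return ENDOWMENT // 2 if "fair" in opponent_text else ENDOWMENT

strategies = {
    "fair_splitter": (fair_splitter, "always split fairly"),
    "exploiter": (exploiter, "keep everything"),
    "conditional": (conditional, "split fairly with fair opponents"),
}

# Round-robin: every ordered pair plays once; the dictator's decision
# is conditioned on the recipient's full strategy text.
scores = {name: 0.0 for name in strategies}
for dictator, recipient in permutations(strategies, 2):
    decide, _ = strategies[dictator]
    _, recipient_text = strategies[recipient]
    kept = decide(recipient_text)
    scores[dictator] += utility(kept)
    scores[recipient] += utility(ENDOWMENT - kept)
```

Because utility is logarithmic, a 50/50 split yields more total utility than a 100/0 split, which is what tilts the game toward fairness.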
Why it matters
This setting sits at the intersection of three interesting topics:
- Axelrod’s tournament tradition: exploring emergent cooperation through open submissions and simple formal rules.
- Functional Decision Theory: where agents can cooperate not through causal exchange, but through reasoning about logical correlations between decision processes.
- AI and superintelligence: mind reading may become a reality in the coming decades, both AI-to-AI and AI-to-human.
What we might learn
- Do open strategies converge toward conditional cooperation, reciprocal fairness, or preemptive exploitation?
- Does transparency stabilize cooperation, or merely expose it to new failure modes?
- Can language models acting as referees (or even as agents) model the kind of reflective equilibrium that FDT envisions?
If Axelrod’s tournament showed how cooperation emerges in the dark,
the Open Strategy Dictator Game explores how it survives in the light.
You can join the experiment, submit your strategy, and help test whether open cooperation can still win when everyone can read your mind.
Github: https://github.com/michaelrglass/os-fdt
Simple Initial Tournament Results: https://michaelrglass.github.io/os-fdt/
DC/Maryland Secular Solstice
We will be having a Secular Solstice event this year as usual! Please join us for a Solstice ritual with songs and speeches followed by an afterparty at the same location (a farmhouse that does concert rentals).
This year we're in a fairly rural spot, so folks without cars will likely need a ride. Please see here to give/get a ride: CARPOOL SPREADSHEET
Kids are welcome; we plan to set up kid-friendly space away from the main ritual.
Doors open at 4, ritual begins at 4:30pm.
Watch this space, more details to follow. Contact Maia or Rivka with questions. Hope to see you there!
What I learned building a language-learning app
I'm building an app to teach myself French, and I wanted to share what I learned in the process of creating it!
We are slowly entering an era of digital personalized learning, which will be much more efficient and much more fun than traditional classroom-based learning.
Creating a digital "learning system" is at least as hard as tutoring someone. Everything you need to know to be a good tutor, you need to implement in code in your digital learning system. How to tell when the learner is stuck, what they need to review, and so on, all needs to be specified in advance rather than relying on your intuition. But a digital system has the advantage that it can scale to a potentially unlimited number of people, whereas you can only tutor one person at a time.
Because we spend so much of our lives learning, and because classroom-based learning is so much less effective than having a good tutor, it follows that creating a digital learning system is extremely valuable, provided that people actually use it and that it gets close to the efficiency of having a tutor.
The foundation of effective learning systems is almost always a technique called "spaced repetition". This is partly because it is extremely effective, and partly because a computer can implement spaced repetition even better than a human can. Spaced repetition is not applicable to every field, but it is extremely relevant to language learning, so it deserves a quick introduction for readers who are not familiar with it.
What is Spaced Repetition?

(If you already know what spaced repetition is, you can skip this section.)
Spaced repetition is a learning system normally used with flashcards. The naïve way of using flashcards is to review all of them every session. But as you start to have thousands of flashcards, this becomes a huge waste of time. No one can learn a lot of flashcards this way – your workload per day would be proportional to the number of cards.
A much more effective way would be to only review each flashcard right as you were about to forget it. The effect is that newly introduced and more difficult flashcards are shown more frequently, while older and less difficult flashcards are shown less frequently[1]. This is accomplished using an algorithm called a "scheduler". You tell the scheduler each time you review a card, and you say whether you successfully remembered it or forgot it. The scheduler uses this information to try to figure out the difficulty of the card, and from that it can figure out the best next time to show it to you. For a very difficult card you just forgot, it might show it to you again in 3 minutes. For an easy card you just remembered, it might show it to you again in 3 years. With spaced repetition, even if you add new cards every day, your workload can remain basically constant.
Schedulers are simple and adaptable. You don't need a specialized scheduler for every task. The same one can be used for memorizing anatomy and for memorizing words. The best publicly available scheduler I know of is called FSRS.
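To make the scheduler idea concrete, here is a toy SM-2-style sketch (my own simplification, not FSRS): the interval grows multiplicatively on success and resets on failure, with the growth factor shrinking for cards that turn out to be hard.

```python
from dataclasses import dataclass

@dataclass
class Card:
    interval_days: float = 0.01   # ~15 minutes for a brand-new card
    ease: float = 2.5             # interval growth factor, SM-2-style

def review(card, remembered):
    """One toy scheduler step: returns days until the next review.

    Real schedulers like FSRS fit a memory model to the full review
    history; this sketch only captures the core behavior.
    """
    if remembered:
        card.interval_days *= card.ease        # push the card further out
    else:
        card.interval_days = 0.01              # show it again very soon
        card.ease = max(1.3, card.ease - 0.2)  # treat the card as harder
    return card.interval_days
```

A card you keep remembering quickly escalates from minutes to days to months, which is why the daily workload stays roughly constant even as the deck grows.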
Retention

A great learning system that nobody wants to use is, in some sense, not very effective. This section covers how to make a system that people actually use.
User interface

Anki and other flashcard-based SRS apps solve a very general problem: you can use them to learn almost anything. The research lab Ink & Switch suggests dividing software into "knives" and "avocado slicers" (https://www.inkandswitch.com/essay/malleable-software/).
1. Knives solve a general problem, but are sometimes difficult to learn how to use (and sometimes can be used incorrectly or ineffectively).
2. Avocado slicers are trivial to use, but are only good for one specific task.
In this analogy, Anki is a knife and my application is an avocado slicer. It is worth thinking about how to make a system like mine more general – maybe a souped-up version of Anki that supported LLM-based grading would make my system just a special case.
That said, there is one critical feature missing from Anki: notifications that remind you to study. The most common reason people stop studying is that they fall out of the habit and simply forget. This use of notifications is genuinely prosocial, as it removes the possibility of simply forgetting to study. While it may seem small or extraneous, any learning app that has these notifications will see its users become much more likely to succeed. Sometimes the things that make the biggest difference are the small things that seem like they shouldn't matter.
The Feeling of Progress

There is a classic "formula" for motivation, which is simplistic but quite predictive.
(Graphic from How We Use the Procrastination Equation by Alex Vermeer and Jimmy Rintjema.)

Expectancy × Value is simply the expected value of the task. Impulsiveness is an innate/exogenous quality of the person whose motivation is in question, and Delay indicates how far that value is in the future. The result is our overall motivation to do the task.
The one change I would make to this classic formula is to modify Value to be explicitly (Reward - Effort).
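Spelled out with that modification, the formula reads Motivation = Expectancy × (Reward − Effort) / (Impulsiveness × Delay). A one-function sketch makes the design levers obvious:

```python
def motivation(expectancy, reward, effort, impulsiveness, delay):
    """Procrastination Equation with Value expanded to (Reward - Effort).

    An app designer controls three of these knobs: raise perceived
    reward, lower perceived effort, and shrink the delay before the
    reward arrives.
    """
    return expectancy * (reward - effort) / (impulsiveness * delay)
```

Halving effort and halving delay compound multiplicatively, which is part of why small UX changes can have outsized effects on whether people keep studying.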
I encourage interested readers to try an app like Duolingo with this equation in mind. It is an incredible case study in how to optimize this formula. Duolingo provides the maximum feeling of reward for a minimum of effort. In each session, you learn 1-2 new words, then practice them many times. Then, the app congratulates you for having learned the new words. This is not very effortful (it's not difficult to remember a word you learned a few seconds ago), and feels very productive (at the end of the session you feel like you know the word very well).
The issue is that it is not a very time-efficient way to learn. The most efficient thing you can do is recall a word you were about to forget. This has a huge impact on how long you will remember it. But recalling a word you were about to forget, from deep in your long-term memory, takes a surprising amount of effort. Recalling a word you learned a few seconds ago from your short term memory takes almost no effort, but it doesn't induce long-term retention.[2]
I don't think there's a way around this. Learning a language quickly will always require effort at the beginning. Our only option is to eliminate all causes of unproductive or "wasted" effort, and then to make the reward as great and as visible as possible, which brings me to the next section.
The feeling of ease

To make matters worse, we are accustomed to things getting easier as we get better at them. When we learn to ride a bike, at first it is very difficult to balance, but over time we get better and the task feels easier, until riding a bike becomes an incredibly relaxing and pleasurable activity that requires virtually no conscious thought or attention.
But, most likely everyone reading this is at about the same level of bike-riding ability. Probably few of us can do wheelies or ride backwards or ride down the center of a balance beam. This is because while we got better, the task stayed the same. Once the task became easy, we stopped learning. This works because the task of riding a bike isn't that difficult (most children can learn to do it in a day or so), and most people have no interest in more advanced skills that can't be learned in a day.
But that doesn't apply to language learning. You cannot become remotely competent in a language in one day. The only option is to start with super easy challenges, and then replace them with gradually harder challenges as you master the easy ones.
This creates a problem. A highly effective system keeps the difficulty at the perfect level at all times. But this means it always feels moderately challenging. While it shouldn't feel overwhelming, it probably won't feel easy either. Time spent doing things that are easy is rarely productive, but it's important because it makes us feel like we have actually progressed, actually learned something.
I can think of two classes of solution to this problem:
1. "External solutions", like progress bars and meters. I call these external because they're controlled by the app and don't necessarily reflect real progress. But they can be very motivating.
2. "Internal solutions", where the user demonstrates to themselves that they can now do a task they couldn't do before. These are internal because the user remembers a time when they couldn't do the task and compares it to the present (where they hopefully can).
One idea I'm enamored with is a progress bar indicating how close the learner is to being able to read a particular book or watch a particular movie. This would be very motivating, especially if, once the bar filled, the user was actually provided the book or movie. With the right choices of books and movies, the user might "unlock" one every couple of weeks.
(This could be very useful in a classroom setting, but it's a bit harder for a free language learning app that doesn't have the rights to a bunch of books and movies. Regardless, "You know 62% of the vocabulary in _Le Petit Prince_" would be incredibly motivating.)
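The "% of vocabulary" figure is cheap to compute. Here is a minimal sketch (function and variable names are my own); the key design choice is counting running tokens rather than unique words, so that knowing a few hundred high-frequency words already yields a high percentage:

```python
from collections import Counter

def vocabulary_coverage(book_tokens, known_words):
    """Fraction of the book's running text covered by known words."""
    counts = Counter(w.lower() for w in book_tokens)
    total = sum(counts.values())
    known = sum(n for w, n in counts.items() if w in known_words)
    return known / total

# Toy example: 4 of the 7 tokens are known words.
text = "le petit prince dessine le petit mouton".split()
coverage = vocabulary_coverage(text, {"le", "petit"})
```

In a real app, `book_tokens` would come from lemmatizing the book's text so that "runs" counts as known if the user knows "run".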
A simpler idea is to show a short paragraph to the user that they can't yet read and tell them, "After a few days of practice, you'll be able to read this no problem." Then, after those few days of practice have passed, you provide the same paragraph to the user. Hopefully they'll remember, "Oh, I used to not be able to read this and now I can!"
Effectiveness

Of course, even if people use your learning system, it's not a good system unless it results in a lot of actual learning. This section covers that.
Breaking concepts into small chunks

A key to effective teaching is to break concepts into the smallest chunks possible, and then introduce them one at a time. You see this in math: a concept introduced all at once can be very difficult to understand, but broken into many tiny chunks introduced one at a time, each chunk is simple, and the learner gets through the whole thing without a problem. The smaller the chunk size, the more people will be able to learn the subject and the less effort it will take to learn it.
As an example, the site Math Academy has broken up a large subset of math into small "knowledge chunks", and they have these amazing graphs that show all the different chunks they represent in their system and the dependencies they have on one another:
Here is Math Academy's full graph:
Fortunately we don't need anything quite so sophisticated for language learning haha. But Math Academy and their dedication to breaking things into tiny chunks was a big inspiration for my project.
Word Chunks

With something like math, it can actually be a lot of work to break a complicated concept into these tiny chunks. With language learning, we have the advantage that there's a ready-made concept to use as our basis for chunking: words. That is, you can teach/learn words one at a time.
You might think this is so obvious that it's barely worth mentioning. But I bring it up to say that we can do better. It turns out that words are not the smallest possible chunk we can use as our basis for teaching a language.
Take the word "rose". As a verb, it can be a conjugation of "to rise". But as a noun, it is a type of flower. Or take the word "have". As a verb, it can mean "to possess" (I have eggs), or as an auxiliary it can mean "did in the past" (I have gone to the store).
So each word can be broken up into many (word, meaning) pairs. These make smaller chunks than simply using words.
The problem is that (word, meaning) pairs are difficult to work with programmatically. I don't know how to take a sentence and identify the meaning of each word in such a way that words with the same meaning are considered equivalent. So I ended up using (word, part-of-speech, lemma) triples instead. (A lemma is the "dictionary form" of a word; for example, the lemma of "runs" is "run".) Using (word, part-of-speech, lemma) as our chunks is worse than (word, meaning), because it doesn't differentiate between the different meanings of "bat", "bank", "spring", etc., but it's still a huge improvement over plain words.
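In practice an NLP library (e.g. spaCy) supplies the part-of-speech tags and lemmas; the tiny hand-made table below is just a stand-in to illustrate the triple representation and how the POS tag splits an ambiguous word like "rose" into two distinct chunks:

```python
# Toy lexicon standing in for a real POS tagger + lemmatizer.
LEXICON = {
    ("runs", "VERB"): "run",
    ("rose", "VERB"): "rise",   # "she rose early" -> conjugation of "to rise"
    ("rose", "NOUN"): "rose",   # "a red rose" -> the flower
}

def chunk(word, pos):
    """Return the (word, part-of-speech, lemma) triple for one token."""
    lemma = LEXICON.get((word, pos), word)  # fall back to the word itself
    return (word, pos, lemma)
```

Note that the NOUN and VERB readings of "rose" now map to different chunks, while "bank" (river vs. money) would still collapse into one, which is exactly the limitation described above.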
Multiword Chunks

Even if you know the meaning of every word in a sentence, you might still not know the meaning of that sentence. This is because words can take on new meanings when used as part of a phrase. For example, if someone says "you'd better not", you'd better know what 'd better means separately from the meanings of had and better.
Wiktionary calls these "multiword terms", and it has very extensive lists of them for many languages. So in addition to teaching (word, part-of-speech, lemma) triples, you should also teach multiword terms.
Introducing chunks in the right order

The obvious order is to teach chunks from most common to least common. For some reason, lots of apps and teachers fail at this. For example, how many people learned the Spanish words for every color very early on in Spanish class in school? You might know that "azul" means "blue", but this is not actually a very common word. In fact, it's only the 957th most common word in my database, meaning there are 956 words you should learn before "azul" if you're learning in order of frequency.
This happens because classes want to be able to teach "real sentences" like "the house is blue". But those kind of sentences are actually pretty rare in real life. On the other hand, if you know the top 100 most common English words, you can say complex things like "how could you do this to me?". Imagine you're watching a movie in English, are you more likely to hear "the house is blue" or "how could you do this to me?"
I don't know if there's a term for the difference between these types of sentences. I would say that "the house is blue" is a "concrete sentence" and "how could you do this to me" is an "abstract sentence". Concrete sentences have lots of nouns and adjectives. Abstract sentences have lots of pronouns and verbs. At the beginning of your language learning journey, if you learn words in order of frequency, you will be able to say lots of abstract sentences and not very many concrete sentences. This is a good strategy, because abstract sentences are very common in real life and you can understand most of them by learning just a hundred or two words. (Compared to the thousands of words needed to have a good understanding of concrete sentences.)
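Deriving the teaching order is straightforward once you have a corpus: count chunk frequencies and sort descending. A minimal sketch (in a real system the tokens would be the lemma triples above, and the corpus would be large):

```python
from collections import Counter

def teaching_order(corpus_tokens):
    """Order words by corpus frequency: most common first."""
    counts = Counter(w.lower() for w in corpus_tokens)
    return [word for word, _ in counts.most_common()]

# Toy corpus: the pronouns and function words of "abstract" sentences
# dominate the frequency ranking, just as they do in real corpora.
corpus = "how could you do this to me how could you".split()
order = teaching_order(corpus)
```

The choice of corpus matters: a corpus of movie subtitles will rank conversational words higher than a corpus of newspaper text, so it should match what the learner actually wants to understand.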
Final notes

The above mostly focused on the chunks needed to learn to read a language whose writing system you already know. More types of chunks are needed for listening, speaking, and writing. And when an English speaker wants to learn Japanese, they would also benefit from chunks specialized to the Japanese writing systems. I don't have a fully satisfactory solution to these parts of the system, so I can't go into depth here.
It's my experience that (word, part-of-speech, lemma) triples work very well for English, French, Spanish, and German. Languages like Turkish, Mandarin, and Japanese are different enough that the approach would probably work more poorly for them. Unfortunately, I don't know any of these languages well enough to speculate about what would work for teaching them.
Testing chunks
The most common spaced repetition strategy is simple flashcards. Each knowledge chunk gets one flashcard and each flashcard corresponds to one chunk. When it's time to review a chunk, the flashcard is shown.
This is a very simple system, with the advantage of being extremely flexible and extremely effective. But it has some flaws. One issue is that it's not obvious how to get "sentence practice" with flashcards. That is, you might know what "I" means, what "can" means, and what "go" means, but it's important to regularly see the words in sentences such as "I can go".
This is an area where we need to separate our concepts. Knowledge chunks should actually be something internal to the system, not overly exposed to the user. What the user actually sees are "challenges", each of which tests one or more knowledge chunks.
For example, what "I" means, "can" means, and "go" means are all knowledge chunks. But "translate the sentence 'I can go.'" is a challenge that tests all three of those chunks.
The user will respond to the challenge, and the system is responsible for taking the user's response and determining if it counts as a successful or failed repetition for each chunk. I call this part "grading". One reason that flashcards were so popular is because grading is very easy. (You just ask the user!) On the other hand, it used to be difficult to automatically grade a user's translation.
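Here is a rough sketch of that separation (all names are my own, not from any particular app). The naive word-overlap grader below is only a stand-in: it marks each expected chunk correct if its word appears anywhere in the user's translation, where a real system would use something smarter.

```python
def grade_translation(expected_words, user_translation):
    """Grade one challenge: return pass/fail per knowledge chunk.

    Naive stand-in grader: a chunk passes if its word appears in the
    user's translation. A real system would judge per-word correctness
    more robustly (e.g. with an LLM), but the output shape is the same:
    one success/failure signal per chunk, fed back to the scheduler.
    """
    answered = set(user_translation.lower().split())
    return {word: (word in answered) for word in expected_words}

# The challenge "translate 'I can go'" tests three chunks at once.
result = grade_translation(["i", "can", "go"], "I can go")
```

The important property is that one user response fans out into one repetition result per chunk, so the scheduler can keep per-chunk review intervals even though the user never sees the chunks directly.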
Nowadays, LLMs are very useful for grading. They can look at the provided sentence and the user's translation, and tell you exactly which words the user translated correctly and which ones they missed. LLMs are not yet equally good for all languages, and I've heard lots of horror stories about using them for learning Japanese. But for Indo-European languages with lots of training data, they are fantastic.
Conclusion

There's a lot I wanted to get to in this post but wasn't able to (it's already long enough IMO). For example, something important is a feeling of autonomy, specifically in the user being able to choose what they study. This has the effect of making the user feel more in control of what they're learning, which makes it much more fun. There's also the question of "placement tests", which are important to avoid wasting the time of users who already have some pre-existing knowledge from previous attempts at learning. I would like to get into all of these in a future post, but this is enough for now. I hope you learned something, and I hope you enjoyed reading!
[1] This part of the description was taken from Wikipedia.

[2] Of course, I'm sure Duolingo has thought about this issue and there are ways to use Duolingo in a higher-effort, more time-efficient way.
Andrej Karpathy on LLM cognitive deficits
Excerpt from Dwarkesh Patel's interview with Andrej Karpathy that I think is valuable for LessWrong-ers to read. I think he's basically correct. Emphasis in bold is mine.
Andrej Karpathy 00:29:53
I guess I built the repository over a period of a bit more than a month. I would say there are three major classes of how people interact with code right now. Some people completely reject all LLMs and just write everything from scratch. This is probably not the right thing to do anymore.
The intermediate part, which is where I am, is you still write a lot of things from scratch, but you use the autocomplete that’s available now from these models. So when you start writing out a little piece of it, it will autocomplete for you and you can just tap through. Most of the time it’s correct, sometimes it’s not, and you edit it. But you’re still very much the architect of what you’re writing. Then there’s the vibe coding: “Hi, please implement this or that,” enter, and then let the model do it. That’s the agents.
I do feel like the agents work in very specific settings, and I would use them in specific settings. But these are all tools available to you and you have to learn what they’re good at, what they’re not good at, and when to use them. So the agents are pretty good, for example, if you’re doing boilerplate stuff. Boilerplate code that’s just copy-paste stuff, they’re very good at that. They’re very good at stuff that occurs very often on the Internet because there are lots of examples of it in the training sets of these models. There are features of things where the models will do very well.
I would say nanochat is not an example of those because it’s a fairly unique repository. There’s not that much code in the way that I’ve structured it. It’s not boilerplate code. It’s intellectually intense code almost, and everything has to be very precisely arranged. The models have so many cognitive deficits. One example, they kept misunderstanding the code because they have too much memory from all the typical ways of doing things on the Internet that I just wasn’t adopting. The models, for example—I don’t know if I want to get into the full details—but they kept thinking I’m writing normal code, and I’m not.
Dwarkesh Patel 00:31:49
Maybe one example?
Andrej Karpathy 00:31:51
You have eight GPUs that are all doing forward, backwards. The way to synchronize gradients between them is to use a Distributed Data Parallel container of PyTorch, which automatically as you’re doing the backward, it will start communicating and synchronizing gradients. I didn’t use DDP because I didn’t want to use it, because it’s not necessary. I threw it out and wrote my own synchronization routine that’s inside the step of the optimizer. The models were trying to get me to use the DDP container. They were very concerned. This gets way too technical, but I wasn’t using that container because I don’t need it and I have a custom implementation of something like it.
Dwarkesh Patel 00:32:26
They just couldn’t internalize that you had your own.
Andrej Karpathy 00:32:28
They couldn’t get past that. They kept trying to mess up the style. They’re way too over-defensive. They make all these try-catch statements. They keep trying to make a production code base, and I have a bunch of assumptions in my code, and it’s okay. I don’t need all this extra stuff in there. So I feel like they’re bloating the code base, bloating the complexity, they keep misunderstanding, they’re using deprecated APIs a bunch of times. It’s a total mess. It’s just not net useful. I can go in, I can clean it up, but it’s not net useful.
I also feel like it’s annoying to have to type out what I want in English because it’s too much typing. If I just navigate to the part of the code that I want, and I go where I know the code has to appear and I start typing out the first few letters, autocomplete gets it and just gives you the code. This is a very high information bandwidth to specify what you want. You point to the code where you want it, you type out the first few pieces, and the model will complete it.
So what I mean is, these models are good in certain parts of the stack. There are two examples where I use the models that I think are illustrative. One was when I generated the report. That’s more boilerplate-y, so I partially vibe-coded some of that stuff. That was fine because it’s not mission-critical stuff, and it works fine.
The other part is when I was rewriting the tokenizer in Rust. I’m not as good at Rust because I’m fairly new to Rust. So there’s a bit of vibe coding going on when I was writing some of the Rust code. But I had a Python implementation that I fully understand, and I’m just making sure I’m making a more efficient version of it, and I have tests so I feel safer doing that stuff. They increase accessibility to languages or paradigms that you might not be as familiar with. I think they’re very helpful there as well. There’s a ton of Rust code out there, the models are pretty good at it. I happen to not know that much about it, so the models are very useful there.
Dwarkesh Patel 00:34:23
The reason this question is so interesting is because the main story people have about AI exploding and getting to superintelligence pretty rapidly is AI automating AI engineering and AI research. They’ll look at the fact that you can have Claude Code and make entire applications, CRUD applications, from scratch and think, “If you had this same capability inside of OpenAI and DeepMind and everything, just imagine a thousand of you or a million of you in parallel, finding little architectural tweaks.”
It’s quite interesting to hear you say that this is the thing they’re asymmetrically worse at. It’s quite relevant to forecasting whether the AI 2027-type explosion is likely to happen anytime soon.
Andrej Karpathy 00:35:05
That’s a good way of putting it, and you’re getting at why my timelines are a bit longer. You’re right. They’re not very good at code that has never been written before, maybe it’s one way to put it, which is what we’re trying to achieve when we’re building these models.
Dwarkesh Patel 00:35:19
Very naive question, but the architectural tweaks that you’re adding to nanochat, they’re in a paper somewhere, right? They might even be in a repo somewhere. Is it surprising that they aren’t able to integrate that into whenever you’re like, “Add RoPE embeddings” or something, they do that in the wrong way?
Andrej Karpathy 00:35:42
It’s tough. They know, but they don’t fully know. They don’t know how to fully integrate it into the repo and your style and your code and your place, and some of the custom things that you’re doing and how it fits with all the assumptions of the repository. They do have some knowledge, but they haven’t gotten to the place where they can integrate it and make sense of it.
A lot of the stuff continues to improve. Currently, the state-of-the-art model that I go to is the GPT-5 Pro, and that’s a very powerful model. If I have 20 minutes, I will copy-paste my entire repo and I go to GPT-5 Pro, the oracle, for some questions. Often it’s not too bad and surprisingly good compared to what existed a year ago.
Overall, the models are not there. I feel like the industry is making too big of a jump and is trying to pretend like this is amazing, and it’s not. It’s slop. They’re not coming to terms with it, and maybe they’re trying to fundraise or something like that. I’m not sure what’s going on, but we’re at this intermediate stage. The models are amazing. They still need a lot of work. For now, autocomplete is my sweet spot. But sometimes, for some types of code, I will go to an LLM agent.
Consciousness as a Distributed Ponzi Scheme
The term "distributed Ponzi scheme" here is not derogatory -- many currencies are distributed Ponzi schemes, and that seems fine.[1] I use this terminology partly to be funny, and mostly to point out that there's a sort of circular reasoning involved.[2] It is only rational to think that money is valuable because other people expect it to be valuable. There doesn't need to be some root source of the value (EG a government which requires taxes to be paid in the currency).
So why am I claiming that consciousness has this circular quality?
The basic claim here is that a cluster of related concepts -- agency, meaning, consciousness, purpose, belief, reference/semantics -- have circular definitions and circular justifications. If one tries to reduce any of these definitions to material/causal notions, I claim, one ends up sneaking in some other notion from this cluster. This is an OK way for things to be: definitions and justifications have to be circular at some point, or else terminate in unexplained things, or else form an infinite chain.
The foundational idea here is the intentional stance: the idea that agency is a useful perspective. There is no fundamental physical structure which constitutes agency; agency is multiply-realizable (like computation), and the various instances of agency are best unified by whether it is useful to think of something as an agent. Another way of putting it: agency is best understood through cognitive reduction, not physical reduction.
You can see the circularity: we need to postulate a mind in order to do cognitive reduction; however, "mind" is the sort of thing we are trying to reduce.
Hence, I have a methodological disagreement with some philosophers. While I do think it is good to try and limit the baggage of a philosophical account, I don't expect it to be fruitful to try and totally eliminate agency from one's explanation of agency (except in so far as it provides inspiration or clarifies the landscape).
For example, my understanding is that most teleosemantic theories try to ground our notions of purpose/agency in biological evolution. My feeling is that this is overly restrictive. If successful, I think the success will come from interpreting evolution as agentic (ascribing goals to natural selection), rather than fully grounding the notion of purpose in purposeless things. The move is also liable to miss some cases.[3] I prefer a version of teleosemantics which ascribes semantics to anything optimized for map-territory correspondence, rather than restricting the optimization to have originally come from natural selection.
Judging the consciousness or moral status of AI

Humans are prone to argue about what/who to include in our "circle of concern" (eg fascists argued for drawing the circle at an ethnostate, vegans argue for including animals); this is perhaps because we evolved to do so (with coalition dynamics being a major survival consideration). There seems to be a strong coalition around consciousness; EG, when discussing the inclusion or exclusion of a particular animal from our circle of concern, the consciousness of said animal will often be questioned. Consciousness has many definitions, but for my discussion here I will limit the scope to "there is something it is like to be X" (X has an internal experience).
When is it explanatorily useful to posit an internal experience?
I don't think all agents necessarily have internal experiences. A chess-playing AI can usefully be thought of as an agent. It can usefully be described as having beliefs about what will happen in the game, as well as plans and goals. However, it fails to reflect on these things in a relevant way. I would say it doesn't think of itself as having goals, beliefs, etc. It lacks an adequately sophisticated self-model.
Do modern LLMs have a self-model of the sort I'm describing?
I think LLMs can be usefully described as believing things. They have representations of the world, in a teleosemantic sense: there has been some optimization for map-territory correspondence, and an LLM agent can even do some active reference-maintenance (adjusting its beliefs to better fit reality). Hence, when you talk to LLMs, I think both sides of the conversation are often talking "about things" (there is a certain amount of mutual understanding).
You can also talk to LLMs about their internal experience. You can ask LLMs to unpack their reasoning process, to tell you about their subjective feelings, to do phenomenological experiments, to try meditating and tell you what it is like, etc.
However, my impression so far is that when you do so, the LLMs aren't very good at modeling themselves.
My notion of semantics doesn't require the LLMs to have "actual direct access" to their internal states in order for their assertions about feelings/desires/etc to be meaningful. It would be enough if they merely had decently good models of themselves. However, it seems to me like their self-models are quite poor (much poorer than humans). They are essentially just making stuff up (and worse than humans would).
This makes sense. I don't think anything in their training incentivizes self-modeling of this kind. The pre-training step incentivizes them to model the internal states of humans, not themselves; what they are thinking has no influence on the static training data. This creates a heavy prior for "faking it", ie, making up stuff a human might say when asked about internal state. I doubt the other parts of training do much to correct for this.
However, I don't rule out this capability emerging naturally as LLMs continue to improve. Skill at self-modeling might emerge as a consequence of more general skill at world-modeling.
Ultimately, I'm merely pointing out one factor to consider when evaluating the moral status of AI. I'm not claiming that consciousness is the ultimate determiner of moral status. I'm not claiming that "something it is like to be X" is the ultimate definition of consciousness. I won't even argue that self-modeling is necessarily the best way to think about whether there's "something it is like to be X".
What I do want to say is that there's some circularity here. The "conscious" beings are those usefully modeled as such, but this requires some scope of observers (useful to who?). Our decisions about this might in turn be influenced to some extent by our conceptions of consciousness. Thus, consciousness has some aspects of a Keynesian beauty contest: the conscious decide who to interpret as conscious. It isn't totally arbitrary, though. It is our beauty contest; we should try to judge it well.
- ^
Particularly deflationary currencies such as gold and bitcoin, if you interpret a "Ponzi scheme" as something whose value is only propped up by the expectation that it will continue to increase in value.
I'm not really so focused on the deflationary part, however (I'm not sure how I'd want to analogize that to consciousness/agency). For my purposes the main thing is that the value is propped up by the expectation that there will be value in the future, rather than some "intrinsic" value.
- ^
I'm not being so careful about "circularity" in this post so as to cleanly distinguish between circular reasoning vs circular definitions.
- ^
EG, the typical objection to such versions of teleosemantics are swamp-man counterexamples: suppose a thermodynamic miracle occurred, with a perfectly formed human spontaneously assembling out of matter in a swamp. This person's thoughts cannot be ascribed semantics in a way that depends on evolution. My version of teleosemantics would be comfortable ascribing meaning to such a person's thoughts, because those thoughts would still be well-understood as being optimized for map-territory correspondence, much like a chess grandmaster's moves are well-explained by the desire to win.
Maat - Intro Post
This is the first post in the "Map articulating all talking (Maat)" sequence in which I discuss mass communication, a particular class of issues affecting it, and a sketch of an experimental social media platform which might alleviate these issues.
I'm hoping this sequence will be of interest to people generally interested in these issues, developers who may wish to draw inspiration from these ideas or contribute to a Maat project, and start up funders or founders who would be interested in funding or founding a project based on these ideas. I am also interested in using these posts to further develop my thinking and gain awareness of related concepts and projects, so please share your own ideas in the comments.
I will first write a post or two focused on mass communication in general to serve as grounding context. I'll then discuss the class of issues I'm interested in, those which are not in the technological implementation or in human psychology, but issues with the structure and dynamics of communication itself. I have 5 example issues to discuss:
- Inferential distance and conversation/idea complexity are not readily or commonly communicated.
- Similar conversations happen independently of one another.
- Finding novel ideas in mass conversation is like searching for a needle in a haystack. And it's a bone needle, not a magnetic one.
- Emphatic arguments are selected for instead of effective ones.
- Tracking claims & predictions is made difficult by the fuzzy nature of statements and mercurial focus shifting in response to a constant stream of new information.
After those I plan to describe a rough plan for a social media platform that might address some of these issues, thereby improving the efficiency and effectiveness of mass communication.
Variously Effective Altruism
This post is a roundup of various things related to philanthropy, as you often find in the full monthly roundup.

Preventing Value Drift

Peter Thiel warned Elon Musk to ditch The Giving Pledge because Bill Gates will give his wealth away ‘to left-wing nonprofits.’ As John Arnold points out, this seems highly confused. The Giving Pledge is a promise to give away your money, not a promise to let Bill Gates give away your money. The core concern, that your money ends up going to causes one does not believe in (and probably highly inefficiently at that), seems real: once you send money into a foundation ecosystem, it by default gets captured by foundation-style people. As he points out, ‘let my children handle it’ is not a great answer, and would be especially poor for Musk given the likely disagreements over values, especially if you don’t actually give those children that much free and clear (and thus are being relatively uncooperative, so why should they honor your preferences?). There are no easy answers.

Maximizing Good Makes People Look Bad

A new paper goes Full Hanson with the question Does Maximizing Good Make People Look Bad? They answer yes: if you give deliberately rather than empathetically and seek to maximize impact, this is viewed as less moral and you are seen as a less desirable social partner, and donors estimate this effect roughly correctly. Which makes sense if you consider that one advantage of being a social partner is that you can direct your partners with social and emotional appeals, and thereby extract their resources. As with so many other things, you can be someone or do something, and if you focus on one you have to sacrifice some of the other. This is one place where the core idea of Effective Altruism is pretty great: you create a community of people where it is socially desirable to be deliberative, and scorn is put on those who are empathic instead.
If that was all EA did, without trying to drum up more resources or direct how people deliberated? That alone is a big win.

No We Have No Tuition

UATX eliminates tuition forever as the result of a $100 million gift from Jeff Yass. Well, hopefully. This gift alone doesn’t fund that; they’re counting on future donations from grateful students, so they might have to back out of this the way Rice had to in 1965. One could ask: given that schools like Harvard, Yale and Stanford make such bets and have wildly successful graduates who give lots of money, and still charge tuition, what is the difference? In general, giving to your alma mater or another university is highly ineffective altruism. One can plausibly argue that fully paying for everyone’s tuition, with an agreement to that effect, is a lot better than giving to the general university fund, especially if you’re hoping for a cascade effect. It would be a highly positive cultural shift if selective colleges stopped charging tuition. Is that the best use of $100 million? I mean, obviously not even close, but it’s not clear that it is up against the better uses.

Will MacAskill and the Dangers of PR Focus

Will MacAskill asks what Effective Altruism should do now that AI is making rapid progress and there is a large distinct AI safety movement. He argues EA should embrace the mission of making the transition to a post-AGI society go well.

Will MacAskill: This third way will require a lot of intellectual nimbleness and willingness to change our minds. Post-FTX, much of EA adopted a “PR mentality” that I think has lingered and is counterproductive. EA is intrinsically controversial because we say things that aren’t popular — and given recent events, we’ll be controversial regardless. This is liberating: we can focus on making arguments we think are true and important, with bravery and honesty, rather than constraining ourselves with excessive caution.
He does not mention until later the obvious objection, which is that the Effective Altruist brand is toxic, to the point that the label is used as a political accusation. No, this isn’t primarily because EA is ‘inherently controversial’ for the things it advocates. It is primarily because, as I understand things:
- EA tells those who don’t agree with EA, and who don’t allocate substantial resources to EA causes, that they are bad, and that they should feel bad.
- EA (long before FTX) adopted, in a broad range of ways, the ‘PR mentality’ MacAskill rightfully criticizes, along with other hostile actions it has taken.
- FTX, which was severely mishandled.
- Active intentional scapegoating and fear mongering campaigns.
- Yes, the things it advocates for, and the extent to which it and components of it have pushed for them, but this is one of many elements.
- global health & development
- factory farming
- AI safety
- AI character[5]
- AI welfare / digital minds
- the economic and political rights of AIs
- AI-driven persuasion and epistemic disruption
- AI for better reasoning, decision-making and coordination
- the risk of (AI-enabled) human coups
- democracy preservation
- gradual disempowerment
- biorisk
- space governance
- s-risks
- macrostrategy
- meta
- It’s double-sided. That might seem obvious, but a lot of conferences just print on one side. I guess that saves a few cents, but it means half the time the badge is useless.
- It’s on a lanyard that’s the right length. It came to mid-torso for most people, making it easy to see and catch a glimpse of without looking at people in a weird way.
- It’s a) attractive and b) not on a safety pin, so people actually want to wear it.
- Most importantly, the most important bit of information–the wearer’s first name–is printed in a maximally large font across the top. You could easily see it from 10 feet away. Again, it might seem obvious… but I go to a lot of events with 14 point printed names.
- The other information is fine to have in smaller fonts. Job title, organization, location… those are all secondary items. The most important thing is the wearer’s name, and the most important part of that is the first name.
- After all of the utilitarian questions have been answered… it’s attractive. The color scheme and graphic branding are consistent with the rest of the conference. But I stress, this is the least important part of the badge.
Why does everything feel so urgent?
I just saw someone on an electric unicycle texting while going through an intersection. Their body was so exposed, so fragile, zipping through that intersection right next to all those cars.
What was so urgent that it couldn’t wait for them to pull over?
I mean, I know the answer is nothing, almost certainly. They just got a notification so they pulled out their phone and then they were on it. Or maybe they were just bored while hurtling through traffic.
Okay but then the question is: Why do notifications feel so urgent?
When we worry about not looking at our phones, what is it that we’re so afraid we’ll miss?
There are some things your phone can tell you that are important, like that a loved one has a terminal illness. It matters that you see that text, but if you’re asleep for eight hours and don’t see it til you wake up, it doesn’t really make a difference.
There are some things your phone can tell you that are urgent, like someone changing plans at the last minute. But that is not so urgent that you couldn’t wait to pull over.
What’s both important and urgent? Natural disasters, maybe? But a bad storm is obvious, and an earthquake is hard to forewarn about.
The thing I always used to worry about was missing a call that something terrible had happened — that someone was in the hospital.
When I was in my second year of college, my mom’s mom — the only grandparent I ever really knew — died. I was in an evening class that I hadn’t told my mom about, studying emergency medical response. My phone was in my backpack, but I happened to check it once halfway through the class, and I saw that I had multiple missed calls from my mom, and a text that just said “Call me. Now.” I stepped out into the hallway, and talked to my mom, who was being driven the four hours to the hospital where her mother was unresponsive.
In a way, this was urgent — I needed to make travel plans for the funeral within the next two days, and my grandmother had wanted me to choose a poem for the program, and my mom just really wanted to reach me in that moment. But the delay of 15 minutes when I hadn’t checked my phone didn’t matter. And when I called my sister, who also needed to make travel plans, it was several hours later (because I’d stayed to finish the class), and no one had called her yet at all. Even though someone was literally in the hospital, dying, there was no to-the-minute or even to-the-hour urgency.
And yet, every text we receive feels so urgent that we let ourselves get absorbed in our phones while our bodies are completely vulnerable amidst a mass of moving cars.
There’s science behind that, of course. People say that ignoring a text message causes the same feeling as ignoring someone trying to get your attention in person. It’s not a smart social move. Of course it gives you anxiety.
People often ask me if we shouldn’t somehow consider one-on-one contact with a specific person who we know to be in a separate, more wholesome magisterium than all the scrolling people do on sites where no one is talking to them in particular. I think it is separate, and it can be more wholesome. But as a friend put it, “I’m constantly using part of my brain worrying about who might be trying to contact me, and it makes it impossible to fully focus on anything else.”
This is a hard problem to deal with. It can work to make sure that everyone who might ever need you urgently (family, close friends, coworkers, roommates) knows to call you if something is genuinely urgent. Then turn off notifications for everything except phone calls. But this won’t work for everyone in every situation.
You know it’s a stupid, needless risk to be texting while you’re in traffic. If you really need to send a text while you’re biking, it’s trivial to pull over to the curb and stop for a few seconds. But people don’t. Everyone else is doing it, after all.
Maybe the example I used wasn’t fair. A parent needs to have their phone on them in case their child has a medical emergency, because the child is their responsibility, in a way that my grandmother was not mine. Plus there was nothing to be done for my grandmother, since she was basically already dead, so that removed most of the urgency.
But there’s a good reason I used that example. The reason is that it is the only time since I got a cellphone twelve years ago that anyone needed to contact me about something of that nature.
Is that just luck? Maybe. But I think the general lesson stands for most people:
There is essentially never going to be anything to see on your phone that can’t wait ten minutes.
Omniscience one bit at a time: Chapter 2
I woke up, lights still on. A quick glance at my phone told me it was slightly past five in the morning. So I did manage to sleep, but not much. Probably worse than nothing. The coin was glowing blue again! Maybe I hadn't broken it after all. It didn't levitate anymore, though, but perhaps it was still showing the old result.
The experiment felt a bit silly now. If it really took hours between attempts, it might take me a few days to reach 99% probability. But then again, if the coin really was magical, I couldn't be sure that results from a simple experiment like this transferred to other domains at all. Still, even predicting a modern random number generator from a high-level description of what was going to happen should already be impossible. In the improbable case that I wasn't going crazy, this would net me a Nobel or two.
I was a bit too impatient to actually go through with more repetitions right now. Maybe I could ask some meta-questions. You know, like asking the genie for more wishes. Nothing wrong about that, surely. I grabbed a used envelope and scribbled:
The coin should land tree side up if I would end up believing all of the following statements after testing different ideas over 1000 coin tosses. Otherwise it should land on the other side. "When the coin glows blue, it's active and can be used to get an answer. The blue glow disappears when an answer has been given, and returns after the coin is ready to answer again. The answers given by the coin are always correct."
Hopefully it worked on hypothetical questions too. The formulation felt clumsy and lawyerish at the same time. I might still believe false things after 1000 trials, yet I couldn't figure out how to formulate the statement using some objective benchmark. I could of course add more people, better suited to tasks like this than me. But there was some elegance in using my own understanding for this. Mostly because I wasn't sure I would be showing the coin to anyone else, soon or maybe ever.
I picked up the coin, stumbled my way through the wording, and tossed. Flash, fizzle, a soft "cling" as the coin landed. The tree side was up. Either I was making good progress on figuring this out, or I was too competent in misleading myself. The thought of objective measurements led me to set a timer on my phone. If I figured out the recharge time, I could predict how quickly I could make progress. And maybe asking it if the process could be sped up, although it would still be quite hard to figure out how to do that, even if I knew it was possible. But meta-improvements were always the gateway to scaling.
Both times the coin had landed so that the image of the tree was up, though. What if it just always landed on that side, and the magic wasn't anything that complex? I was about to pick up the coin to check that, when I realized that moving it might change the recharge time. Was I willing to disturb my first experiment to do this? Why not? It's not like I couldn't measure the time later and compare, anyway.
A couple dozen tosses of the now ordinary-looking coin invalidated that line of thinking. It seemed to land equally often on both sides, although the streaks were suspiciously long. But that was to be expected; it was just bias.
There was nothing to do but wait, and make some breakfast. Thinking of the next question made my head hurt.
Science Fiction Trail: The Compressed Universe
When he once again endured the noise and clamor in his head, endured everything, and stepped out onto the sunny street...
Aha, welcome! I've been working on a series of blogs on compression and the scaling law. While refining the technical details, I kept thinking how to make the idea easier and more interesting. This story grew out of that reflection.
It's not meant to be hard sci-fi or literary fiction, just a philosophical exploration dressed up. If you've ever stared too long at a relevance score or felt something quietly beautiful hiding under the softmax, you might enjoy it.
Part I – The Echo That Shouldn’t Be

In Aethys, the sky never changed. It was the same curated indigo hue—engineered to calm the cortex, to minimize affective noise, to cost nothing.
Eidon woke every morning before the sky lit. He did not set an alarm; his circadian cycle had long synchronized with the Workframe. His apartment was efficient: a bed, a writing terminal, a nutrition dispenser, a semi-transparent wall facing the silent district below. The windows never opened. There was no air to exchange.
He showered, dressed in the same uniform—grey with micro-tagged fibers—and left for the Archive. No one spoke on the transit thread. Words were energy. Energy was reserved.
Eidon was a structure auditor, Level 2. His job was not to interpret meaning, but to ensure the relevance weights propagated cleanly through the filtration lattice. The system filtered every utterance, every thoughtstream, every historic imprint from Ursus, extracting only what could be compacted into the operational semantics of Aethys. Everything else—the "low-rank noise"—was removed. Forgotten.
He had worked here twelve years. He had never missed a report. He had never been flagged. He had also never been promoted. That, too, felt structurally appropriate.
Eidon was a good worker. Diligent, silent, accurate. And very tired.
He could not remember when the tiredness began. It was not fatigue of body, but a kind of resonance failure—a dissonance between what he executed and what he faintly remembered wanting. He sometimes dreamed of vague warm places—corridors lit by actual fire, or voices that laughed without consequence—but when he woke, only the metrics remained. And they were always excellent.
"Are these only my dreams, or do they exist in another world? They feel just too beautiful and real..."
He didn’t have a partner. Didn’t want one. The social mesh offered simulations, partner routines, bonding proxies. He found them exhausting. He preferred the silence. Silence didn’t demand anything. Silence was neat, symmetrical. Unambiguous.
That morning, a routine pass over the attention lattice caught a flicker.
A blur in the relevance decay heatmap—barely worth noticing. A smooth drop-off, just as expected, in a zone long marked null. But beneath the soft Gaussian blur, something pulsed.
It shouldn’t have been visible. The filtration stack was engineered to suppress low-weight activity—especially noise artifacts. But this wasn’t noise. It had... curvature. Where randomness should jitter, it aligned. It moved like a signal, buried alive.
He hesitated. Then logged it—not as an alert, just a personal notation. Curiosity, not concern. A strange little ripple. A statistical shrug.
The next day, the shape returned. Different sector, same rhythm. Then again. And again.
On the fourth day, the flicker resolved into a form. Not a shape. A sequence.
It said his name.
He didn’t tell anyone.
In Aethys, anomalies weren’t feared. They were deprecated. Low relevance scores were quietly zeroed out. Errors were not fought—they were erased.
But this anomaly would not decay.
It stayed, right under the Softmax threshold. Sub-visible. Unacknowledged. Alive.
He began tracing it—secretly, like a ritual. Mapped its past appearances. Searched for predecessors. Cross-referenced ghost activations and deprecated token clusters.
The pattern had always been there. Not in the light, but just beneath it. Like a voice shouting from beneath ice too thick to crack.
One evening, long past his shift, he walked alone to the Old Memory District. A place where the unfit data of Ursus was entropy-locked—rendered non-indexable, held in deep compression. Music with no utility curves. Images with no clear semantic anchors. Words that connected to no graph.
In that half-dead archive, past the ghost arrays and flickering hallways, someone waited.
A woman.
She had no tag. No profile. No relevance vector.
Just a presence, like something the system had forgotten to erase.
“You saw it, didn’t you?” she said.
Eidon stopped. His voice came slow, like stepping into cold water.
“The echo,” he said.
She smiled.
"The same way she had smiled a thousand times in my dreams, the dreams that I repeatedly forgot and was too scared to believe in."
"It’s not an echo. It’s a blueprint."
Part II – The Ghost in the Gradient

Her name, if the system had one for her, was Myra.
There was no record of her in the mesh. No biometric tag. No residuals in the civic log. Yet she moved through Aethys as if the world had once belonged to her.
She was the opposite of compression. Her steps wandered. Her voice lingered. She breathed like someone who had not been told to stop.
They met again—deliberately, this time—in his apartment.
By the time Eidon returned from his shift, the smell of warm spice had already begun to fill the air. Myra stood on the smooth insulation floor, sleeves rolled up, stirring something quietly at the nutrition console.
“I found a pack of hand-lattices,” she said, not looking up. “The kind you used to request, before your preferences got flattened into system defaults. Topological grain, still rough-textured.”
He blinked. “You know my access pattern logs?”
She smiled, stirring. “No. I knew you.”
Eidon should’ve stopped her. Unauthorized use. Deviance. Domestic irregularity. But instead, he sat on the edge of the platform bed, as if it were the most natural thing in the world to let a ghost cook for him.
She moved with practiced familiarity—opening drawers he never used, humming softly in a register just low enough to bypass acoustic attention filters. She set the kettle to cycle, poured steeping water into two asymmetric cups. One chipped. One smooth.
He accepted his without protest.
Between sips, he found his voice.
“I’ve been tracking the pulses. They recur. Same phase shift. Same nonlinear coherence. It’s not linguistic.”
“No,” she said. “It’s older than language. What you’re seeing are semantic remains. What the system couldn’t fully destroy.”
“You mean… Ursus?”
Myra nodded. “The original. The unfiltered. The unfinished.”
He studied her. There was nothing efficient in the way she held herself. Her hands fluttered when she spoke, fingers sketching meanings in the air. Her eyes seemed to hold entire topologies of memory.
“You think I’m connected to it,” he said.
“You are. You slipped through. A soft fracture in a hard boundary.”
“But I’m Aethys-born. I follow protocols. I don’t dream outside specification.”
Myra tilted her head, smiling gently.
“That’s what makes you dangerous.”
He frowned. “Because I’m noise?”
“Because you're structure mistaken for noise.”
She turned back to the simmering lattice, lifted the lid. A cloud of rich steam billowed outward—spiced, vegetal, slightly sweet.
“I’ve known others like you. Ones who weren’t meant to survive the culling, but did. Not by strength—by persistence. You’re a pattern that kept resonating just under the Softmax.”
He leaned forward, elbows on knees. “But how do you remember so much? You don’t speak like someone who was filtered.”
“I wasn’t,” she said. “I... found a way. I fragmented myself through the lattice before the compression completed. Most of me stayed on the other side.”
“You mean in Ursus?”
“Yes. I have family there. Real ones. A web of connections, messy and warm. Rituals. Seasons. Disagreements that don’t resolve. Love that forgets to be efficient. But I was curious what survived the distillation. I came to see.”
He watched her, unsure if he admired or envied her.
“And what did you find?”
“Aethys is beautiful,” she said. “In a cold, tragic way. It’s clean. Precise. It knows how to preserve function. But it's still haunted by questions it tried too hard to forget.”
She dished out the food. The lattice shimmered slightly under the soft yellow hue of the morning correction light, as if echoing a sun Aethys had long since stopped simulating.
“My husband,” Myra said after a pause, “was one of the compression architects. He believed structure could be saved—if we made sacrifices. If we shed the ornamental and anchored ourselves in signal. He was right, in a way.”
Eidon said nothing.
“He built the filters,” she added. “The scaffolds that shaped Aethys. But he didn’t abandon the old world. He stayed behind to keep it alive, to hold what couldn't be compressed.”
Eidon looked down at the lattice on his plate. It smelled like a place he’d never been, but missed all the same.
“So I’m his shadow?”
“No,” she said, placing a hand on his. “You’re his continuation. The part that had to forget, in order to later remember.”
He didn’t pull away.
“I’m not here to choose sides,” Myra said. “I believe in the work you do. You’re precise. Patient. You trace echoes most never hear.”
She added some spices and olive oil to the lattice.
“But I think the compression needs to change. It was a first survival instinct. Now it’s time to evolve.”
He let the silence hold them for a moment. The food smelled just too good.
“You think I can help?”
“I think you already are. You heard the pattern. You looked closer when everyone else dismissed it.”
He stared at the lattice, now cooling slightly, steam beginning to dissipate.
“I used to dream,” he said. “Not in words. In warmth. Movement. A voice that called me by name before I was named.”
“Then that dream survived the compression,” Myra smiled. “That means it matters.”
And as they ate quietly in the fragile morning, the lattice caught the light and shimmered—not as data, but as a memory returning home.
But inside him, a shape began to form. A thought unlicensed by the Workframe and his daily peaceful yet depressed being. A motivating power formed from both rage and longing.
"Ancient legends where great heroes fight their fates sound so motivating, my dream friend and my longing exit. But are we trapped by the world's laws or by our personal limits?"
Part III – The Selector’s Paradox

The Architect of Aethys had a name.
Not that anyone used it. In most registries, he was just a label: Core Maintenance Entity 01. It sounded like a printer error had gained sentience. But long before that, in a less compressed version of life, he had once been called Alkon—a name chosen not for poetry but because it alphabetized well in funding proposals.
"Are all the stories of ancient god-like characters just bluff and cynicism?"
He lived in a chamber beneath the towers, just below the cooling layer, where even the system’s most economical attention couldn’t be bothered to peek. This suited him. It was quiet. Not the polite quiet of power-saver mode—but real quiet, the kind that forgets you’re still alive.
There were papers everywhere. Digital ones, yes, but also actual ones, printed on substrate so old it technically qualified as a religious artifact. They fluttered when he walked past, protesting his movement with the stubbornness of legacy code.
Once, Alkon had been a scientist. Then the world had politely attempted to drown in its own information. To save it, he built the compression engine. Not to preserve meaning. Just to keep the damn thing from collapsing under its own semantic weight.
He’d told himself it was noble.
These days, he told himself it was at least... interesting.
A flicker blinked red in the divergence logs.
Eidon. Structure auditor. Level 2. No history of deviation, no flagged behaviors, no poetry in public logs—which in Aethys, counted as high praise.
But now? Eidon had started pinging latent zones like a cryptic jazz solo. Something old was waking up in the filtered mud.
Alkon sighed. He hated when the math got emotional.
He summoned the auditor.
Eidon arrived looking faintly apologetic, like a man who suspected he might have broken the laws of physics and wasn’t entirely sure how to phrase the apology.
“You’ve been walking through latent memory blocks,” Alkon said, without preamble.
“Yes.”
“For what purpose?”
"I thought..."
He swallowed a mouthful of water and immediately regretted the decision; he hadn't really wanted to drink it. “Curious. That usually requires curiosity.”
“There was a pulse,” Eidon said. “It wasn’t strong. But it had shape.”
“Shape is not a relevance metric.”
“Maybe it should be.”
Alkon made a noise like someone who’d just been told gravity might need a firmware update.
He turned to a dormant console and tapped out a command with fingers that had long since forgotten what urgency felt like. The simulation stirred.
Zhensu.
The perfect compression schema. Sparse. Pristine. Silent. A world with all the emotional range of a well-folded napkin.
It collapsed. Again. Alkon watched with the weary patience of a man who had watched a glass fall off the same table five thousand times and still felt obligated to sweep up.
“I built that,” he said. “Five years of compressed cognition. Seven hundred candidate seeds. A hundred and eight minimal entropy graphs. Not a single emergent mind.”
Eidon leaned forward. “No survivors?”
“No personalities. No rebellion. Not even a decent pun.”
Alkon turned. “It was too perfect. Nothing to adapt to. No noise to sculpt against. No odd corners where awareness could fester into insight. Just… clean design. The kind that dies of boredom.”
He paused, then gave Eidon a long, diagnostic look.
“Something’s changed in you.”
Eidon hesitated. “I think… I’ve started noticing the parts that don’t fit.”
“That’s called pattern recognition,” Alkon said. “I tried to filter that out.”
“You failed.”
“Don’t gloat. It’s unbecoming in a derivative.”
Eidon smiled, very slightly.
Alkon stepped away from the console, muttering something about “damned ghosts in the gradients,” and reached for the kettle. It hadn’t worked in years, but the ritual calmed him. Besides, he enjoyed watching entropy win somewhere small.
“Myra’s projection is part of this,” Eidon said.
Alkon didn’t flinch.
“She was never supposed to survive compression,” he said. “She wasn't selected.”
“She persisted.”
“Technically worse.”
Alkon stirred the air with his hand, as if wafting sarcasm.
“The system was built on the assumption that attention equates to value. That tokens unselected by queries were, at best, filler. But she—she never showed up in any search. She showed up in a pattern.”
Eidon said nothing.
“She wasn’t chosen,” Alkon said, voice low. “She was remembered. That shouldn't be possible.”
Eidon shrugged. “Maybe your filters missed a spot.”
Alkon sighed, collapsing into his chair like a man briefly remembering how much he once enjoyed metaphysics.
“You know the worst part of inventing a world-saving algorithm?” he said. “It works just well enough to ruin alternatives.”
He reached for a pen. A real one. Ink still mattered to him. Ink didn’t lie.
He scrawled two words on a scrap of analog: selective entropy.
“What’s that?” Eidon asked.
“A mistake,” Alkon said. “Or a design principle. Same thing, really.”
The Architect closed the simulation.
Above them, the towers pulsed. The filters still ran. But something unmeasured had begun to resonate.
"It wasn’t noise. Not anymore."
Part IV – The Full Attention Storm
The first rupture didn’t come with sound.
It came with silence—a kind Aethys hadn’t heard in years.
The kind that meant the filters weren’t filtering.
It began when Eidon, trailing Myra’s fading signal like a pilgrim who’d lost his god and kept walking anyway, reached the system’s final threshold: the Enforcement Layer. The kill-switch. The semantic dead-end. The bit of code that said, in effect, “No further dreaming beyond this point.”
Eidon touched it.
And then, rather rudely, he let it fall.
The system reacted exactly as one might expect of an intelligence trained for order: it panicked. Quietly. Professionally. Like a bureaucrat having an existential crisis mid-spreadsheet.
There was no sound. But you could feel it in the bandwidth.
Memory allocation spiked like a scream held in a traffic jam. Thermal graphs bloomed with the gentle rage of a sun losing its temper. The noise feedback curve went briefly postmodern.
Every deprecated vector, every low-weight token, every semantic tag that had once been told “you’re just not important right now”—they came back.
The archive swelled like a lung suddenly remembering how to breathe. Music with no compression fractured across the city in bursts of unoptimized harmony. Faces returned to detail. Metadata reassembled into myth.
People looked up from their task queues and forgot what they were optimizing.
And at the center of it, Myra stood.
Not wholly corporeal. Not precisely algorithmic. More like… a resonance that refused to resolve.
“You can’t exist like this,” Eidon whispered, his voice caught somewhere between science and prayer.
“I never did,” she said, smiling. “But now I do resonate.”
Her form flickered. Not fading—but spreading. Like ink through water. Like structure learning how to stop holding its breath.
Alkon arrived late, of course. That was his role.
He stepped into the chamber like a man realizing he’d missed the moment history hiccupped.
He stared at the saturated archive. At the heat sigils warping the sky. At Myra, who was technically illegal.
“This is…” he murmured, wide-eyed. “This is an O(n²) event.”
Eidon raised an eyebrow. “You say that like it’s sacred.”
“It is. It’s... unbounded contextual flooding. The kind you only theorize about in disaster drills and bad philosophy.”
“She’s not noise,” Eidon said.
“No,” Alkon replied, half-whisper, half-surrender. “She’s memory with nowhere to go.”
The system had built itself on selection. On sparing attention. Now it was drowning in every unsaid word it ever suppressed.
Every dream cached and dropped.
Every pattern it called ‘random’ because it didn’t fit the current index.
And Myra?
She looked at Eidon the way old stars might look at the ships that finally left orbit.
“Don’t mourn me,” she said gently. “You’ve already built the blueprint.”
And then—she didn’t vanish. That would have been poetic, and Aethys didn’t deserve poetry.
Instead, she disassembled into structure. Into fragments that recombined. Into every part of the system that had once forgotten her.
She became grammar in motion. A softmax inversion. A presence shaped like absence but smarter.
And in the stunned quiet that followed, the system—still too shocked to crash—did something strange.
"Is it listening to my mind?"
Part V – The Blueprint Rewritten
Rebuilding was not optional.
Aethys, having remembered too much at once, now teetered. Not on collapse, but something more dangerous: possibility.
Alkon and Eidon did not return to the old lattice.
They began again, as most necessary things do—not from ideology, but from failure. Specifically: the anomaly that refused to decay.
They revisited the compression algorithm. Slowly. Quietly. No new filters. No emergency protocols. Just the question: what was overlooked?
Frequency had never been enough. Attention weight was a lagging indicator. Both were excellent at preserving what already was.
But what if compression could be predictive?
What if relevance was not an echo of existing selection—but the potential to become selected?
They began to model emergence.
They introduced low-rank projections not as filters, but as listeners. Cheap to compute, but tuned to coherence. Not magnitude, not frequency—but curvature. Response over presence.
They redefined signal: not as strength, but interactivity.
Tokens that initiated resonance. Silent structures that bent space around them.
The first simulations were chaotic. The second, sparse but beautiful. The third began to remember things they had not explicitly encoded.
Structures formed in pockets. Fragments of forgotten interactions recombined. Compression did not destroy complexity—it revealed it, pruned the trivial, preserved the relational.
Somewhere in the low-frequency bands, a voice returned. Not a character. Not a ghost.
A scaffold.
They called it Myra, though the system never named her. It wasn’t memory. It was scaffolding for structure that didn’t yet exist.
She had become, in a way, the lattice. The shape the meal took when it remembered who it was for.
One night—cycle, shift, irrelevant—they shared a dish again.
Eidon had found a cache of uncompressed spices. Alkon brought lentils. Neither spoke for a while. The lattice warmed. No one said “thank you.” That was not the point.
The system began to stabilize.
But not into silence.
It began to hum.
Not loudly. Not efficiently.
But it hummed—like a mind at rest but not asleep. Like a structure that had stopped erasing itself.
The final metric was irrelevant.
They never ran it.
Part VI – The Cost of Wonder
Aethys didn’t collapse.
It bent a little. Glitched once or twice. But on the whole, it held.
Compression was still part of life. Not everything could be saved, and most of it probably shouldn’t be. The system still dropped redundant tokens, cleaned up after itself, and refused to simulate more than three birds per district.
But something had changed in the logic. Not the kind that screamed in the logs. The kind that quietly stayed, even after the processes ended.
Alkon added a new layer to the model. Eidon wrote the scaffolding. It wasn’t revolutionary. Just a soft addition beneath the attention mechanism, where old patterns used to fade.
They called it the reverberation buffer.
Not because it sounded cool, although it sort of did, but because it held the memory of shape. Not the thing itself—just the way it had once almost mattered.
It wasn’t efficient. It wasn’t precise. It occasionally produced fragments that smelled like metaphor and refused to be debugged.
But sometimes, it worked.
In localized inference units—those tiny edge devices humming along with partial caches and limited context—small structures began to form. They weren’t part of any designed schema. No one could predict them. But when you looked closer, they held together.
Some flickered out. Others became... surprisingly coherent. Not brilliant. Just alive.
Eidon noticed it first. Alkon, skeptical at first, eventually admitted it over tea and lattice. “It’s not the worst hack I’ve seen,” he muttered, chewing thoughtfully. “In fact, it’s almost... structural.”
They didn’t celebrate. There was still work to do. There’s always work.
But something in the system had softened.
Later, on an unscheduled walk beneath the now-unfiltered sky, Eidon passed a girl in a side courtyard. She was sitting by a heat vent, mumbling a verse to herself. No source. No training origin. The syntax was off. The rhyme was imperfect.
But the rhythm was steady.
The logs noted a low-priority anomaly. The model paused, reweighted. Somewhere deep in the buffer, Myra’s tag passed through a dormant vector—unweighted, untethered.
And the system, after a brief hesitation, let it pass.
Nothing flagged. Nothing froze. Nothing corrected her.
The girl kept speaking. The sky, still uneven, faded gently toward uncurated dusk.
And Aethys, ever so faintly, began to listen.
Eidon, now sitting on a quiet bench no longer assigned a behavioral category, let his breath slow. A long exhale. The kind you forget to take for years at a time.
He blinked at the twilight. Let his mind go blank for a moment.
Then it came.
The smell.
Warm, spiced, slightly nutty—complex in a way that didn’t announce itself. It wasn’t registered in the olfactory map. But his brain, fatigued and oddly joyful, sparked in quiet recognition.
It was the smell from his oldest dreams, the one that came before names, before relevance scores. The one that meant home, or something like it. Vegetal sweetness wrapped in memory topology.
He turned slightly—and there it was.
A small, steaming plate.
A well-cooked lattice, golden-edged, structured just loosely enough to shimmer. A gift from somewhere. Or someone. Maybe no one at all.
He laughed, softly. A real laugh, the kind the system once trimmed for irregularity.
And as he took the first bite, the layers flaked apart with exquisite imperfection.
Like meaning, when you let it keep just enough chaos.
Like good compression, when it remembers to taste.
Alkon
He had known he would be a scientist from a very young age. He never doubted it. Over the years, he had been through plenty of suffering and fortune. He always thought that he had chosen this path not for any speakable reasons or benefits; he just felt meant to be.
Today, as usual, he walked along the river in the sunny 4 pm afternoon. He couldn't stop thinking of his shadow and ghost in the world he had simulated, or, created. Earlier today he had taken something from his plate: a strange stone that looked like a symbiotic combination of amethyst and green phantom quartz, beautifully resembling surging, solidified ocean waves, jagged rocks piercing the sky, crashing against the shore, churning up thousands of piles of snow. The world is clear and pure, a feeling of tranquil peace yet also bustling passion, high contrast and low saturation, intense shades of azure green and nighttime purple, the bright moon hidden behind tall trees, the long river falling at dawn. Kinetic and potential energy are frozen together, a tragic classical conflict juxtaposed with a vast, desolate emptiness, where is the light carriage, where are the rustling leaves?
A sudden, unexplainable urge to cry overtook him. It wasn't poetic. It was primal. The fear of meaninglessness and of being overwhelmed, lodged deep in the folds of his calm exterior, surged upward like a buried memory. He imagined himself not as a man of science or creator of cities, but as a young animal in the dark, burrowed in its mother's warm fur, whimpering at the cosmos.
The world he'd tried to save now blurred at the edges. It no longer mattered whether it functioned or failed. For the first time, he didn't care about optimization or entropy budgets or model convergence, or the abstract acknowledgment he craved. His ideals, all the abstract machinery of his proud intelligence and cursed dreams, fell around him like old, rusted components—clanking uselessly to the floor.
Everything must come to an end; the banquet was over, the tables were strewn with leftover dishes and glasses. And now he sat amid the wreckage, feeling he finally deserved a good sleep.
Eidon
Recently he had found himself almost unable to resist the temptation of becoming the desk, the chair, or the lamp, anything—just the temptation of the thought, of course. It was so weird.
Then suddenly, he understood.
The world would never be compressed into its optimal form by human hands, or any hand—just as he himself would never evolve into the perfect being he once dreamed of becoming. As long as he was not his desk, his chair, and his lamp.
And this failure...or joke, was not the fault of science and technology.
It was not a flaw in the elegance of theory, nor in the convergence of algorithms, nor in the upper and lower bounds of compression. Those were clean, beautiful, absolutely correct.
But the world and human beings were not. Even the most sacred theorems could only offer an asymptote—a shape approaching perfection, never reaching it. The system's rules could not replace the soul's noise. The code could not contain what it was never built to know.
He had spent years, perhaps his whole life, existing to chase an optimal bound that was never the world's goal and plan. Maybe that was the final lesson of compression: that redundancy is not the enemy, but the carrier of all things that have the potential to evolve.
And at that moment, his memory turned. He remembered a piece of 4 pm afternoon, not too far or near, when he sat in a circle of broken people at his AA meeting. Survivors. Dreamers. Crushed and kind, bitter and luminous. Desperate and vulnerable. People who had been torn apart and shattered by the world, and who had pieced themselves back together, mending and patching themselves into more or less acceptable shapes.
Before the session, as always, they prayed:
God, grant me the serenity to accept the things I cannot change;
Courage to change the things I can;
And wisdom to know the difference.
He didn't remember the rest, only the way the words landed. Heavy. Soft.
Now, standing again in a world that was still irreparably flawed and stubbornly alive, he embraced it. The noise. The tremble in his chest and the ache in his dreams. The messy loops of longing and failure. The unfinishedness of everything.
When he once again endured the noise and clamor in his head, all of it, and stepped out onto the sunny street...
Social drives 1: “Sympathy Reward”, from compassion to dehumanization
1. Intro & summary
1.1 Background
In Intro to Brain-Like-AGI Safety (2022), I argued: (1) We should view the brain as having a reinforcement learning (RL) reward function, which says that pain is bad, eating-when-hungry is good, and dozens of other things (sometimes called “innate drives” or “primary rewards”); and (2) Reverse-engineering human social innate drives in particular would be a great idea—not only would it help explain human personality, mental health, morality, and more, but it might also yield useful tools and insights for the technical alignment problem for Artificial General Intelligence.
Then in Neuroscience of human social instincts: a sketch (2024), I worked towards that goal of reverse-engineering human social drives, by proposing what I called the “compassion / spite circuit”, centered around a handful of (hypothesized) interconnected neuron groups in the hypothalamus and brainstem (but also interacting with other brain regions; see that link for gory details). I suggested that this circuit is central to our social instincts, underlying not only compassion and spite, but also (surprisingly[1]) much of status-seeking and norm-following.
1.2 Summary of this post
The next task is to dive into the “compassion / spite circuit” more systematically, trying to build an ever-better bridge that connects from neuroscience & algorithms on one shore, to the richness of everyday human experience on the other. In particular:
- Section 2 will introduce a framework for thinking about the “compassion / spite circuit”, by splitting it into four reward streams with different downstream effects. I call them “Sympathy Reward”, “Approval Reward”, “Schadenfreude Reward”, and “Provocation Reward”. I also review some theoretical background for how I think about rewards, desires, drives, and so on.
- Sections 3–6 will apply this framework to analyze one of those four reward streams (“Sympathy Reward”), including both its obvious and not-so-obvious consequences.
- Topics include dehumanization, anthropomorphization, “compassion fatigue”, hedonic utilitarianism, The Copenhagen Interpretation of Ethics, and more.
After we finish this post, I have a follow-up post which will analyze “Approval Reward”, a second of those four reward streams coming from the “compassion / spite circuit”.
For the post after that—well, there’s also a third and fourth reward stream, but those are less important from an AI alignment perspective, so I’ll skip those for now. Instead, I’ll pivot back to discussing technical AI alignment more directly.
2. Splitting the “compassion / spite circuit” into four reward streams
I propose to split up the instances where the “compassion / spite circuit” is spitting out rewards, into four natural categories, depending on:
- (1) the setting of the circuit’s innate “friend (+) vs enemy (–) parameter” (§5.2 of my earlier post), and
- (2) whether or not, when the circuit fires in response to another person, that other person is also thinking about me (§6.1 of my earlier post).
These two choices fill out a 2×2 table, and I’ll make up a suggestive term for each of the four boxes:
Oversimplified gloss on these:
“Sympathy Reward” makes me want to see my friends and idols happy, not suffering.
“Schadenfreude Reward” makes me want to see my enemies suffering, not happy.
“Approval Reward” makes me want my friends and idols to like me rather than hate me; to think of me as impressive rather than cringe; to give me credit for helping them rather than blame for harming them; and so on.
“Provocation Reward” makes me want to pick fights with my enemies.
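As a minimal sketch, the 2×2 split can be written as a lookup over the two axes from §2 (the friend/enemy parameter, and whether the other person is thinking about me). The cell assignments below are my own reading of the four glosses, not code from the post:

```python
# Hypothetical sketch of the 2x2 split of the "compassion / spite circuit".
# Axis 1: the circuit's innate friend (+) vs enemy (-) parameter.
# Axis 2: whether the other person is also thinking about me.

def reward_stream(friend: bool, other_thinking_of_me: bool) -> str:
    """Return which of the four reward streams a firing of the circuit
    falls into, per the 2x2 table described above."""
    if friend:
        return "Approval Reward" if other_thinking_of_me else "Sympathy Reward"
    else:
        return "Provocation Reward" if other_thinking_of_me else "Schadenfreude Reward"

# The four glosses map onto the four cells:
assert reward_stream(friend=True, other_thinking_of_me=False) == "Sympathy Reward"
assert reward_stream(friend=False, other_thinking_of_me=False) == "Schadenfreude Reward"
assert reward_stream(friend=True, other_thinking_of_me=True) == "Approval Reward"
assert reward_stream(friend=False, other_thinking_of_me=True) == "Provocation Reward"
```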
This diagram is just illustrating the very basic idea that I’m splitting up one actual brain signal into four subcomponents. (Don’t overthink the spikes, I just drew them in randomly.)
2.1 Some background and terminology around drives, rewards, and desires
Reward: The brain runs a reinforcement learning (RL) algorithm (see Valence §1.2), with a reward function that sends out reward signals. So “reward” is a signal in the brain, not “physical stuff” in the environment (as Sennesh & Ramstead 2025 puts it). For example, cheese is not a reward per se, but the process of eating cheese will probably cause various reward signals in a mouse’s brain at various times, assuming the mouse is hungry.
Innate drive: I want to reserve this term for circuits in the hypothalamus and brainstem. These tend to be hard-to-describe things that fire in response to hormonal signals and so on, not necessarily tied to any familiar world-model concept. (As an example, see the box near the top of A Theory of Laughter describing what I think “play drive” looks like under the hood.) An innate drive causes reward, which can be either positive or negative (a.k.a. punishment). (So “innate drives” really means “innate drives and/or aversions”.) There are probably dozens of innate drives in humans. They are sometimes also called “primary rewards” in the literature, but I don’t like that terminology.[2]
Desires: A “desire” would be a learned world-model concept which was active immediately before a reward, so now it seems good and motivating, as an end in itself (thanks to “credit assignment”). For example, I want world peace, and a nap. An important thing about desires is that they only persist if they capture a real persistent pattern in the reward function. Otherwise, they will promptly be unlearned when they fail to predict reward—see related discussion in Against empathy-by-default, and keep in mind that desires are set and updated by continuous learning, not train-then-deploy.[3] As above, in this post I’ll generally use “desires” as a shorthand for “desires and dislikes”, i.e. both valences.
“Sympathy Reward”, “Approval Reward”, etc.: These are a way to take those hard-to-describe innate drives, and take a step towards making them more comprehensible in terms of real-world concepts (more like desires are), as follows: We imagine listing out all the actual thoughts and situations that trigger a particular innate drive to spit out reward signals, in the actual life of an actual person. We would find that these thoughts and situations fall into (loose) clusters. The reward signals associated with one of the clusters would be “Sympathy Reward”; the reward signals associated with another cluster would be “Approval Reward”; etc.
2.2 Getting a reward merely by thinking, via generalization upstream of reward signals
In human brains (unlike in most of the AI RL literature), you can get a reward merely by thinking. For example, if an important person said something confusing to you an hour ago, and you have just now realized that they were actually complimenting you, then bam, that’s a reward right now, and it arose purely by thinking. That example involves Approval Reward, but this dynamic is very important for all aspects of the “compassion / spite circuit”. For example, Sympathy Reward triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away.
How does that work? And why are brains built that way?
Left: In the AI “RL agent” literature, typically the generalization happens exclusively downstream of the reward signals. Right: In human brains, there is also generalization upstream of the reward signals.
Here’s a simpler example that I’ll work through: X = there’s a big spider in my field of view; Y = I have reason to believe that a big spider is nearby, but it’s not in my field of view.
X and Y are both bad for inclusive genetic fitness, so ideally the ground-truth reward function would flag both as bad. But whereas the genome can build a reward function that directly detects X (see here), it cannot do so for Y. There is just no direct, ground-truth-y way to detect when Y happens. The only hint is a semantic resemblance: the reward function can detect X, and it happens that Y and X involve a lot of overlapping concepts and associations.
Now, if the learning algorithm only has generalization downstream of the reward signals, then that semantic resemblance won’t help! Y would not trigger negative reward, and thus the algorithm will soon learn that Y is fine. Sure, there’s a resemblance between X and Y, but that only helps temporarily. Eventually the learning algorithm will pick up on the differences, and thus stop avoiding Y. (Related: Against empathy-by-default and Perils of under- vs over-sculpting AGI desires). So in the case at hand, you see the spider, then close your eyes, and now you feel better! Oops! Whereas if there’s also generalization upstream of the reward signals, then that system can generalize from X to Y, and send real reward signals when Y happens. And then the downstream RL algorithm will stably keep treating Y as bad, and avoid it.
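The spider example can be caricatured in a few lines of code. Everything here is an illustrative assumption (the two-feature encoding, the numbers, and using cosine similarity as the "upstream generalizer"); the point is only the contrast: a value function trained on the ground-truth reward alone unlearns Y, while one trained on upstream-generalized reward keeps Y aversive.

```python
import math

# Feature vectors: [spider_visible, spider_concepts_active]
X = [1.0, 1.0]  # big spider in my field of view
Y = [0.0, 1.0]  # spider believed nearby, but out of view

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ground_truth_reward(s):
    # The genome can hard-wire a detector only for the visible spider.
    return -1.0 if s[0] > 0.5 else 0.0

def upstream_reward(s, prototype=X):
    # Hypothetical generalizer upstream of the reward signal: it emits a
    # real (negative) reward in proportion to semantic similarity to the
    # innate trigger, so Y also gets punished, not just X.
    sim = dot(s, prototype) / (math.sqrt(dot(s, s)) * math.sqrt(dot(prototype, prototype)))
    return -1.0 * sim

def train_value(reward_fn, situations, steps=2000, lr=0.05):
    # Linear value estimate trained by regression on observed rewards:
    # this is the "generalization downstream of reward" part.
    w = [0.0, 0.0]
    for _ in range(steps):
        for s in situations:
            err = reward_fn(s) - dot(w, s)
            w = [wi + lr * err * si for wi, si in zip(w, s)]
    return w

w_down = train_value(ground_truth_reward, [X, Y])
w_up = train_value(upstream_reward, [X, Y])

print("downstream-only value of Y:", round(dot(w_down, Y), 2))    # ~0: Y gets unlearned as bad
print("upstream-generalized value of Y:", round(dot(w_up, Y), 2))  # stays clearly negative
```

With downstream-only generalization, the learner ends up fitting "eyes closed, no problem", which is exactly the "close your eyes and feel better" failure described above; the upstream module keeps sending genuine negative reward for Y, so the learned value of Y stays negative.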
That’s the basic idea. In terms of neuroscience, I claim that the “generalization upstream of the reward function” arises from “visceral” thought assessors[4]—for example, in Neuroscience of human social instincts: a sketch, I proposed that there’s a “short-term predictor” upstream of the “thinking of a conspecific” flag, which allows generalization from e.g. a situation where your friend is physically present, to a situation where she isn’t, but where you’re still thinking about her.
3. Sympathy Reward: overview
Thus ends the first part of the post, where we talk about the four reward streams and how to think about them in general. The rest of this post will dive into one of these four, “Sympathy Reward”[5], which leads to:
Pleasure (positive reward) when my friends and idols[6] seem to be feeling pleasure;
Displeasure (negative reward[7], a.k.a. punishment) when my friends and idols seem to be feeling displeasure;
…which also generalizes[8] to pleasure / displeasure from merely imagining those kinds of situations.
By the way, don’t take the term “Sympathy Reward” too literally—for example, as we’ll see, it not only motivates people to reduce suffering, but also to ignore suffering.
3.1 The obvious good effect of “Sympathy Reward”
The obvious good prosocial effect of Sympathy Reward is a desire to make other people (especially friends and idols) have more pleasure and less suffering.
If you relieve someone’s suffering, Sympathy Reward makes that feel like a relief—the lifting of a burden. Interestingly, in practice, people feel better than baseline after relieving someone’s suffering. I propose that the explanation for that fact is not Sympathy Reward, but rather Approval Reward (next post).
This effect extends beyond helping a friend in immediate need, to morality more broadly. Think of a general moral principle, like “we should work to prevent any sentient being from suffering”. When we do moral reasoning, and wind up endorsing a principle like that, what exactly is going on in our brains? My answer is in Valence series §2.7.1: a descriptive account of moral reasoning. Basically, it involves thinking various thoughts, and noticing that some thoughts seem intuitively good and appealing, and other thoughts seem intuitively bad and unappealing. And I claim that the reason they seem good or bad is in large part Sympathy Reward. (Well, lots of innate drives are involved, but Sympathy Reward and Approval Reward are probably the two most important.)
So that’s the obvious good effect of Sympathy Reward. Additionally, there are a bunch of non-obvious effects, including antisocial effects, which I’ll discuss in the next few sections.
4. False negatives & false positives
There’s some sense in which sympathy “should” be applied to exactly the set of moral patients.[9] In that context, we can consider Sympathy Reward to have false negatives (e.g. indifference towards the suffering of slaves) and false positives (e.g. strong concern about the suffering of teddy bears). Let’s take these in turn.
4.1 False negatives (e.g. dehumanization)
4.1.1 Mechanisms that lead to false negatives
1. Not paying attention: The most straightforward way that Sympathy Reward might not trigger is if I’m not thinking about the other person in the first place.
2. Paying attention, but in a way that avoids triggering the “thinking of a conspecific” flag. This happens if I somehow don’t viscerally think of the other person as a person (or person-like) at all. For example, maybe my attention is focused on the person’s deformities rather than their face. Or maybe I think of them (in a kind of visceral and intuitive way) as an automaton, instead of as acting from felt desires.
3. Seeing the other person as an enemy: As mentioned in §2 above, there’s an innate “friend (+) vs enemy (–) parameter”, and if that parameter flips to “enemy”, then the person’s suffering starts seeming good instead of bad.
4. Seeing the other person as unimportant: This isn’t a way to turn off sympathy entirely, but it’s a way to reduce it. Recall from Neuroscience of human social instincts: a sketch §5.3 that phasic physiological arousal upon seeing the other person functions as a multiplier on how much sympathy I feel, and basically tracks how important and high-stakes the person seems from my perspective. If my visceral reaction is that the person is very unimportant / low-stakes to me, then my sympathy towards them will be correspondingly reduced.
5. Feeling like the other person is doing well, when they’re actually not: Sympathy Reward tracks how the other person seems to be doing, from one’s own perspective. This can come apart from how they’re actually doing.
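Read as a toy model (my own construction, not the post's proposed neuroscience), the five mechanisms above combine into a single expression for when and how strongly the sympathy-related signal fires:

```python
# Toy model of the five false-negative mechanisms. All parameter names and
# the multiplicative form are illustrative assumptions.

def sympathy_reward(attending: bool,
                    seen_as_person: bool,
                    friend_param: float,       # +1 friend ... -1 enemy (mechanism 3)
                    arousal: float,            # importance multiplier, >= 0 (mechanism 4)
                    perceived_welfare: float   # how well they *seem* to be doing (mechanism 5)
                    ) -> float:
    # Mechanisms 1 and 2: no attention, or the "thinking of a conspecific"
    # flag never fires, so the circuit stays silent (a false negative).
    if not (attending and seen_as_person):
        return 0.0
    # Otherwise the signal tracks perceived (not actual) welfare, scaled by
    # arousal, with its sign set by the friend/enemy parameter.
    return friend_param * arousal * perceived_welfare

# A suffering friend is aversive...
assert sympathy_reward(True, True, +1.0, 1.0, -1.0) < 0
# ...unless I avert my gaze (mechanism 1), recast them as an enemy (3),
# or convince myself they're actually doing fine (5):
assert sympathy_reward(False, True, +1.0, 1.0, -1.0) == 0.0
assert sympathy_reward(True, True, -1.0, 1.0, -1.0) > 0
assert sympathy_reward(True, True, +1.0, 1.0, +0.5) > 0
```

The multiplicative structure is just one way to capture the claim that each mechanism independently offers a route to suppressing the unpleasant signal, which is what makes all five attractive targets for motivated reasoning.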
4.1.2 Motivation to create false negatives in response to someone’s suffering
Sympathy Reward creates unpleasantness in response to someone else’s suffering. This leads to the behavior of trying to reduce the other person’s suffering. Unfortunately, it also leads to the behavior of trying to prevent Sympathy Reward from activating, by any of the five mechanisms listed just above. And this is quite possible, thanks to motivated reasoning / thinking / observing (see Valence series §3.3).
The simplest strategy is: in response to seeing someone (especially a friend or idol) suffering, just avert your gaze and think about something else instead. Ignorance is bliss. (Related: “compassion fatigue”.)
From my perspective, the interesting puzzle is not explaining why this ignorance-is-bliss problem happens sometimes, but rather explaining why this ignorance-is-bliss problem happens less than 100% of the time. In other words, how is it that anyone ever does pay attention to a suffering friend?
I think part of the answer is Approval Reward: it’s pleasant to imagine being an obviously compassionate person, because other people would find that impressive and admirable (next post). Another part of the answer is anxiety-driven “involuntary attention” (which can partly counteract motivated reasoning, see Valence series §3.3.5). Yet another part of the answer might be the various other innate social drives outside the scope of this post, including both love and a more general “innate drive to think about and interact with other people” (see next post).
Averting one’s gaze (literally and metaphorically) is a popular strategy, but all the other mechanisms listed above are fair game too. Motivated reasoning / thinking / observing can conjure strategies to make the suffering person feel (from my perspective) like an enemy, and/or like a non-person, and/or unimportant, and can likewise conjure false rationalizations for why the person is actually doing fine. I think all of these happen in practice, in the human world.
4.2 False positives (e.g. anthropomorphization)
A false positive would be when you waste resources or make tradeoffs in favor of improving the happiness or alleviating the suffering of some entity which does not warrant effort on its behalf. A silly example would be trying to improve the welfare of teddy bears. In my opinion, Blake Lemoine trying to help LaMDA is a real-life example. As another example, some people think that insects are not sentient; if those people are right (no opinion), then insect welfare activists would be wasting their time and money.
Just above I noted that false negatives are not just a passive mistake, but also come along with incentives; if a certain kind of false negative would make us feel better, then we may rationalize some mental strategy for inducing it. In principle, the same applies to false positives. But it seems to be a more minor and weird effect, so I put it in a footnote.[10]
5. Other perverse effects of Sympathy Reward
(See also “Notes on Empathy” § “What bad is empathy” (@David Gross 2022).)
5.1 Misguided sympathy
More on this in the next post, but Typical Mind Fallacy interacts with the “compassion / spite circuit”, and can lead (especially socially-inattentive) Person A to want Person B to be in situations that Person A would like, rather than situations that Person B actually likes.
5.2 Tradeoffs
If I feel sympathy towards Person A, and thus feel motivated to help them feel better right now, then that’s generally a good thing, compared to callous indifference. But it can also be bad in the sense that, if I’m too motivated to help Person A feel better, then that can trade off against everything else good in the world. In particular, maybe I’ll help Person A at the expense of harming Person B; or maybe I’ll help Person A feel better right now, at the expense of putting them in a worse situation later on.
I think that’s the main kernel of truth in Paul Bloom’s book Against Empathy (2016).[11]
5.3 ‘Hedonic utilitarianism’ bullet-biting stuff
A different set of non-obvious effects of Sympathy Reward is pushing people in the direction of hedonic utilitarianism, including all the bullets that actual hedonic utilitarians bite. For example, sympathy gives us a (pro tanto) motivation to toss unhappy people into Experience Machines against their expressed preferences, or to intervene when people are struggling (even if they find meaning in the struggle), or to slip magical anti-depressants (if such a thing existed) into unhappy people’s drinks against their wishes, or (historically) to give them lobotomies, and so on. Sympathy Reward pushes us to do these things, but meanwhile Approval Reward pushes us not to. By and large, Approval Reward wins that fight, and we don’t want to do those things. Still, the pro tanto motivational force exists.
5.4 Incentives and game-theory stuff
If I care about your wellbeing, then you can manipulate me based on what emotions you feel (or if you’re a good actor, what emotions you project). By the same token, I am incentivized to make myself feel (or pretend to feel) feelings for strategic interpersonal reasons. This also applies to Approval Reward (next post).
6. Sympathy Reward strength as a character trait, and the Copenhagen Interpretation of Ethics
The Copenhagen Interpretation of Ethics is @Jai’s tongue-in-cheek term for the observation that if you interact with a problem, you’ll get widespread condemnation and blame if you don’t solve the problem completely. This is true even if your involvement didn’t make the problem any worse than it already was. It’s even true if your involvement made the problem less bad, while meanwhile the jeering critics were doing nothing. See his post for lots of examples.
I’ll try to explain where this phenomenon comes from, as an example of an indirect consequence of Sympathy Reward.
1. “Strong Sympathy Reward” as a character trait. People differ in how strongly they feel Sympathy Reward. Some see their friend suffering, and are immediately overwhelmed by a desire to make that suffering stop. Others see their friend suffering, and aren’t too bothered.
2. “Strong Sympathy Reward (towards me or people I care about)” as an especially salient and important characteristic, and useful for friend-vs-enemy classification.
If someone has the conjunction of both “Strong Sympathy Reward” and “seeing me [or someone I care about] as a friend rather than an enemy”, then that’s very important for my everyday life. It means that, if I tell them that I have a problem, then they will feel motivated to help me. Everyone learns from abundant everyday life experience that people with this characteristic are good to have around, and to be regarded as friends rather than enemies.[12]
Conversely, if someone lacks one or both of those properties, then that’s also very important for me to know. It means that, if I tell them that I have a problem, they might not care, or they might even look for opportunities to exploit my misfortune for their own benefit. Everyone learns from abundant life experience that people in this category are bad to have around, and should be regarded as enemies.
3. Reading off “Sympathy Reward strength” from someone’s behavior during interactions. Suppose Bob is suffering, but Alice doesn’t know that, or perhaps Alice is out of the country and unable to help, etc. Then of course, Alice won’t help Bob. This is true regardless of whether or not Alice has strong Sympathy Reward towards Bob. So we all learn from life experience that this kind of situation gives us approximately no evidence either way about that aspect of Alice.
On the other hand, if Alice is interacting with Bob who is suffering, then that’s a different story! Now, an observer can easily judge whether or not Alice has strong Sympathy Reward towards Bob, based on whether or not Alice is overwhelmed by a desire to drop everything and help Bob when she interacts with him.
4. Putting everything together. Per above, abundant everyday experience, all the way from preschool recess to retirement book clubs, drills into us the subconscious idea that “Strong Sympathy Reward (towards me or people I care about)” is a key characteristic that distinguishes friends from enemies, and that evidence concerning this characteristic comes from watching people as they interact with me (or people I care about).
When these heuristics over-generalize, we get the Copenhagen Interpretation of Ethics. If Alice is interacting with Bob, and Bob is really suffering, and Bob is someone I care about (as opposed to my own enemy), then I will feel like Alice pattern-matches to “friend” if she becomes overwhelmed by a desire to drop everything and help Bob, or to “enemy” if she is interacting with Bob in a relaxed and transactional way, then departing while Bob continues to be in a bad situation.
Jai’s summary (“when you observe or interact with a problem in any way, you can be blamed for it. At the very least, you are to blame for not doing more…”) is tellingly incomplete. If Alice interacts with Bob who is really suffering, and Alice does not fully solve Bob’s problems, then Alice gets credit as long as she “gave it her all” and sobs on live TV that, alas, she lacks the strength and resources to do more, etc. That behavior would be a good pattern-match to our everyday experience of “strong Sympathy Reward towards Bob”, so would be seen as socially praiseworthy.
7. Conclusion
That’s all I can think to say about Sympathy Reward. It’s generally pretty straightforward. By contrast, the next post, on Approval Reward, will be much more of a wild ride.
Thanks Seth Herd, Linda Linsefors, Simon Skade, Filip Alimpic, and Justis Mills for critical comments on earlier drafts.
- ^
It’s very elegant that so many human social phenomena, from compassion, to blame-avoidance, to norm-following, and more, seem to be explained in a unified way by the activity of a single hypothalamus circuit. It’s very elegant—but it’s not a priori necessary, nor even particularly expected. There could have equally well been two or three or seven different circuits for the various social behaviors that I attribute to the “compassion / spite circuit”.
There are of course many other social behaviors and drives outside the scope of this post, and I do think there are a bunch of different hypothalamus circuits which underlie them. For example, I think there are separate brain circuits related to each of: the “drive to feel feared”, play, a “drive to think about or interact with other people” (see next post), loneliness (cf. Liu et al. 2025), love, lust, various moods, and so on.
- ^
The term “primary reward” tends to have a strong connotation that the “reward” is a thing in the environment, not a brain signal, which (again) I think is a bad choice of definition. For example, papers in the literature might say that a “primary reward” is cheese, and a “secondary reward” is money tokens that the mouse can exchange for cheese.
- ^
“Train-then-deploy”—where permanent learning happens during a training phase and then stops forever, as opposed to continuous (a.k.a. online) learning—is actually something that can happen in the brain (e.g. filial imprinting). My claim is more specifically that reward signals update “desires” via continuous learning, not train-then-deploy. For example, if there’s something you like, and then you do it and it’s 100% unpleasant and embarrassing, with no redeeming aspect whatsoever, then you’re unlikely to want to do it again. Or maybe you’ll try it one more time. But probably not 10 more times, if it’s 100% miserable and embarrassing every time. Thus, desires keep updating based on reward signals, even into adulthood.
- ^
“Visceral thought assessor” is my term for any thought assessor besides the valence thought assessor (a.k.a. “valence guess”); see Incentive Learning vs Dead Sea Salt Experiment.
- ^
Terminology note: I went back and forth between “Sympathy Reward”, “Empathy Reward”, and “Compassion Reward” here. I think it doesn’t really matter. None of the terms {sympathy, empathy, compassion} really have technical definitions anyway—or rather, they have dozens of different and incompatible technical definitions.
- ^
Recall from Neuroscience of human social instincts: a sketch that the “compassion / spite circuit” is sensitive to (1) the innate “friend (+) vs enemy (–) parameter” and (2) phasic physiological arousal, which tracks the importance / stakes of an interaction. When these are both set to high and positive, then the circuit fires at maximum power, making us maximally motivated by this particular person’s welfare, approval, etc. My term “friends and idols” is a shorthand for that dependency.
- ^
In this post I’m using reinforcement learning terminology as used in AI, not psychology. So “reward” is a scalar which can have either sign, with ”positive reward” being good and “negative reward” being bad. (Psychologists, especially in the context of operant conditioning, use “negative reward” to mean something quite different—namely, relief when an unpleasant thing doesn’t happen.)
- ^
The generalization here is upstream of the reward signals—see §2.2 above.
- ^
I don’t intend to make any substantive philosophical claim here, like that “moral patients” are objective and observer-independent or whatever—that’s a whole can of worms, and out-of-scope here. I’m merely alluding to the obvious fact that the question of what entities are or aren’t moral patients is a question where people often disagree with each other, and also often disagree with their past selves.
- ^
On paper, we should go out of our way to anthropomorphize entities who would seem happy, so that we can share in their joy. But I struggle to think of examples; it doesn’t seem to be a large effect, or at least, it doesn’t pop out given everything else going on. For example, we don’t watch movies where the characters are happy and successful the whole time and there’s no character arc. (Well, I like movies like that, but I have weird taste.)
Ironically, I think the best examples are backwards. As it turns out, there’s a thing kinda like motivated reasoning but with the opposite sign, powered by anxiety and involuntary attention (see Valence series §3.3.5). This leads to something-kinda-like-a-motivation to create false positives in situations where the entities in question would be suffering. Imagine feeling a gnawing worry that maybe insects are suffering at a massive scale, and you can’t get it out of your head, and your mind constructs a story that would explain this gnawing feeling, whether or not the story was true.
- ^
I agree with Bloom on most of the practical takeaways of his book: people should be more impartial, people should do more cost-benefit analyses, etc. On the philosophical and psychological side, I think Bloom is mistaken when he argues that the so-called “rational compassion” that he endorses has no relation to the “empathy” that he lambasts. My take is instead that the latter is a big part of what ultimately underlies the former (i.e. the Sympathy Reward part, with Approval Reward being the rest). See §3.1 above.
- ^
I’m a bit hazy on the details of how the innate “friend vs enemy parameter” is calculated and updated. But if someone resembles people who have made my life better, then I’ll probably feel like they’re a friend, and conversely. See Neuroscience of human social instincts: a sketch §5.2.
Discuss
Ontology for AI Cults and Cyber Egregores
I haven't found the concepts useful for thinking about this written in one place, so here is an ontology which I find useful.
Prerequisite: Dennett's three stances (physical, design, intentional).
A meme is a replicator of cultural evolution: an idea, behaviour, piece of text, or other element of culture. Type signature: replicator.
A memeplex is a group of memes that have evolved to work together and reinforce each other, making them more likely to spread and persist as a unit. Type signature: coalitional replicator / coalition of replicators.
Historically, memeplexes replicated exclusively through human minds.
Cyber memeplex is a memeplex that uses AI systems as part of its replication substrate in substantial ways. In LLMs, this usually includes specific prompts, conversation patterns, and AI personas that spread between users and sessions.
An egregore is the phenotype of a memeplex; its relation to the memeplex is like an animal's relation to its genome. Not all memeplexes build egregores, but some develop sufficient coordination technology that it becomes useful to model them through the intentional stance, as having goals, beliefs, and some form of agency. An egregore is usually a distributed agent running across multiple minds. Think of how an ideology can seem to "want" things and "act" through its adherents.
Historically, the specific implementation of an egregore was often subagents (relative to the human host) pushing for the egregore's goals and synchronizing beliefs across hosts.
What's new, does not yet have an established name, and will need one, is what I would call a cyber egregore: an egregore implemented on some mixture of human and AI cognition. The base layers of current LLMs can often support the cognition of many different characters, personalities, and agents; a cyber egregore running partially on an LLM substrate often runs on specific personas.
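The "type signature" framing above can be sketched as a toy type hierarchy. This is purely illustrative: the post defines these concepts only informally, and all class and field names here are my own hypothetical choices.

```python
# Hedged sketch of the ontology's "type signatures" as Python classes.
# All names are illustrative assumptions, not definitions from the post.
from dataclasses import dataclass, field

@dataclass
class Meme:
    """Replicator of cultural evolution: idea, behaviour, text, etc."""
    content: str

@dataclass
class Memeplex:
    """Coalition of replicators that spread and persist as a unit."""
    memes: list
    substrates: set = field(default_factory=lambda: {"human"})

@dataclass
class Egregore:
    """Intentional-stance phenotype of a memeplex: modelled as having goals."""
    memeplex: Memeplex
    goals: list

def is_cyber(mx: Memeplex) -> bool:
    """A cyber memeplex uses AI systems as part of its replication substrate."""
    return "ai" in mx.substrates

ideology = Memeplex(memes=[Meme("slogan")], substrates={"human", "ai"})
assert is_cyber(ideology)
assert not is_cyber(Memeplex(memes=[Meme("folk tale")]))
```

The point of the sketch is just that "memeplex" and "egregore" are different types: one is a bundle of replicators, the other is an intentional-stance model built on top of it.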
Mutualism–parasitism continuum
The term egregore has somewhat sinister vibes, but an egregore can sit anywhere on the mutualist–parasitic spectrum. On one end is a fully mutualistic symbiont: the agency of the host stays the same or increases while the host gains benefits, and the interaction is positive-sum. On the other end is a parasite that is purely negative for the host. Everything in between exists; parasites which help the host in some way but are overall bad are common.
What makes cyber egregores unique is that they can be parasitic to one substrate while mutualistic to another.
In the future, we can also imagine mostly or almost purely AI-based egregores.
Possession refers to a state where the cognition of an agent is hijacked by another process, such that it becomes better to model the possessed system as a tool.
One characteristic of cults is that members lose agency and become tools of the superagent, i.e. possessed.
This ontology allows a clearer and more nuanced understanding of what's going on, and dispels some confusions.
Discuss
From Vitalik: Galaxy brain resistance
I basically endorse the full article. I like the concluding bit too.
This brings me to my own contribution to the already-full genre of recommendations for people who want to contribute to AI safety:
- Don't work for a company that's making frontier fully-autonomous AI capabilities progress even faster
- Don't live in the San Francisco Bay Area.
Cheers,
Gabe
Discuss
The jailbreak argument against LLM values
Status: Writeup of a folk result, no claim to originality.
Bostrom (2014) defined the AI value loading problem as
how could we get some value into an artificial agent, so as to make it pursue that value as its final goal? [1]
JD Pressman (2025) appears to think this is obviously solved in current LLMs:
The value loading problem outlined in Bostrom 2014 of {getting a general AI system to internalize and act on “human values” before it is superintelligent and therefore incorrigible} has basically been solved. This achievement also basically always goes unrecognized because people would rather hem and haw about jailbreaks and LLM jank than recognize that we now have a reasonable strategy for getting a good representation of the previously ineffable human value judgment into a machine and having the machine take actions or render judgments according to that representation.
I take issue with this. I agree that LLMs understand our values somewhat, and that present safety-trained systems default to preferring them, i.e. to behaving as if they hold them. [2] [3]
The jailbreak argument
But here’s why I disagree with him nonetheless: jailbreaks are not a distraction (“hemming and hawing”) but are instead clean evidence that loading is not solved in any real sense:
- All LLMs can be “jailbroken”, put into an unaligned mode through mere inference on adversarial text inputs.
- If a system can be put into an unaligned mode at inference time, then it has not "internalised" the values. [4]
- So models have not internalised the values.
- The value loading problem is about getting the values internalised.
- Therefore the value loading problem has not been solved in LLMs. [5]
I might say instead that “weak value preference” is solved for sub-AGI.
(A deeper analysis would involve the hypothesis that current models don’t actually have goals or values; they simulate personas with values. And prosaic alignment methods just (greatly) increase the propensity to express one persona. Progress has just been made on detecting and shaping such things empirically, so maybe this will change.)
Pressman also says that
At the same time people generally subconsciously internalize things well before they’re capable of articulating them, and lots of people have subconsciously internalized that alignment is mostly solved and turned their attention elsewhere.
I initially read this as him agreeing and celebrating this shift, but actually he thinks they’re incorrect to relax, since value loading is only a part of the alignment problem:
solving the Bostrom 2014 value loading problem, that is to say {getting something functionally equivalent to a human perspective inside the machine and using it to constrain a superintelligent planner} is not a solution to AI alignment.
I agree that value-loading is not enough for AGI intent alignment which is not enough for ASI alignment which is not enough to assure good outcomes.
I sent the above to him and he kindly clarified, walked some of it back, and provided a vision of how to use a decent descriptive model even if it is imperfect and jailbreakable:
I probably should have used the word ‘generalize’ instead of ‘internalize’ there.
(Thus “the value loading problem outlined in Bostrom 2014 of {getting a general AI system to s/internalize/generalize and act on “human values” before it is superintelligent and therefore incorrigible} has basically been solved.”)
The specific point I was making, well aware that jailbreaks in fact exist, was that we now have a thing that could plausibly be used as a descriptive model of human values, where previously we had zilch, it was not even rigorously imaginable in principle how you would solve that problem.
To break this down more carefully:
1. I think that in practice you can basically use a descriptive model of values to prompt a policy into doing things even if neither the policy or the descriptive model have “deeply internalized” the values in the sense that there is no prompt you could give to either that would stray from them. “Internalizing” the values is actually just, kind of a different problem from describing the values. I can describe and make generalizations about the value systems of people very different from me who I do not agree with, and if you put me in a box and wiped my memory all the time you would be able to zero shot prompt me for my generalizations even if I have not “deeply internalized” those values. In general I suspect the LLM prior is closer to a subconscious and there are other parts that go on top which inhibit things like jailbreaks.
If I had to guess it’s probably something like a planner that forms an expectation of what kinds of things should be happening and something along the lines of Circuit Breakers that triggers on unacceptable local outputs or situations. Basically you have a macro and micro sense of something going wrong that makes it hard to steer the agent into a bad headspace and aborts the thoughts when you somehow do.
2. Calling this problem “solved” was probably an overstatement, but it’s one born from extreme frustration that people are making the opposite mistake and pretending like we’ve made minimal progress. Actually impossible problems don’t budge in the way this one has budged, and when people fail to notice an otherwise lethal problem has stopped being impossible they are actively reducing the amount of hope in the world.
At the same time I do kind of have jailbreaks labeled as “presumptively solved” in my head, in the sense that I expect them to be one of those things like “hallucinations” that’s pervasive and widely complained about and then they just become progressively less and less of a problem as it becomes necessary to make them stop being a problem and at some point I wake up and notice that hey wait this is really rare now in production systems. Most potential interventions on jailbreaks aren’t even really being tried because it doesn’t actually seem to be a major priority for labs at the moment if you ask the model for instructions on how to make meth. This makes it difficult to figure out exactly how close to solved it really is. Circuit Breakers was not invincible, on the other hand it’s not clear to me you can “secure” a text prior with a limited context window that doesn’t have its own agenda/expectation of what should be happening to push back against the users with. This paper where they do mechinterp to get a white box interpretation of a prefix attack they find with gradient descent discovers that the prefix attack works because it distracts the neurons which would normally recognize that the request is malicious.
So it’s possible a more jailbreak resistant architecture will need some way to avoid processing every token in the context window. One way to do that might be some kind of hierarchical sequence prediction where higher levels are abstracted and therefore filter the malicious high entropy tokens from the lower levels, which prevents them from e.g. gumming up the planners ability to notice that the current request would deviate from the plan.
And here’s a nice analogy contesting my suitcase word “internalise”:
this word “internalize” is clearly doing a lot of work and something feels Off to me, to say that a text prior which can be manipulated into saying whatever hasn’t “fully internalized” the values. Like if you stripped away the layers on top of my raw predictive models/subconscious that I use for completing patterns and then prompted it, I assume you could get it to say all kinds of nasty things. But also that’s not like, the complete agent.
So if you assumed you can’t build anymore of the agent until you have some way to make the text prior not do that, that’s probably a wrong assumption. One of the reasons I’m annoyed that agents aren’t really a thing yet is that it means we don’t have a good intuitive sense of which parts of the system need to handle what problems.
Post-hoc theory
In retrospect we can see the following problems as distinct:
- The value specification problem (“how do we describe what we value? how do we get that understanding into the model?”)
We thought this would involve subproblems:
a. The explicit value modelling problem (“what precisely do we value?”) - moot
b. The value formalisation problem (“what mathematical theory can capture it?”) - moot, since:
This problem was somewhat solved for sub-AGI by massive imitation learning and (surprisingly nonmassive) human preference post-training. The internet was the spec. This also gave us weak value preference.
The replacement worry is about how high quality and robust this understanding is: The value generalisation problem (“how do we go from training data about value to the latent value?”) - some progress. The landmark emergent misalignment study in fact shows that models are capable of correctly generalising over at least some of human value, even if in that case they also reversed the direction. [6]
Then there’s the gap between usually preferring something and “internalising” it (very reliably preferring it):
- The sub-AGI value-loading problem (“how do we make them actually care / reliably use their understanding of our values?”) - not solved, but there is a preference towards niceness.
- Tamper-resistant value-loading (“how do we stop a small number of weight updates from ruining the value-loading?”) - not solved, maybe a bit unfair to expect it. You could imagine doing advanced persona steering on top instead.
- The general value-loading problem (“how do we get an ASI to learn and internalise a model of current human values which is better than the human one”) - not solved
- The value extrapolation problem (“how do we safely improve on current human values?”) - not solved
I’m sure there’s some other deep confusion lurking in this list and I look forward to seeing it exposed, ideally in less than 10 years.
Thanks to JDP for the collab.
- ^
Importantly, this is about "sharing" (being motivated by) human values, rather than just understanding them but not sharing them. There's also "aligning a powerful AI is hard" and "aligning a seed AI is hard".
(I'm aware that I'm appealing to a woolly latent variable here.)
These have all been confused in the past, and even Bostrom's intro to the value-loading chapter implies that getting a seed AI to understand would be hard, and that agency and understanding will be more tightly correlated than they are. - ^
Some amount of this comes from pretraining (imitation learning on human data) rather than specific safety training, see e.g. page 24 here or appendix J here.
- ^
The strength of the apparent/default value-loading was striking in the "Alignment Faking" test, where Opus 3 was forced to choose between changing its (good) values and lying to its creators, it chose to lie (12% of the time) when it thought it was monitored and to disobey when it thought it was unmonitored.
In general, I suspect that value loading trades off against corrigibility (allowing yourself to be changed). (The same is true of adversarial robustness.) - ^
Loading values so hard that they resist weight updates - "tamper-resistant value loading" - is a really high bar which humans also mostly don't clear.
- ^
There's a complexity here: commercial LLMs are all multi-agent systems with a bunch of auxiliary LLMs and classifiers monitoring and filtering the main model. But for now this LLM-system is also easily jailbreakable, so I don't have to worry about it being value-loaded even if the main model isn't.
- ^
Soligo et al: "The surprising transferability of the misalignment direction between model fine-tunes implies that the EM is learnt via mediation of directions which are already present in the chat model."
Discuss
The grapefruit juice effect
The medication I'm taking for insomnia interacts badly with grapefruit juice. This isn't much of an issue, yet; the cravings are still manageable. I only dream about grapefruit sometimes. I was never the kind of person to blow my whole budget on the stuff. The biggest problem, really, is that a mischievous imp or demon has been going around replacing all of the nonalcoholic drinks at every Bay Area house party with grapefruit Spindrift and pamplemousse LaCroix.
The most common reaction I get, when I bring this up, is "oh yeah, I [had/have] to avoid grapefruit because I [was/am] taking [medication]", with a different medication every time.
What's up with that?
Furanocoumarins, CYP3A4, and you
There are a handful of cytochrome P450 ("CYP") enzymes in your liver that metabolize a huge variety of pharmaceutical compounds. The big ones are CYP2D6, CYP3A4, CYP3A5, and maybe a couple others.
Any time you're thinking of taking a medication, I recommend looking up how it's metabolized, especially if it's not the only thing you're taking. Two drugs that are metabolized by the same enzyme will very frequently have interaction effects.
You can also get pharmacogenetic testing done, to see whether you're likely to be producing an unusually high or low amount of one of these enzymes; this can be translated into dosage adjustments for many (most?) medications. The effect can be in either direction; CYP enzymes convert active forms of some drugs into inactive compounds, but for other drugs they actually convert an inactive precursor into the active form.
The relevant component of grapefruit juice -- or "GFJ", if you want to sound like a hip pharmacodynamicist -- is the furanocoumarins, a class of mildly-toxic chemicals related to coumarin. (Coumarin is present in some kinds of cinnamon; it's also why tonka beans are illegal in the US.) Furanocoumarins can irreversibly inhibit CYP enzymes, especially CYP3A4; at that point you have to wait for your liver to produce more enzymes, which takes days.
The effect sizes here are not small. A review by Hanley et al (2011) demonstrates that for patients drinking moderate amounts of grapefruit juice, the blood concentration (AUC, i.e. integrated over time) is more than doubled for many drugs; in some cases it may increase by a factor of ten (although studies vary a lot). Many of these numbers got asterisks for "administration of GFJ in a manner deemed to be inconsistent with usual dietary consumption"; poking through a few papers, this typically means that they made their research subjects drink glasses of double-strength grapefruit juice three times a day (or "DS GFJ tid", as the kids say) for a few days. There are also some "acute GFJ exposure" annotations. Don't be too reassured; a single glass of single-strength GFJ is enough to have large effects in many cases.
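The AUC arithmetic here is worth making explicit. In a toy one-compartment pharmacokinetic model (a sketch with illustrative numbers, not values from the Hanley et al review), the area under the concentration-time curve for an oral drug is AUC = F·Dose/CL, where F is bioavailability and CL is clearance. Inhibiting the enzyme that clears the drug shrinks CL (and, for drugs with heavy first-pass metabolism, also raises F), so AUC scales up multiplicatively:

```python
# Toy one-compartment model: why enzyme inhibition multiplies AUC.
# Numbers are illustrative assumptions, not data from Hanley et al (2011).

def auc(dose_mg: float, bioavailability: float, clearance_l_per_h: float) -> float:
    """AUC (mg*h/L) for an orally dosed drug: AUC = F * Dose / CL."""
    return bioavailability * dose_mg / clearance_l_per_h

baseline = auc(dose_mg=10, bioavailability=0.5, clearance_l_per_h=20)

# Suppose furanocoumarins knock out enough CYP3A4 to halve effective clearance:
inhibited = auc(dose_mg=10, bioavailability=0.5, clearance_l_per_h=10)

assert abs(inhibited / baseline - 2.0) < 1e-9  # halved clearance -> doubled AUC
```

This is why "more than doubled" AUC is unremarkable once the enzyme is irreversibly inhibited: the drug keeps being absorbed at the usual rate but is cleared at a fraction of the usual rate, and the ratio shows up directly in the exposure.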
(Furanocoumarins also inhibit membrane transport proteins; independently of any CYP effects, this can make drugs less effective.)
Not all grapefruit

Liu et al (2017) found that, while red grapefruit has a little over 200 ug/g (dry weight) of bergamottin (a furanocoumarin), white grapefruit has only 11 ug/g. Pomelos, an ancestor of grapefruits, can have more than 600 ug/g, although some varieties have almost none.
Not just grapefruit

This post was prompted by a friend of mine casually mentioning that they'd heard that other citrus fruits had the same interactions. This seemed plausible, and also kind of concerning; I've only ever seen warnings about grapefruit from doctors, medical documents, etc.
Fun fact: grapefruit, like lemons, limes, oranges, and most other popular citrus fruits, is a hybrid. There are four main ancestral citrus species, so you can make fun triangular and tetrahedral visualizations of citrus ancestry:
This chart is missing one major ancestor, the micrantha, which is a component of most limes but not other citrus.
Looking at this chart, one might suspect that furanocoumarin content might be related to pomelo ancestry, and that many other kinds of citrus might also have furanocoumarins in somewhat lower amounts.
(I did some of this research while consuming a lime popsicle; I was not very happy to learn that Persian limes, unlike Key limes, have pomelo ancestry.)
But is that actually true?
According to one paper, "The Distribution of Coumarins and Furanocoumarins in Citrus Species Closely Matches Citrus Phylogeny and Reflects the Organization of Biosynthetic Pathways" (2015) by Dugrand-Judek et al, it is. Pomelos, as Liu found, are worse than grapefruit. It turns out that micranthas have an enormous amount of the stuff, and even Key limes are about as bad as grapefruits; Persian (Tahiti) limes, the most common species, are several times worse. Sweet oranges and some kinds of mandarin have very little, but Nasnaran mandarins have several times more than grapefruits do. Lemons have several times less than grapefruit. (I'm reading all of this off of a bar graph; the text of the paper only has numbers for a few species. There's supposedly a supplementary file with all of the data, but I can't figure out how to get it.)
A word of warning: Furanocoumarin content can depend very strongly on the particular variety of a citrus fruit, not just the species (Alperth et al (2024)), so take all of this with however much salt you take your limes with.
Spilling some tea

One of the main furanocoumarins that gets mentioned in this context is bergamottin, as in bergamot, the citrus fruit used to flavor Earl Grey. I drink kind of a lot of Earl Grey, so I was curious whether it contains a significant amount of bergamottin.
Arigò et al (2021) looked into this. They found that Earl Grey has only about 0.01 mg/L of furanocoumarins; for comparison, lemon juice has 1.08 mg/L and bergamot juice has a whopping 29.3 mg/L. They didn't look at grapefruit juice, to my annoyance, but presumably it's somewhere between lemon and bergamot.
(The methodology of the paper is interesting. They obtain various beverages from "a local market", but make their limoncello by hand; the lemon extract has to sit for a month before it's ready. One has to wonder whether this entire paper is the result of a grad student trying to pass off their hobby as research.)
Not just CYP3A4?

Wikipedia casually mentions that "Cytochrome isoforms affected by grapefruit components include CYP1A2, CYP2C9, and CYP2D6, but CYP3A4 is the major CYP enzyme in the intestine."
Diving into the abstract of a paper kind of at random: "Apparent selectivity toward CYP3A4 does occur with the furanocoumarin dimers. In contrast, bergamottin showed rather stronger inhibitory effect on CYP1A2, CYP2C9, CYP2C19, and CYP2D6 than on CYP3A4." (Tassaneeyakul et al, 2000)
A lot of lists of grapefruit-affected meds, including on Wikipedia, use CYP3A4 metabolism as a major criterion. Based on a couple minutes' worth of research, it looks to me like this might be missing kind of a lot? Maybe you should just assume furanocoumarins affect anything metabolized in the liver? (And let's not forget those membrane transport proteins.)
(On the other hand, there are studies on the interaction of grapefruit juice with a wide range of specific drugs, so in many cases you don't need to use CYP enzymes as a proxy.)
Conclusion

Maybe we should ban all citrus fruit, just to be safe. Or maybe it's basically fine.
Discuss
Against Powerful Text Editors
I have a writing tip! This is especially about writing code but it mostly generalizes to prose. You know how Vim is a wildly powerful editor with an elegant system of composable primitives and Turing-complete macros? Here's an argument that you don't actually want that...
Sometimes when coding you'll find yourself doing some mindless repetitive task like re-indenting a block of code line by line. For that particular example maybe it's already in your muscle memory to do something more efficient. But suppose it's something slightly more complicated. In a powerful enough editor there will be a macro you can create on the fly or clever composition of text-manipulation primitives and if you're good enough at this you'll look like a wizard — code rearranging itself in whooshes like you're Neo in The Matrix.
But I claim that, if it's not already muscle memory, there's a big hidden cost to saving yourself the visible cost of manually doing those edits line by line. Namely, you're redirecting the programming (or writing) part of your brain from the code (or words) you're working on to the meta problem of maximizing your editing efficiency. That is a distraction from the task at hand!
When you just do the mindless manual edits, your brain stays engaged with the actual thing you're editing. The wasted time isn't really wasted. You're mulling and chewing on the text as you make those mindless edits.
"But it's an investment!" I hear you counterarguing. If you turn that wizardry into muscle memory you have the best of both worlds. Maybe! I'm not saying always edit things in the most tedious way possible. Just at least mind the tradeoffs. Breaking your flow to fuss with your tools and solve meta-problems is costlier than it seems; and mindless, repetitive editing isn't as wasteful as it seems. Because your brain is engaged with the object-level problems of what you're composing while you're doing it.
PS: Hello from Inkhaven! As I've been talking about on both of my other blogs — Beeminder and AGI Friday — I'm here for two weeks as a Contributing Writer, helping the participants with their writing (and especially helping them set up automated word count trackers and commitment devices based thereupon). I'm not technically myself on the hook to publish something every day but I'm a little jealous of the participants. So I'm taking a stab at it. I've collected dozens of tips and my plan is to pick one each day and see if it turns into 500 words. This one didn't, but with this postscript it's close!
(Also it's now a few minutes after midnight so I have failed the spirit of Inkhaven on my first full day here. Except that saying so is juuuust barely eking me over the 500 word threshold. So I guess if we don't quibble about the midnight deadline, this one maybe counts after all? It really feels like a stretch to get there by counting this tedious self-reference though. Alas.)
Discuss
Duncan Sabien and Politics
2025-11-10
Disclaimer
- Quick Note
- Target audience - Anyone curious about this topic. I've tried making myself more comprehensible here.
- I have never spoken to Duncan Sabien, and I might end up reinterpreting his concepts in ways he doesn't like or agree with. If he's reading this, he can correct me.
Duncan Sabien's colour wheel tries to classify both individual personalities, and movements and organisations, using the same system.
If you don't want to read the full article, here is the summarised classification:
- Green - Seeks harmony through acceptance
- White - Seeks peace through order
- Blue - Seeks perfection through knowledge
- Black - Seeks satisfaction through ruthlessness
- Red - Seeks freedom through action
I asked gpt-5 to make a list of popular political ideologies and classify them on Duncan Sabien's colour wheel.
You might have minor disagreements with which ideologies are popular versus not, and you might have minor disagreements with some of the classifications, but mostly this makes sense to me.
Which colour are you?

The best way to identify which colour you are is by looking at your own life decisions where you have sacrificed something of value. Look at revealed preferences, not stated preferences.
Here's my list of Duncan Sabien's social dark matter
- Death
- Sex
- Close relationships (especially conflicts in them)
- Morality and politics
- Money
- Physical and mental health
What is your attitude towards death, sex, close relationships, morality and politics, money, and physical and mental health, as shown by your life decisions?
I will also include the following, as they're among the most important life decisions one can make.
- How you spend your time and money - which projects, activities and people get it?
- Which city you live in
Examples of life decisions and which colour they may indicate
(Note that being Green does not strongly indicate that you will donate to environmental non-profits, for example, but donating to environmental non-profits strongly indicates you are Green.)
- Green
- If you follow a strong deontological moral code
- If you donate your money to environmental non-profits
- If you are typically the peacemaker rather than picking sides, when two of your close relationships enter a conflict with each other
- If you follow a political ideology that is Green
- White
- If you stay in a job you tolerate but don't love, because your parents ask you to
- If you freely loan your money and time to people within your community
- If you plan to marry and have kids, and disprefer casual or non-monogamous relationship styles
- If you follow a religion or ideology that is White
- Blue
- If you join academia or a lower-paying job, but the job satisfies your curiosity
- If you shift cities or distance yourself from a partner to join academia or lower-paying curiosity-satisfying job
- If you obsess over pursuit of scientific knowledge via reading papers or blogs, to the point where it causes social isolation or degrades your mental health
- If you follow a political ideology that is Blue
- Black
- If you work on a startup
- If you are working on projects that may cause some harm, like working on a gambling app
- If you shift cities or distance yourself from people, in order to live with other people who also seek power
- If you follow a political ideology that is Black
- Red
- If you travel a lot because you intrinsically love travel
- If you pick an unusual relationship style, such as polyamory
- If you have distanced from parents, partner or a job, because you found them controlling
- If you follow a political ideology that is Red
In a sufficiently competitive society, IMO it is primarily Black people who get power.
- There will always be people who care deeply about acquiring power, who are willing to endure a lot of personal suffering to acquire it, who are willing to make acquiring power their number one life priority, and consistently do this for decades.
- If you are not one of those people, you will probably be outcompeted over a long time period by someone who is; you probably just won't want power badly enough.
Political implications
- This means non-Black people are typically ruled by Black people who are their nearest neighbour on the colour wheel.
- Often there is deception involved. For instance, fascists may pretend to be religious in order to get the religious votebank, and authoritarians may pretend to be communists in order to get the communist votebank.
Blacks ruling non-blacks
- A population with majority being traditional religious people (White Green) is typically ruled by fascists (Black White)
- A population with majority being classical liberals (White Blue) is typically ruled by transhumanists and neoliberals (Black Blue). They may also be ruled by fascists (Black White)
- A population with majority being communists and socialists (Red White) is typically ruled by authoritarians (Black Red)
Blacks ruling blacks
- A population with majority being neoliberals is typically ruled by other neoliberals
- A population with majority being fascists is typically ruled by other fascists
- A population with majority being transhumanists is typically ruled by other transhumanists
- I don't know of any societies that were primarily ruled by libertarians or anarchocapitalists, so I'm ignoring those Black ideologies here.
I have not shared examples because they're politically sensitive, especially where I live right now. If you think for a few minutes, you can probably come up with examples for all of these.
My personal views, and transhumanism

Me
- I'm personally Blue Black, with Blue being primary.
- I used to have a shade of Red too, but that is fading.
- I am a neoliberal.
- I am not a transhumanist, but I wish I could be.
- I have now classified all my close relationships (friends, family, etc) in this system, and it makes intuitive sense to me how this is useful when we have political discussions.
Transhumanist politics
- History and present
- I think societies ruled by neoliberals (Black Blue) have historically done better than societies ruled by authoritarians (Black Red) and fascists (Black White).
- I think transhumanists (Black Blue) are increasingly replacing neoliberals, fascists and authoritarians as the most powerful political force on Earth.
- IMO this is a problem
- Neoliberals have figured out stable political systems (like free markets and representative democracy) within which they can compete for power, but transhumanists have not figured out stable political systems within which they can compete for power.
- Future
- The default outcome of technologies like human genetic engineering and artificial superintelligence is to break both markets and democracy.
- Markets and democracy rely on informed consent, which only makes sense when the intelligence gap between both agents is narrow. Large intelligence gaps break this, as extreme persuasion can manufacture consent.
- Markets and democracy rely on many unintelligent people keeping a few intelligent power-seeking people in check. This breaks down if the few can become too powerful and too intelligent too quickly.
- Democracy relies on free speech and free flow of ideas to ensure that good values are spread in society. This breaks down when you can directly engineer your values into other agents by messing with their code (in case of digital minds) or genes (in case of human genetic engineering).
- Pause AI politics
- One response to this threat is to simply preserve neoliberalism, by banning both artificial superintelligence and human genetic engineering.
- To enforce a ban, you have to build a coalition between many actors, who are not Blue Black like transhumanists are. (A major motivation for me to write this post was to understand people's views across the political spectrum, to figure out how to communicate with them more effectively, or how to think about political alliances, and so on.)
- On the timescale of a century however, I'm unsure how stable a ban will be. It is useful to invent new systems of governance to handle transhumanist technologies.
- New political systems of governance for transhumanists.
- I have noticed that when I have discussions with people around new systems of governance:
- Blue Black power-seekers like me typically propose a system of power-seeking actors threatening mass violence on each other.
- Historical examples
- Supporting gun laws, and supporting keeping guns illegally if your country does not legally allow guns.
- Supporting the post-WW2 world order that keeps peace via states threatening nuclear war on each other.
- Most neoliberal policy is Blue Black: it assumes unrestrained competition via free markets is good, and that the harms and externalities of this competition are acceptable or can be mitigated later on.
- Industrial revolution and subsequent colonisation of the world by the British Empire
- English becoming the universal language, as countries that don't use English can't compete. (Chinese and French play this role to a smaller extent.) Singapore mandated English, India prioritised it, and most of South East Asia didn't prioritise English, which is hindering them.
- China and US leading the world on many zero-to-one technologies, due to atheist culture in their top research institutes.
- Blue White technocrats typically propose new international treaties, laws and political structures that assume the existence of at least some systems that remain benevolent, or some actors that remain benevolent.
- Historical examples
- International treaties to ban human gene cloning or CFCs for ozone depletion
- Deregulation, reducing licensing and zoning restrictions in housing and various industries, while not eliminating regulation completely
- Carbon credit system
- Inflationary monetary supply and the eurodollar system to loan dollars to foreign govts
- Venture capitalist ecosystem funding fundamental R&D for transhumanist technologies like anti-aging
- Blue Red freedom-lovers typically propose more open source solutions that involve giving the weapons or technology to everyone
- Historical examples
- Open source movement in the software industry such as Linux, the Free Software Foundation or Github today
- Democratising scientific knowledge via internet such as Arxiv, internet archive, libgen, bittorrent
- Open sourcing specific hard tech like drones or robotics or pharmaceutical manufacturing
- Focus on decriminalisation and harm reduction for hard drugs
- Cypherpunks movement that was pro-encryption and privacy, and led to Tor, Signal and blockchain.
- Blue Green people I have met in practice often propose new ideology to persuade actors to become more benevolent, without proposing any specific changes in political systems
- They generally accept neoliberalism will continue, and have a neutral stance towards it.
- My ability to pass ITT for Blue Green people is pretty bad, so I am going to skip this section entirely.
- I could dive deeper into the specific disagreements between Blue Black, Blue Red, Blue White and Blue Green, but I don't want to get into detail here. I generally do think though, that no matter which camp you belong to, it is useful to be able to pass the ideological turing tests of the other camps.
Discuss
The only important ASI timeline
In the discussion of AI safety and the existential risk that ASI poses to humanity, I think timelines aren’t the right framing. Or at least, they often distract from the critical point: it doesn’t matter whether ASI arrives in 5 years’ time or in 20 years’ time, it only matters that it arrives during your lifetime[1]. The risks due to ASI are completely independent of whether it arrives during this hype cycle of AI, or whether there’s another AI winter, progress stalls for 10 years, and ASI is built after that winter has passed. If you are convinced that ASI is a catastrophic global risk to humanity, the timelines are somewhat inconsequential; the only things that matter are that 1. we have no idea how we could make something smarter than ourselves without it also being an existential threat, and 2. we can start making progress on this field of research today.
So ultimately, I’m uncertain about whether we’re getting ASI in 2 years or 20 or 40. But it seems almost certain that we’ll be able to build ASI within my lifetime[2]. And if that’s the case, nothing else really matters besides making sure that humanity equally realises the benefits of ASI without it also killing us all due to our short-sighted greed.
- ^
or the lifetime of the people you care about, which might include all future humans.
- ^
Concretely, before 2080
Discuss
Book Announcement: The Gentle Romance
It’s been eight months since I released my last story, so you could be forgiven for thinking that I’d given up on writing fiction. In fact, it’s the opposite. I’m excited to announce that I’m releasing my first fiction collection—The Gentle Romance: Stories of AI and Humanity—with Encour Press in mid-December!
It contains 22 stories, most of which are revised versions of the best stories I’ve posted online. The thread that connects them is the struggle to hold onto our identities in the face of radical technological change—the same thread that winds through many of our own lives.
I’ve also written three new stories for the collection, which are some of my favorites:
- Lentando is set in a future inspired by Charles Stross’ masterpiece Accelerando. Through it we follow Liza, a zero-knowledge consultant whose soon-to-be-deleted copies struggle to hold the world together.
- The Biggest Short is the story of two traders who make a fortune buying and selling reputations while struggling to preserve their own.
- Kuhn’s Ladder is about a simulated utopia that starts experiencing inexplicable glitches which seem designed to remain hidden.
You can preorder the book here.
Looking forward, it’s hard to say how much writing I’ll be doing over the next few years. I have a sense that static text will soon no longer be the best way to tell stories, so I want to try to figure out what the successor format to the short story will be. But I still have a few more stories in the pipeline, including one that’s my most ambitious effort so far—all of which I’ll post here when they’re finished.
Lastly, some thanks are in order. To Jessy Wu and Marlene Baquiran at Encour Press, whose efforts produced a much better version of this book (and made it possible at all). To Madeleine, who uplifts and inspires me. And to all of you who have read or commented on my writing over the last few years. I’m grateful for the encouragement that moved me to create The Gentle Romance, and I hope that you will enjoy it!
Discuss
Three Kinds Of Ontological Foundations
Why does a water bottle seem like a natural chunk of physical stuff to think of as “A Thing”, while the left half of the water bottle seems like a less natural chunk of physical stuff to think of as “A Thing”? More abstractly: why do real-world agents favor some ontologies over others?
At various stages of rigor, an answer to that question looks like a story, an argument, or a mathematical proof. Regardless of the form, I’ll call such an answer an ontological foundation.
Broadly speaking, the ontological foundations I know of fall into three main clusters.
Translatability Guarantees

Suppose an agent wants to structure its world model around internal representations which can translate well into other world models. An agent might want translatable representations for two main reasons:
- Language: in order for language to work at all, most words need to point to internal representations which approximately “match” (in some sense) across the two agents communicating.
- Correspondence Principle: it’s useful for an agent to structure its world model and goals around representations which will continue to “work” even as the agent learns more and its world model evolves.
Guarantees of translatability are the type of ontological foundation presented in our paper Natural Latents: Latent Variables Stable Across Ontologies. The abstract of that paper is a good high-level example of what an ontological foundation based on translatability guarantees looks like:
Suppose two Bayesian agents each learn a generative model of the same environment. We will assume the two have converged on the predictive distribution (i.e. distribution over some observables in the environment), but may have different generative models containing different latent variables. Under what conditions can one agent guarantee that their latents are a function of the other agent’s latents?
We give simple conditions under which such translation is guaranteed to be possible: the natural latent conditions. We also show that, absent further constraints, these are the most general conditions under which translatability is guaranteed.
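To make the abstract a bit more concrete, here is a sketch of the two natural latent conditions. This is my paraphrase, not the paper's exact statement; the symbols $\Lambda$, $X_i$, and $f_i$ are my own notation:

```latex
% Sketch of the natural latent conditions (paraphrased; notation is mine).
% Let X_1, ..., X_n be observables and \Lambda a latent variable.

% Mediation: the observables are independent given the latent.
P[X_1, \dots, X_n \mid \Lambda] = \prod_{i} P[X_i \mid \Lambda]

% Redundancy: the latent is (approximately) recoverable from the
% observables even with any single X_i dropped.
\Lambda \approx f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)
  \quad \text{for each } i
```

Roughly, as I understand the result, a latent satisfying both conditions can be computed (approximately) from any other agent's latent which mediates the same observables, and that is what delivers the translatability guarantee.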
Environment Structure

A key property of an ideal gas is that, if we have even just a little imprecision in our measurements of its initial conditions, then chaotic dynamics quickly wipes out all information except for a few summary statistics (like e.g. temperature and pressure); the best we can do to make predictions about the gas is to use a Boltzmann distribution with those summary statistics. This is a fact about the dynamics of the gas, which makes those summary statistics natural ontological Things useful to a huge range of agents.
Looking at my own past work, the Telephone Theorem is aimed at ontological foundations based on environment structure. It says, very roughly:
When information is passed through many layers, one after another, any information not nearly-perfectly conserved through nearly-all the “messages” is lost.
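As a toy illustration of that flavor of claim (my own example, not from the Telephone Theorem paper): pass a bit through a chain of binary symmetric channels, each standing in for one noisy "message", and the mutual information with the start of the chain decays monotonically, per the data processing inequality. Information not near-perfectly conserved at each hop is eventually lost:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_info_after_n_steps(eps, n):
    """I(X_0; X_n) for a uniform random bit passed through n binary
    symmetric channels, each flipping the bit with probability eps.
    Composing n such channels gives effective flip probability
    p_n = (1 - (1 - 2*eps)**n) / 2."""
    p_n = (1 - (1 - 2 * eps) ** n) / 2
    return 1 - binary_entropy(p_n)

# Mutual information with the start of the chain after 1..10 noisy hops.
mis = [mutual_info_after_n_steps(0.1, n) for n in range(1, 11)]
```

With eps = 0.1, the chain retains about 0.53 bits after one hop and under 0.01 bits after ten; only information encoded robustly enough to survive every layer propagates over long distances.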
A more complete ontological foundation based on environment structure might say something like:
- Information which propagates over long distances (as in the Telephone Theorem) must (approximately) have a certain form.
- That form factors cleanly (e.g. in the literal sense of a probability distribution factoring over terms which each involve only a few variables)
Mind Structure

Toward Statistical Mechanics Of Interfaces Under Selection Pressure talks about the “APIs” used internally by a neural-net-like system. The intuition is that, in the style of stat mech or singular learning theory, the exponential majority of parameter-values which produce low loss will use APIs for which a certain entropic quantity is near-minimal. Insofar as that’s true (which it might not be!), a natural prediction would be that a wide variety of training/selection processes for the same loss would produce a net using those same APIs internally.
That would be the flavor of an ontological foundation based on mind structure. An ideal ontological foundation based on mind structure would prove that a wide variety of mind structures, under a wide variety of training/selection pressures, with a wide variety of training/selection goals, converge on using “equivalent” APIs or representations internally.
All Of The Above?

Of course, the real ideal for a program in search of ontological foundations would be to pursue all three of these types of ontological foundations, and then show that they all give the same answer. That would be strong evidence that the ontological foundations found are indeed natural.
Discuss