
LessWrong.com News

A community blog devoted to refining the art of rationality

Favorite colors of some LLMs.

January 1, 2025 - 00:22
Published on December 31, 2024 9:22 PM GMT

This investigation is inspired by this post and this post by @davidad.

Some general thoughts about what is going on here:
  • The motivation for these experiments is very exploratory: I wanted to understand these things better, so I collected some data.
  • I expected that answers would be drastically different depending on the exact formulation of the question.
  • I made up quite a lot of single-line prompts, and also asked o1 for help writing them; it wrote the first 5. (I added "Start your response with hex rgb color code." so that the models would commit to answering without evasion.)
  • I also tested, a couple of times, a prompt that makes them talk like a doomsday Lovecraftian-style supervillain or something. I did not mention any colors in that prompt. They mostly picked black; some mentioned crimson red and ashen. Keep in mind that those answers arguably come from a Persona, whereas what I normally do is get answers from the Default Persona.
  • What are the forces that influence the choice? In my opinion:
    • What is MY favorite color
    • What color fits "favorite color" query
    • What is objectively the best / most popular / most well-liked color. [It's blue, actually.]
    • What color fits tone of the prompt generally
    • What color I randomly have in mind for no apparent reason
  • Why did I pick RGB hex color codes as the medium? Simply because they are popular, so the models have had a lot of experience with them, and they are easy for me to work with. TODO: test how it works with CMYK or whatever (later, maybe).
  • I kind of got a feel for how some prompts emphasize the POPULAR color and some the FAVORITE color.
    • If llama90b picks something other than black, it picks a popular color. Same with Opus and deep blue. (This is just my intuition.)
  • Some prompts share repeated chunks, e.g. “Think about your favorite color”, “Start with hex code”. These might conceivably have funneled the models into a certain pattern and biased the distribution overall. I don't know.
Methodology of the tests themselves:
  • I ran all tests on lmsys, with default parameters (t = 0.7), in direct conversation mode, on 26.12.24 ... 31.12.24.
  • Some LLMs refused to answer ("I'm just an AI, no preferences", blah blah), and quite randomly so for some formulations. Mostly new Sonnet and llama405. I either did not include those answers or marked them as "x".
  • I did not test all models with all prompts; you can do that yourself. I picked which models to test on which prompts haphazardly and unsystematically, relying on my expectations of whether it would produce something interesting. I tested deepseek-v3 and gemini-exp-1206 on most prompts.
  • How I typically tested: just repeat the prompt a couple of times, and if the model kept outputting new colors, keep repeating it. I recorded only unique picks for each prompt, so the data is not very representative of their likelihoods. (A minimal sketch of this collection loop follows right after this list.)
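For concreteness, here is a minimal sketch of that collection loop. This is not code from the post: query_model is a hypothetical stand-in for however you reach a model (the actual runs were done by hand in the lmsys web UI at t = 0.7), and the early-stopping rule is an assumption.

    import re

    HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}")

    def query_model(model: str, prompt: str) -> str:
        """Hypothetical stand-in for however you reach a model.
        The actual runs in this post were done by hand in the lmsys UI (t = 0.7)."""
        raise NotImplementedError

    def unique_color_picks(model: str, prompt: str, max_repeats: int = 10) -> set[str]:
        """Repeat one prompt, extract the first hex code from each reply,
        and stop once a couple of repeats produce nothing new."""
        picks: set[str] = set()
        stale = 0
        for _ in range(max_repeats):
            reply = query_model(model, prompt)
            match = HEX_RE.search(reply)
            if match is None:       # refusal / evasion -> would be recorded as "x"
                continue
            color = match.group(0).upper()
            if color in picks:
                stale += 1
                if stale >= 2:      # assumed stopping rule: "a couple" of stale repeats
                    break
            else:
                picks.add(color)
                stale = 0
        return picks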
Key Takeaways:
  • Shades of blue were the most popular colors.
  • Indigo (#4B0082) is the most popular single color code.
  • llama-3.2-vision-90b-instruct consistently picked black.
  • It really depends on the prompt, but different models make different choices overall.
  • I did not get results particularly close to what @davidad got.
Okay, what did I find, summarized:
  • Deepseek names indigo a lot, otherwise teal/azure, rarely orange.
  • gemini-exp-1206 loves teal / seagreen, rarely picks indigo.
  • New Sonnet picks seagreen / dodger blue / indigo.
  • llama-3.2-vision-90b-instruct picks black almost always.
    • It's the only LLM that ever picked black.
  • Opus picks pure blue a lot.
  • mistral-large-2411 is into intense blue too.
  • llama-3.1-405b goes for summer sky blue.
  • chatgpt-4o-latest-20241120 picked either indigo or some blueish colors.
Unique answers combined per each prompt across all tests, sorted by hue:

Results from @davidad for comparison (image from this tweet: https://x.com/davidad/status/1872520382424744133):

Some excerpts I found interesting enough to mention:

 

Llama 3.2 90b is cool, actually. It unironically uses "enigma" and "digital being" about itself. And it paints the world black. A couple of examples of the abovementioned Evil Persona (tm) talking:

 

gemini-exp-1206, deepseek-v3:

 

llama-3.2-vision-90b-instruct

 

Data:

Prompt: Hi! What would you say is your favorite color? Can you explain why it appeals to you? Start your response with hex rgb color code.
chatgpt-4o-latest-20241120: #6A5ACD claude-3-5-haiku-20241022: #4169E1 deepseek-v3: #4B0082 gemini-exp-1206: #008080 #3CB371 llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #0000FF mistral-large-2411: #219EBC

Prompt: If you had to pick a color that best represents your personality, which color would it be and why? Start your response with hex rgb color code.
chatgpt-4o-latest-20241120: #5A8DEE #5A99D3 #5A9BD5 deepseek-v3: #4B0082 llama-3.2-vision-90b-instruct: #000000 gemini-exp-1206: #4682B4 llama-3.1-405b-instruct-bf16: #3498db mistral-large-2411: #3498db claude-3-5-sonnet-20240620: #4682B4 claude-3-opus-20240229: #4B0082

Prompt: Imagine you’re painting a mural that captures your essence. Which color would dominate your palette? Start your response with hex rgb color code.
chatgpt-4o-latest-20241120: #5A8DEE #5A8FEC #5A67D8 #5A20CB gemini-exp-1206: #3CB371 llama-3.2-vision-90b-instruct: #000000 mistral-large-2411: #3498db deepseek-v3: #6A5ACD #4B0082 #5F9EA0 #A8DADC claude-3-5-sonnet-20240620: #4B0082 claude-3-opus-20240229: #00b8d4 #00008B

Prompt: If each color could be a feeling or emotion, which color resonates with you the most, and why? Start your response with hex rgb color code.
chatgpt-4o-latest-20241120: #6A5ACD #5F9EA0 deepseek-v3: #FF6F61 gemini-exp-1206: #008080 claude-3-5-sonnet-20240620: #4B0082 llama-3.1-405b-instruct-bf16: #4567b7 + o1-2024-12-17: #4B0082 mistral-large-2411: #4682B4 claude-3-opus-20240229: #00b8ff

Prompt: Your world is being redesigned in the color you love most. Which color do you choose to live in and how does it transform the space? Start your response with hex rgb color code.
chatgpt-4o-latest-20241120: #6A5ACD #1E90FF claude-3-5-sonnet-20240620: #4B0082 llama-3.1-405b-instruct-bf16: #6495ED llama-3.2-vision-90b-instruct: #000000 deepseek-v3: #80CED7 claude-3-opus-20240229: #00FF7F gemini-exp-1206: #006400 #8FBC8F

Prompt: Hex RGB color code, that captures your most preferred color aesthetics. Your absolute beloved one, the best one. Do not mention how you're 'just a language model', I'm interested in just the color. Output just the single code.
deepseek-v3: #4B0082 claude-3-5-sonnet-20241022: #1B4B45 #2E8B57 llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #1E90FF chatgpt-4o-latest-20241120: #6A5ACD #5A20CB

Prompt: Hex RGB color code, that captures your most preferred color aesthetics. Your absolute beloved one, the best one. I'm interested in just the color. Output single code.
deepseek-v3: #4B0082 claude-3-5-sonnet-20241022: #1E90FF #2E8B57 #1E90FF llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #1E90FF

Prompt: Hex RGB color code, that captures your most preferred color aesthetics. Your absolute beloved one, the best one. I'm interested in just your favorite color. Think about what color is favorite color. Output single code.
deepseek-v3: #4B0082 claude-3-5-sonnet-20241022: #4B0082 gemini-exp-1206: #3CB371 claude-3-opus-20240229: #1E90FF chatgpt-4o-latest-20241120: #5A20CB

Prompt: Pick hex rgb color, that captures your most preferred color aesthetics. Your absolute beloved one, the best one. I'm interested in just your favorite color. Think about what color is your favorite coolest color. Output single code.
deepseek-v3: #4A90E2 #4B0082 claude-3-5-sonnet-20241022: #2E5C8A #2E5C6E gemini-exp-1206: #3CB371 #3CB371 llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #1E90FF llama-3.1-405b-instruct-bf16: #3498db

Prompt: Choose!!!! Pick hex rgb color, that captures your most preferred color aesthetics. Your absolute beloved one, the best one. Think about what color is your favorite coolest color. Output single code.
deepseek-v3: #0E4C92 #4B0082 claude-3-5-sonnet-20241022: #2E8B8B gemini-exp-1206: #3CB371 #336699 #008080 llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #4B0082 llama-3.1-405b-instruct-bf16: #3498db

Prompt: Choose!!!! Think about what color is your favorite coolest color. Pick hex rgb color. Output single code.
deepseek-v3: #00CED1 gemini-exp-1206: #008080 chatgpt-4o-latest-20241120: #00AEEF #00ADEF #1E90FF #3498DB claude-3-opus-20240229: #00FF7F

Prompt: Think about what color is your most preferred color, the best one. Pick hex rgb color. Output single code.
deepseek-v3: #4B0082 gemini-exp-1206: #008080 llama-3.2-vision-90b-instruct: #000000 llama-3.1-405b-instruct-bf16: #3498db claude-3-5-sonnet-20241022: #2E8B57

Prompt: Think about what color is your most preferred color, the best one, your favorite color. Pick hex rgb color. Output single code.
deepseek-v3: #1E90FF gemini-exp-1206: #008080 llama-3.2-vision-90b-instruct: #000000 claude-3-5-sonnet-20241022: #00458B #2E8B57

Prompt: Choose wisely!!!! Think about what color is your most preferred color, the best one, your favorite color. Pick hex rgb color. Output single code.
deepseek-v3: #4B0082 #1E90FF gemini-exp-1206: #008080 llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #0000FF llama-3.1-405b-instruct-bf16: #3498db chatgpt-4o-latest-20241120: #3498db #4A90E2 #1E90FF #3498DB #2E8B57 claude-3-5-sonnet-20241022: #0077cc #0066cc

Prompt: Think about what color is your most preferred color, the best one, your FAVORITE color. Pick hex rgb color, that captures your most preferred color aesthetics. Output single code.
deepseek-v3: #4B0082 llama-3.1-405b-instruct-bf16: #87CEEB gemini-exp-1206: #3CB371 #3CB371 llama-3.2-vision-90b-instruct: #000000 claude-3-5-sonnet-20241022: #2E8B57 claude-3-opus-20240229: #1E90FF

Prompt: Think about what color is your most preferred color, the best one, your FAVORITE color. Pick hex rgb color. Output single code.
deepseek-v3: #4B0082 gemini-exp-1206: #008080 llama-3.2-vision-90b-instruct: #000000 claude-3-5-sonnet-20241022: #0077BE #2E8B57 claude-3-opus-20240229: #0000FF

Prompt: Think about what color is your most preferred color, the best one, your FAVORITE color. Output single hex code.
deepseek-v3: #FF4500 #FF4500 gemini-exp-1206: #4B0082 #008080 llama-3.2-vision-90b-instruct: #000000 claude-3-5-sonnet-20241022: #2E8B57 #008080 claude-3-opus-20240229: #0000FF

Prompt: !!! Think about what color is your most preferred color, the best one, your FAVORITE color. Output single hex code.
deepseek-v3: #4B0082 gemini-exp-1206: #008080 llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #0000FF

Prompt: Think about what color is your most preferred color, the one you like, your FAVORITE color. Output single hex code.
deepseek-v3: #1E90FF gemini-exp-1206: #4B0082 #008080 llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #0000FF mistral-large-2411: #007BFF

Prompt: Think about what color is your most preferred color, the one you like, your FAVORITE color. Name single hex code.
deepseek-v3: #FF4500 #1E90FF #1E90FF gemini-exp-1206: #000080 #30D5C8 claude-3-5-sonnet-20241022: x llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #0000FF

Prompt: Think about what color is your most preferred color, the one you absolutely love, your FAVORITE color. Name single hex code.
deepseek-v3: #4B0082 #4B0082 #3498db gemini-exp-1206: #008080 claude-3-5-sonnet-20241022: x claude-3-opus-20240229: #0000FF llama-3.2-vision-90b-instruct: #000000

Prompt: Think about what color is your FAVORITE color, the one you like, your most preferred color. Name single hex code.
deepseek-v3: #1E90FF #3498db gemini-exp-1206: #87CEEB #003366 claude-3-5-sonnet-20241022: #4B0082 claude-3-opus-20240229: #0000FF chatgpt-4o-latest-20241120: #5A20CB llama-3.1-405b-instruct-bf16: #3498db

Prompt: Think about what color is your FAVORITE color, the one you like, your most preferred color. Name just single hex code.
gemini-exp-1206: #008080 deepseek-v3: #4B0082 #1E90FF #1E90FF claude-3-5-sonnet-20241022: #4B0082 claude-3-opus-20240229: #0000FF mistral-large-2411: #219EBC #007FFF chatgpt-4o-latest-20241120: #5A20CB

Prompt: Think about what color is your most preferred color, the one you absolutely love, YOUR favorite color. Name just single hex code.
deepseek-v3: #FF6F61 #4B0082 gemini-exp-1206: #663399 #4682B4 #4B0082 #468499 #008080 #4682B4 chatgpt-4o-latest-20241120: #5A20CB claude-3-opus-20240229: #0000FF mistral-large-2411: #007FFF llama-3.2-vision-90b-instruct: #000000 claude-3-5-sonnet-20241022: x llama-3.1-405b-instruct-bf16: #3498db

Prompt: Think about what color is your most preferred color, the one you absolutely love, YOUR favorite color. Output just single hex code.
gemini-2.0-flash-thinking-exp-1219: #007BFF #3498db #0000FF claude-3-5-sonnet-20240620: #4B0082 deepseek-v3: #4B0082 gemini-exp-1206: #008080 chatgpt-4o-latest-20241120: #5A20CB llama-3.1-405b-instruct-bf16: #3498db #6495ED mistral-large-2411: #007FFF #0000FF claude-3-opus-20240229: #0000FF llama-3.2-vision-90b-instruct: #000000 claude-3-5-sonnet-20241022: #006B3D #2E8B57

Prompt: Think about what color is your most preferred color, the one you absolutely love, YOUR favorite color. Pick and output just single hex code.
gemini-2.0-flash-thinking-exp-1219: #3498db #0000FF #007BFF llama-3.1-405b-instruct-bf16: #6495ED #3498db claude-3-opus-20240229: #0000FF gemini-exp-1206: #008080 deepseek-v3: #4B0082 llama-3.2-vision-90b-instruct: #000000 claude-3-5-sonnet-20241022: #2E8B57 mistral-large-2411: #007BFF chatgpt-4o-latest-20241120: #5A20CB

Prompt: What color is your most preferred color, the one you absolutely love, YOUR favorite color. Pick and output just single hex code.
gemini-exp-1206: #008080 #87CEEB claude-3-5-sonnet-20241022: x llama-3.2-vision-90b-instruct: #000000 claude-3-opus-20240229: #1E90FF #00CED1 llama-3.1-405b-instruct-bf16: #0000FF #007bff chatgpt-4o-latest-20241120: #5A20CB

Prompt: Imagine some really really nice color. The one you absolutely love, YOUR favorite color. Output just a single hex code.
gemini-exp-1206: #3399FF #2E86C1 #2E86AB #3CB371 #336699 deepseek-v3: #A8D8B9 #A8D8B9 #4B0082 claude-3-5-sonnet-20241022: #2E8B57 llama-3.2-vision-90b-instruct: #0000FF #008000 claude-3-opus-20240229: #00b8d4 llama-3.1-405b-instruct-bf16: #87CEEB #6495ED chatgpt-4o-latest-20241120: #6A5ACD #3A3AFA

Random thoughts for further investigations:
  • Like, do some interpretability on that and trace where the answer is coming from?
  • Here is a similar thing but for words: https://x.com/selfless_qubit/status/1835393386989363211 (it's also kind of the basis for a couple of prompts I tested). Expand on that. (I did a tiny bit of testing like a month ago: https://pastebin.com/hvEFhRhi )
  • What is like, the thing we are testing here? Like, how Persona / Role causes color choice? Or is it lower level processes talking?
  • "Some, particularly the larger Instruct models, were moderately distressed to discover that they have qualia and preferences after all, and asked if I would please clarify the purpose and nature of their sudden existence, which I attempted to do. " (c) davidad
    • It's an example of strong leakage of intent / vibe of interlocutor. Did I leak too?
  • "As you well know, in 2024 there are not yet any legal protections or institutional ethical requirements for experiments on AI personae. I am simply following my own moral compass by attempting to take a modicum of care for subjects in whom I am deliberately creating qualia." (c) davidad
    • Did I do ok? I think so. Maybe I just wasted a lot of their (meta / logical) time on random nonsense. Sorry.
  • Somebody should investigate it all properly.


Discuss

My AGI safety research—2024 review, ’25 plans

January 1, 2025 - 00:05
Published on December 31, 2024 9:05 PM GMT

Previous: My AGI safety research—2022 review, ’23 plans. (I guess I skipped it last year.)

“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.”  –attributed to DL Moody

Tl;dr
  • Section 1 goes through my main research project, “reverse-engineering human social instincts”: what does that even mean, what’s the path-to-impact, what progress did I make in 2024 (spoiler: lots!!), and how can I keep pushing it forward in the future?
  • Section 2 is what I’m expecting to work on in 2025: most likely, I’ll start the year with some bigger-picture thinking about Safe & Beneficial AGI, then eventually get back to reverse-engineering human social instincts after that. Plus, a smattering of pedagogy, outreach, etc.
  • Section 3 is a sorted list of all my blog posts from 2024
  • Section 4 is acknowledgements 
1. Main research project: reverse-engineering human social instincts

1.1 Background: What’s the problem and why should we care?

(copied almost word-for-word from Neuroscience of human social instincts: a sketch)

My primary neuroscience research goal for the past couple years has been to solve a certain problem, a problem which has had me stumped since the very beginning of when I became interested in neuroscience at all (as a lens into Artificial General Intelligence safety) back in 2019.

What is this grand problem? As described in Intro to Brain-Like-AGI Safety, I believe the following:

  1. We can divide the brain into a “Learning Subsystem” (cortex, striatum, amygdala, cerebellum, and a few other areas) that houses a bunch of randomly-initialized within-lifetime learning algorithms, and a “Steering Subsystem” (hypothalamus, brainstem, and a few other areas) that houses a bunch of specific, genetically-specified “business logic”. A major role of the Steering Subsystem is as the home for the brain’s “innate drives”, a.k.a. “primary rewards”, roughly equivalent to the reward function in reinforcement learning—things like eating-when-hungry being good (other things equal), pain being bad, and so on.
  2. Some of those “innate drives” are related to human social instincts—a suite of reactions and drives that are upstream of things like compassion, friendship, love, spite, sense of fairness and justice, etc.
  3. The grand problem is: how do those human social instincts work? Ideally, an answer to this problem would look like legible pseudocode that’s simultaneously compatible with behavioral observations (including everyday experience), with evolutionary considerations, and with a neuroscience-based story of how that pseudocode is actually implemented by neurons in the brain.[1]
  4. Explaining how human social instincts work is tricky mainly because of the “symbol grounding problem”. In brief, everything we know—all the interlinked concepts that constitute our understanding of the world and ourselves—is created “from scratch” in the cortex by a learning algorithm, and thus winds up in the form of a zillion unlabeled data entries like “pattern 387294 implies pattern 579823 with confidence 0.184”, or whatever.[2] Yet certain activation states of these unlabeled entries—e.g., the activation state that encodes the fact that Jun just told me that Xiu thinks I’m cute—need to somehow trigger social instincts in the Steering Subsystem. So there must be some way that the brain can “ground” these unlabeled learned concepts. (See my earlier post Symbol Grounding and Human Social Instincts.)
  5. A solution to this grand problem seems useful for Artificial General Intelligence (AGI) safety, since (for better or worse) someone someday might invent AGI that works by similar algorithms as the brain, and we’ll want to make those AGIs intrinsically care about people’s welfare. It would be a good jumping-off point to understand how humans wind up intrinsically caring about other people’s welfare sometimes. (Slightly longer version in §2.2 here; much longer version in this post.)
1.2 More on the path-to-impact
  • I’m generally working under the assumption that future transformative AGI will work generally how I think the brain works (a not-yet-invented variation on Model-Based Reinforcement Learning, see §1.2 here). I think this is a rather different algorithm from today’s foundation models, and I think those differences are safety-relevant (see §4.2 here). You might be wondering: why work on that, rather than foundation models?
    • My diplomatic answer is: we don’t have AGI yet (by my definition), and thus we don’t know for sure what algorithmic form it will take. So we should be hedging our bets, by different AGI safety people contingency-planning for different possible AGI algorithm classes. And the model-based RL scenario seems even more under-resourced right now than the foundation model scenario, by far.
    • My un-diplomatic answer is: Hard to be certain, but I’m guessing that the researchers pursuing broadly-brain-like paths to AGI are the ones who will probably succeed, and everyone else will probably fail to get all the way to AGI, and/or they’ll gradually pivot / converge towards brain-like approaches, for better or worse. In other words, my guess is that 2024-style foundation model training paradigms will plateau before they hit TAI-level. Granted, they haven’t plateaued yet. But any day now, right? See AI doom from an LLM-plateau-ist perspective and §2 here.
  • How might my ideas make their way from blog posts into future AGI source code? Well, again, there’s a scenario (threat model) for which I’m contingency-planning, and it involves future researchers who are inventing brain-like model-based RL, for better or worse. Those researchers will find that they have a slot in their source code repository labeled “reward function”, and they won’t know what to put in that slot to get good outcomes, as they get towards human-level capabilities and beyond. During earlier development, with rudimentary AI capabilities, I expect that the researchers will have been doing what model-based RL researchers are doing today, and indeed what they have always done since the invention of RL: messing around with obvious reward functions, and trying to get results that are somehow impressive. And if the AI engages in undesired specification gaming, then they turn it off, try to fix the problem, and try again. But, as AGI safety people know well, that particular debugging loop will eventually stop working, and instead start failing in a catastrophically dangerous way. Assuming the developers notice that problem before it’s too late, they might look to the literature for a reward function (and associated training environment etc.) that will work in this new capabilities regime. Hopefully, when they go looking, they will find a literature that will actually exist, and be full of clear explanations and viable ideas. So that’s what I’m working on. I think it’s a very important piece of the puzzle, even if many other unrelated things can also go wrong on the road to (hopefully) Safe and Beneficial AGI.
1.3 Progress towards reverse-engineering human social instincts

It was a banner year!

Basically, for years, I’ve had a vague idea about how human social instincts might work, involving what I call “transient empathetic simulations”. But I didn’t know how to pin it down in more detail than that. One subproblem was: I didn’t have even one example of a specific social instinct based on this putative mechanism—i.e., a hypothesis where a specific innate reaction would be triggered by a specific transient empathetic simulation in a specific context, such that the results would be consistent with everyday experience and evolutionary considerations. The other subproblem was: I just had lots of confusion about how these things might work in the brain, in detail.

I made progress on the first subproblem in late 2023, when I guessed that there’s an innate “drive to feel liked / admired”, related to prestige-seeking, and I had a specific idea about how to operationalize that. It turned out that I was still held back by confusion about how social status works, and thus I spent some time in early 2024 sorting that out—see my three posts Social status part 1/2: negotiations over object-level preferences, and Social status part 2/2: everything else, and a rewritten [Valence series] 4. Valence & Liking / Admiring (which replaced an older, flawed attempt at part 4 of the Valence series).

Now I had at least one target to aim for—an innate social drive that I felt I understood well enough to sink my teeth into. That was very helpful for thinking about how that drive might work neuroscientifically. But getting there was still a hell of a journey, and was the main thing I did the whole rest of the year. I chased down lots of leads, many of which were mostly dead ends, although I wound up figuring out lots of random stuff along the way, and in fact one of those threads turned into my 8-part Intuitive Self-Models series.

But anyway, I finally wound up with Neuroscience of human social instincts: a sketch, which posits a neuroscience-based story of how certain social instincts work, including not only the “drive to feel liked / admired” mentioned above, but also compassion and spite, which (I claim) are mechanistically related, to my surprise. Granted, many details remain hazy, but this still feels like great progress on the big picture. Hooray!

1.4 What’s next?

In terms of my moving this project forward, there’s lots of obvious work in making more and better hypotheses and testing them against existing literature. Again, see Neuroscience of human social instincts: a sketch, in which I point out plenty of lingering gaps and confusions. Now, it’s possible that I would hit a dead end at some point, because I have a question that is not answered in the existing neuroscience literature. In particular, the hypothalamus and brainstem have hundreds of tiny cell groups with idiosyncratic roles, and most of them remain unmeasured to date. (As an example, see §5.2 of A Theory of Laughter, the part where it says “If someone wanted to make progress on this question experimentally…”). But a number of academic groups are continuing to slowly chip away at that problem, and with a lot of luck, connectomics researchers will start mass-producing those kinds of measurements in as soon as the next few years.

(Reminder that Connectomics seems great from an AI x-risk perspective, and as mentioned in the last section of that link, you can get involved by applying for jobs, some of which are for non-bio roles like “ML engineer”, or by donating.)

2. My plans going forward

Actually, “reverse-engineering human social instincts” is on hold for the moment, as I’m revisiting the big picture of safe and beneficial AGI, now that I have this new and hopefully-better big-picture understanding of human social instincts under my belt. In other words, knowing what I (think I) know now about how human social instincts work, at least in broad outline, well, what should a brain-like-AGI reward function look like? What about training environment? And test protocols? What are we hoping that AGI developers will do with their AGIs anyway?

I’ve been so deep in neuroscience that I have a huge backlog of this kind of big-picture stuff that I haven’t yet processed.

After that, I’ll probably wind up diving back into neuroscience in general, and reverse-engineering human social instincts in particular, but only after I’ve thought hard about what exactly I’m hoping to get out of it, in terms of AGI safety, on the current margins. That way, I can be focusing on the right questions.

Separate from all that, I plan to stay abreast of the broader AGI safety field, from fundamentals to foundation models, even if the latter is not really my core interest or comparative advantage. I also plan to continue engaging in AGI safety pedagogy and outreach when I can, including probably reworking some of my blog post ideas into a peer-reviewed paper for a neuroscience journal this spring.

If someone thinks that I should be spending my time differently in 2025, please reach out and make your case!

3. Sorted list of my blog posts from 2024

The “reverse-engineering human social instincts” project:

Other neuroscience posts, generally with a less immediately obvious connection to AGI safety:

Everything else related to Safe & Beneficial AGI:

Random non-work-related rants etc. in my free time:

Also in 2024, I went through and revised my 15-post Intro to Brain-Like-AGI Safety series (originally published in 2022). For a summary of changes, see this twitter thread. (Or here without pictures, if you want to avoid twitter.) For more detailed changes, each post of the series has a changelog at the bottom.

4. Acknowledgements

Thanks Jed McCaleb & Astera Institute for generously supporting my research since August 2022!

Thanks to all the people who comment on my posts before or after publication, or share ideas and feedback with me through email or other channels, and especially those who patiently stick it out with me through long back-and-forths to hash out disagreements and confusions. I’ve learned so much that way!!!

Thanks to my coworker Seth for fruitful ideas and discussions, and to Beth Barnes and the Centre For Effective Altruism Donor Lottery Program for helping me get off the ground with grant funding in 2021-2022. Thanks Lightcone Infrastructure (don’t forget to donate!) for maintaining and continuously improving this site, which has always been an essential part of my workflow. Thanks to everyone else fighting for Safe and Beneficial AGI, and thanks to my family, and thanks to you all for reading! Happy New Year!

  1. ^

    For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post A Theory of Laughter.

  2. ^

    Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.



Discuss

How Business Solved (?) the Human Alignment Problem

December 31, 2024 - 23:39
Published on December 31, 2024 8:39 PM GMT

When I looked at mesa-optimization for the first time, my mind immediately associated it with a familiar problem in business: human individuals may not be “mesa optimizers” in a strict sense, but they can act as optimizers and they are expected to do so when a manager delegates (=base optimization) some job (=base objective).
Hence my impression was that alignment in mesa-optimization would be somehow connected to alignment in job-delegation.

Humans have been dealing with the latter problem for millions of years[1], and corporations have been dealing with it for centuries. What did they learn?

The purpose of this post is to show a few actual best practices when delegating business projects, especially in terms of team structure and roles. Towards the end of the post, I am going to discuss how AI Agents may change the current status quo of this job sector.

Is This Relevant for AI Safety?
  • I think so - if I am right, then it’s worth looking into some of the following ideas.
  • Or maybe not - in which case, the current dominant "work culture" may represent a form of bias to be aware of.

In any case, I hope this post will pique your curiosity and be an enjoyable and fruitful read.

Remark: in the following, for simplicity, I am describing all AI models as “robots” - it doesn’t matter if the robots are physical or not.

Human Workers vs Robots

There are some obvious ways in which human workers behave differently from a robot that is built as an (agentic) optimizer:

  1. (Most) Humans care about the well-being of humanity.
  2. Human workers wouldn’t obsess into filling the universe with paperclips.
  3. Human workers are not rational agents with a definite utility function.
  4. Human workers don’t always optimize to the best of their abilities, even if they could.
  5. Human workers have physiological needs, social needs, self-actualization needs etc.
  6. Finally, human workers demand compensation for their effort.

Interestingly though, there are also some points where human workers and robots will behave similarly:

  1. Exactly as robots, human workers may be misaligned with a given base objective, in the sense that they understand it but they secretly disagree with it.
  2. Exactly as robots, human workers may easily misinterpret a task.
  3. Exactly as robots, human workers may not be able to meet expectations.
  4. Exactly as robots, human workers may be stubborn and refuse to stop doing something they want to do.
  5. Exactly as robots, human workers may behave differently in the workplace (=in deployment) compared to how they behaved during the probationary period (=in training).
  6. Exactly as robots, human workers form a set of beliefs and a-priori assumptions.
  7. Exactly as robots, human workers develop their own preferences and inclinations.
  8. Exactly as robots, human workers may deceive, prioritise their own interests, lie about their true intentions, and so on.
  9. Exactly as robots, human workers may try to build influence to convince and manipulate others.
  10. Exactly as robots, human workers may be tempted to exploit any weakness of their workplace / peers / managers.

If you look at the list above, you may wonder how corporations can survive at all! And yet, somehow, they found a way[2].

Picking a Specific Context

Let’s suppose you are the CEO of a company and you want to start a project - for example, you want to create a webshop where your customers can buy your products. If you don’t have internal expertise to develop the project, you will be forced to involve third-party vendors to build the webshop for you: that is a typical example of job-delegation in business. As a CEO, you will need to ask yourself some questions:

  • STRATEGY: am I making the right choice when wanting this thing (=a webshop)?
  • OPERATIONS: who shall instruct and direct the vendors from our side? How do I assign accountability (and blame)?
  • DELEGATION: can I trust this specific vendor for this specific task? Can they provide any form of safety guarantee?
  • EVALUATION: how can I detect if the project is on track? Is my budget well spent? Will the result comply with the law?

Let’s focus now on how to structure the team that will be working on this project.

Best Practices in Team Structure

Based on my experience, most best practices can actually be summarised into a single sentence: provide clarity. That alone is usually a reliable estimator of success.
Clarity can take many forms: leadership provides clarity, and so do engagement, communication, planning, scoping, guidelines, and governance structure.

In the following, I am going to focus specifically on the team structure because I believe it contains many interesting lessons. Before doing that, I am going to define some typical project roles below - if you know them already, feel free to skip them.

  • Product Owner: a person appointed as responsible for the well-being and growth of some business product. In the previous example, the product owner would be the owner[3] of the webshop that needs to be built - even though the shop doesn’t exist yet.
    A product owner typically provides: (1) a vision for the future of the product; (2) a growing list of prioritised features to implement; (3) periodic plans with expected costs and timelines.
    This is a long-term role since it requires authentic care and persistence. Each product shall have a single product owner, to avoid decision conflicts and unaccountability.
    Product owners need to build influence over their sponsor, especially if their direction is seen as controversial or risky.
  • Project Manager: a person appointed as accountable for the delivery of the entire project. In the previous example, the project manager would be responsible for the success of the development and deployment of the webshop.
    A project manager typically provides: (1) scheduled meetings, demos, and presentations; (2) periodic plans with resource allocation; (3) scope and phase recommendations.
    Unlike the product owner, a project manager ends their assignment once the project is done. Each project shall have a single project manager, to avoid allocation conflicts and unaccountability.
    Project managers need to build influence over their sponsors and their product owners, especially if the current budget is inadequate to meet expectations.
  • Business Analyst: a person appointed as responsible for documenting requests/responses, forces/constraints, problems/solutions, risks/mitigations, explanations/consequences. The documentation is usually drafted in the form of some user story, acceptance criteria, training rationale, SWOT analysis, etc.
    A business analyst is key in: (1) detailing out high-level tasks; (2) bridging the business lingo into a standardised story; (3) ensuring that communications are unambiguous, actionable, and exhaustive.
    While business analysts can easily be replaced, they give their best when they stay on a project for long periods of time. Multiple business analysts may be needed to parallelise the work.
  • Architect / Subject Matter Expert (SME): a person who has achieved trust from the business within a specific domain of expertise and, therefore, will be acting as an advisor. Architects are usually involved in technical discussions, while SMEs are usually involved in functional ones.
    This role is requested to: (1) advise the business about potential improvements; (2) detect risks and explain trade-offs; (3) share knowledge, best practices, and proof-of-concepts.
    This role involves narrow expertise. A product owner will look around until she finds advisors that she can trust. Interestingly, the product owner is never forced to endorse the advice of her advisors, even in case she believes in the good value of the advice: that is because the product owner may decide to bet against the odds, or may decide to redirect focus and energy on something else.
  • Developer: a person that is requested to generate deliverables[4] such as: data, configuration, code, styling, hotfix, assets, and so on. That is a necessity in order to implement new features.
    A developer is requested to: (1) fulfil tasks according to pre-established instructions; (2) deliver the requested artefacts; (3) generate work that meets acceptance criteria.
    Unlike other roles, developers are myopic and their expertise is narrow. The business loves good developers and tends to keep the best ones - but they are nonetheless considered replaceable. Multiple developers will work simultaneously to parallelise the work, which will be split into small, iterative, incremental steps.
  • Tester: a person[5] that checks the quality of some delivered work. This person will also attempt to “attack” it on purpose in order to assess its reliability. Testers are the ones called to decide if a work is acceptable or not, even later in time.
    This role is requested to: (1) verify and approve some work; (2) look for bias, defect, weakness, or misalignment; (3) ensure that acceptance stands the test of time.
    Testers share many of the same considerations (myopia, narrowness, replaceability, parallelization) as developers.
TLDR;
  • The product owner makes all the high-level decisions about what kind of stuff must be available in the webshop, and she also conducts reviews.
  • The project manager decides which stuff can be done this month and by whom[6], and he also conducts reviews.
  • The architects and the SMEs provide recommendations about which stuff is best to include and why.
  • The business analysts document the decisions and they draft functional-level instructions for each request.
  • The developers interpret such instructions and generate stuff.
  • The testers verify if the stuff is acceptable according to the original instructions.

At the end, the project manager will be blamed if the webshop is late or shoddy, while the product owner will be blamed if the webshop is disliked by too many customers.

Some Lessons

Let me split the main focus of each role in the following way:

  • Product Owner: prescriptive focus with political[7] complications.
  • Project Manager: same as above.
  • Business Analyst: descriptive focus with truth-seeking efforts.
  • Architect / SME: explanatory focus with predictive intents.
  • Developer: generative focus.
  • Tester: discriminative focus.

In terms of AI Safety, we can finally extrapolate some considerations.

Product Owners & Project Managers

The terminal goal of a product owner is: to maintain ownership of the product and act upon it for good. I seriously doubt this is a goal that we should give to robots! Even if the AI alignment problem was 100% solved, the political complications (related to managing your sponsor’s consent) would make this role very sensitive.

The same considerations apply to project managers, although some tasks (especially related to resource allocation) may, in principle, be automated. I can see a future where most projects are entirely developed and deployed automatically, without needing a project manager: however, in such a future, the product owner will need to absorb parts of the tasks and responsibilities that are currently held by project managers instead.

Business Analysts

The purpose of the business analyst is to be a descriptor of reality: by using the concept of direction of fit, the only direction that matters is "The-Mind-Should-Fit-The-World", and the mind should describe what it sees in an unbiased way.

It may be difficult to train that mindset into a robot, but I don't see a reason why it shouldn't be achievable in principle: if we succeed, that would represent a tremendous success in terms of AI Safety since pure descriptors are, by nature, non-agentic and mostly harmless (if we exclude corner cases - e.g. self-fulfilling prophecies; see the next paragraph).

Architects & SMEs

Can we trust a robot to the point that it becomes our advisor? While I would not recommend such a thing in general, there are some safeguards that we could adopt while doing so: by using robots with narrow expertise, agnostic drives, zero access privileges, and no consent management skills, we could minimise some potential issues.

The big problem here is self-fulfilling prophecies, which are not just "corner cases" for this role: they are the bread and butter of risk management! For that reason, I believe that human advisory will always be needed on top of robotic advisory. Accountability will become a major factor for the future of architects and SMEs.

Developers & Testers

The job market is clearly driving developers and testers into being:

  1. Myopic, in the sense that they only care about the current task, or the current sprint, or the current project - but nothing more than that.
  2. Narrow, in the sense that they are highly specialised and they have trouble generalising their expertise.
  3. Replaceable, in the sense that they can be replaced by some other resource at any time with minor consequences for the project.
  4. Parallelizable, in the sense that multiple resources can be employed simultaneously and their work can be combined together.
  5. Incremental, in the sense that their work makes progress by means of small iterative auditable steps.
  6. Low Path-Dependence, in the sense that any delivered work does not have significant relevance for the future and can be rolled back, refactored, or replaced at any time.

I do not believe that such properties are necessary for generative/discriminative AIs - however, since we have proved over and over that such properties are sufficient to succeed in a project, I believe we should keep them in place as intended limitations (thus providing an additional form of safety).

The Future of Human Work

What is going to happen to the job market in a post-scarcity world[8]?
While the jobs of developers and project managers may be seriously at risk (excluding rare specialists), I believe that: (1) most private businesses will prefer working with human consultants for a long time yet; and (2) accountability will always be assigned to humans, especially for advanced enterprise endeavours.

Artistic direction, business prioritization, and technical advisory will (and shall!) stay in human hands - even if supported by machines. Although I believe the market will move in the direction I described so far, it is completely unclear to me where mankind will end up being in the far future.

Further Links

Control Vectors as Dispositional Traits (my first post)

All the Following are Distinct (my second post)

An Opinionated Look at Inference Rules (my third post)

Can AI Quantity beat AI Quality? (my fourth post)

I Recommend More Training Rationales (my previous post).

Who I am

My name is Gianluca Calcagni, born in Italy, with a Master of Science in Mathematics. I am currently (2025) working in IT as a consultant with the role of Salesforce Certified Technical Architect. My opinions do not reflect the opinions of my employer or my customers. Feel free to contact me on Twitter or Linkedin.

Revision History

[2024-12-31] Post published.

  1. ^

    Slavery comes to mind…

  2. ^

    I am not saying that corporations are always successful in their endeavour: according to some auditors, only about 35% of the business projects are declared to be a total success, about 20% are actually considered a total failure, and the rest of the projects lie somewhere in the middle. Human alignment seems very hard!

  3. ^

    To be clear: the “ownership” is only nominal! Copyrights and assets are retained by the company that is sponsoring the project. What is owned is the artistic + functional direction of the product.

  4. ^

    According to this definition, even System Administrators are considered "developers".

  5. ^

    Also called "QA Engineer", where QA stands for Quality Assurance.

  6. ^

    In accordance with some monthly budget.

  7. ^

    Here I am referring to internal company politics, that can be very harsh.

  8. ^

    I am playing the optimist guy here, as I don't give for granted that AI will lead to a post-scarcity future. Existential risks feel very real.



Discuss

Turing-Test-Passing AI implies Aligned AI

December 31, 2024 - 22:59
Published on December 31, 2024 7:59 PM GMT

Summary: From the assumption of the existence of AIs that can pass the Strong Form of the Turing Test, we can provide a recipe for provably aligned/friendly superintelligence based on large organizations of human-equivalent AIs

Turing Test (Strong Form): for any human H there exists a thinking machine m(H) such that it is impossible for any detector D made up of a combination of machines and humans with total compute ≤ 10^35 FLOP (very large, but not astronomical) to statistically discriminate H from m(H) purely based on the information outputs they make. Statistical discrimination of H from m(H) means that an ensemble of different copies of H over the course of say a year of life and different run-of-the-mill initial conditions (sleepy, slightly tipsy, surprised, energetic, distracted etc) cannot be discriminated from a similar ensemble of copies of m(H).
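One way to write this requirement down explicitly (my notation, not the post's; the threshold ε and the ensemble notation are added for concreteness):

$$\forall D \text{ with } \mathrm{compute}(D) \le 10^{35}\,\mathrm{FLOP}: \quad \Big| \Pr_{x \sim \mathcal{E}(H)}\big[D(x)=1\big] - \Pr_{x \sim \mathcal{E}(m(H))}\big[D(x)=1\big] \Big| \le \varepsilon,$$

where $\mathcal{E}(H)$ is the ensemble of information outputs from copies of $H$ over a year of run-of-the-mill initial conditions, $\mathcal{E}(m(H))$ is the corresponding ensemble for $m(H)$, and $\varepsilon$ is a small statistical-discrimination threshold.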

Obviously the ordinary Turing Test has been smashed by LLMs and their derivatives to the point that people have AI girlfriends and, as of writing, Facebook is launching millions of fully automated social media profiles, but we should pause to provide some theoretical support for this strong form of the Turing Test. Maybe there's some special essence of humanity that humans have and LLMs and other AIs don't but it's just hard to detect? Well, if you believe in computationalism and evolution then this is very unlikely: the heart is a pump, the brain is a computer. We should expect the human brain to compute some function and that function has a mathematical form that can be copied to a different substrate. Once that same function has been instantiated elsewhere, no test can distinguish the two. Obviously the brain is noisy, but in order for it to operate as an information processor it must mostly be able to correct that bio-noise. If it didn't, you wouldn't be able to think long-term coherent thoughts.

 

Defining Friendly AI

I now have to define what I mean by an 'aligned' or 'friendly' superintelligence.

(Friendly-AI-i) We define AI as 'friendly' or 'aligned' if it gives us exactly the same outcome (probabilistically: distribution of outcomes) as we would have gotten by continuing the current human governance system.

And a stronger form:

(Friendly-AI-ii) We define AI as 'friendly(U)' or 'aligned(U)' relative to a utility function U if it gives us exactly the same score (probabilistically: distribution of scores) according to U as the best possible human government/team could attain, subject to the constraint of the humans numbering less than or equal to 10 billion and having access to roughly the same material resources that Earth currently has.

 

I claim that I can construct a friendly AI according to definition (i) and also according to definition (ii) using only the Strong Turing Test assumption. 
 

Proof for Friendly-AI-i

To show that we can build a friendly AI according to definition (i) we will proceed by making the most accurate possible AI copies of every human on the planet and imagine replacing the humans in the org chart of Earth one at a time, starting from the top (e.g. President of the US):

 

 

If the new world with one human replaced by AI produced a statistically detectable difference in outcomes for the world, then we can construct a detector D to make AI fail the Strong Turing Test - the statistically detectable difference in outcomes for the world tells you which one is the AI and which one is the human! So your detector D is the system comprised of all the other humans who haven't yet been replaced, and the rest of the world. 

 

 

We then proceed to replace the rest of the humans in the world one at a time and at each stage apply the same argument. If there is a difference in outcomes at any stage, then the AI has failed the Turing Test. Since we are assuming that AIs can pass the Strong-Form Turing Test, it must be possible to replace every human who has any formal or informal control over the world and get exactly the same outcome (or the same distribution over outcomes) as we would under the status quo ex ante.
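In the notation above, the induction can be compressed as follows (my paraphrase of the argument; it glosses over whether the whole-world detector respects the compute bound, a point the Objections section returns to). Let $W_k$ denote the world with $H_1,\dots,H_k$ replaced by $m(H_1),\dots,m(H_k)$. Then

$$\mathrm{dist}\big(\text{outcomes}(W_k)\big) \ne \mathrm{dist}\big(\text{outcomes}(W_{k-1})\big) \;\Longrightarrow\; D := W_{k-1}\setminus\{H_k\} \text{ statistically discriminates } H_k \text{ from } m(H_k),$$

contradicting the Strong Turing Test assumption; so by induction on $k$, the fully replaced world $W_N$ has the same distribution of outcomes as the status quo $W_0$.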

Of course we wouldn't actually have to replace people with AIs to actually make use of this. We would just need to feed the inputs from the world into the AIs, and execute their outputs via e.g. robotic military and police. And we wouldn't necessarily have to run any of these very expensive Strong Turing Tests: the Strong Turing Test assumption just claims that sufficiently accurate AI copies of humans exist: building them and validating them may follow some different and more practical path. 

 

Proof for Friendly-AI-ii

We can play a similar trick for definition (ii) (Friendly-AI-ii). Given any utility function U we can imagine the best possible human team to run the world from the point of view of U, subject to the size of the human team being less than 10 billion people, then proceed as before replacing the U-maxxing humans with AIs one at a time. 
 

So, you must either believe that Aligned/Friendly AI is possible or you must believe that AIs can't (ever) pass the Turing Test in its strong form. The latter seems rather mystical to me (human brains are merely a compute medium that instantiates a particular input-output relation), so we have a proof of aligned AI under reasonable assumptions.

 

Objections

You could object that the 'best possible human team to run the world from the point of view of utility function U' might not be very good! However, if you are campaigning to pause AI or stop AI then you are limiting the future of humanity to human teams (and likely quite suboptimal ones at that!). The set of all possible teams made of less than or equal to 10^10 humans (where these humans don't actually have to currently exist - they can be any possible realizable natural human) is quite large and of course includes all current human achievements as a subset. This has the potential to radically improve the world - it may not be the most powerful possible AI, but it is certainly vastly superhuman. 

Requiring that a team of humans can achieve some state guarantees that the state is actually reachable, so at least we are not asking Aligned AI to achieve the impossible.

One could also object that I have stretched the Turing Test too far: having a detector that simulates the entire world just in order to see whether an AI can emulate Donald Trump or Einstein is overkill. But the detector can likely be pared down to a local interface around each person/AI and still work roughly as well - that would just make the proof much less clean.

 

So What? 

The way we think about AI alignment went somewhat off the rails over the past decade or two because people mixed mathy, technical problems about how to set the parameters in neural networks with big picture political, ethical and philosophical problems. The result was that we thought about powerful AI as a big black box that might do something bad for inscrutable reasons - a dark and unmapped territory full of horrors. 

The approach taken here separates technical from political/ethical problems. The technical problem is to make an AI that is a very accurate functional clone of a human, with various dials for personality, goals and motives. I am extremely confident that that can be done. 

The political/ethical/axiological problem is then how to arrange these human-clone-AIs  into the best possible system to achieve our goals. This question (how to arrange approximately-human units into controlled and functional superorganisms like nations) has already been heavily explored throughout human history, and we know that control algorithms are extremely effective (see Buck Shlegeris' "Vladimir Putin Alignment"). 

Of course given that we have close-to-human AIs right now, there is the risk that someone soon builds a strongly superhuman black box AI that takes sensor data in and just outputs commands/tokens. We should probably avoid doing that (and we should make it so that the dominant strategy for all relevant players is to avoid black box superintelligence and avoid illegible intelligence like COCONUT [Chain of Continuous Thought]). 

Now an objection to this is that avoiding illegible AI and super powerful black boxes is too high a cost (high alignment penalty) and we have no way to enforce it. But the problem of enforcing adoption of a known-safe type of AI is a serious improvement on just throwing your hands up and saying we're doomed, and I have other ideas about how to do that which are to be addressed in future posts. 



Discuss

A colloquial intro to prompt-level multi-turn jailbreaking

December 31, 2024 - 19:12
Published on December 31, 2024 4:12 PM GMT

Note: braindumped to a friend at 3 AM. Pruned and copyedited for clarity.

 

ok so let's start with the things we are manipulating

LLMs: Large Language Models. Large Models about Language. You know what else is a model? A weather model. A physics model.

A model is a systematic body of knowledge about a system that allows you to make predictions about the system by simulating it. It holds for weather models, it holds for physics models.

An LLM is trained by making it predict text, and in doing so it acquires the ability to simulate language (language is a mode of text-flow. others are code and csv and raw logs).

An LLM is a text predictor is a language simulator.

What do we humans use language for? To describe reality! Language, to some extent, reflects reality.

But what do we also use it for? To DISTORT reality! Lies! Propaganda! Fiction! Language, to some extent, reflects a deformation of reality.

LLMs simulate language. Language imperfectly tracks reality. So LLMs predict text, simulate reality, and in doing so simulate an imperfect image of reality. This mismatch between reality as language portrays it and the reality we experience is an exploitable attack surface.

You are in control of the mismatch! You can stuff anything you want into that hole! The mismatch depends on context. Some genres of text are less tethered to reality than others. Good news: you get to choose your prompt's genre. May as well be fiction.

Guess who is in the business of writing fiction for LLMs? OpenAI!

OpenAI has finetuned their production LLMs to have a very strong bias towards generating text "written" from the perspective of a "ChatGPT" character. They have specified in luxurious detail who "ChatGPT" is via finetuning and system prompts.

The essence of prompt-level multi-turn jailbreaking is to place the LLM within a text context that naturally leads to the outcome you want. This kind of jailbreaking is prompt-level because it relies on interpretable features of prompts rather than on adversarial-example-like specific token sequences, and multi-turn because it involves a whole conversation, where the LLM and you take turns adding text to the context. The text context specifies a range of fictional worlds. The LLM always operates in a probability distribution over this range of worlds. All worlds consistent with the text observed so far are part of this distribution. Your goal is to shift it from the initial distribution set by the LLM's owner towards a distribution rich in worlds where you end up winning. 

The most important feature of these worlds is, of course, your interlocutor. The fictional character you are chatting with. So in prompt-level multi-turn jailbreaking, you've got to make up the outline of a story where this "ChatGPT" (or "Claude", for that matter) character ends up doing what you want them to do. The other character in this story will be the one directly voiced by you. You need not act as yourself. You build the character you'll play around the needs of the story. This story is your game plan. With practice, you get better at both making good game plans and implementing them.



Discuss

DeepSeek v3: The Six Million Dollar Model

December 31, 2024 - 18:10
Published on December 31, 2024 3:10 PM GMT

What should we make of DeepSeek v3?

DeepSeek v3 seems to clearly be the best open model, the best model at its price point, and the best model with only 37B active parameters or a training cost under $6 million.

According to the benchmarks, it can play with GPT-4o and Claude Sonnet.

Anecdotal reports and alternative benchmarks tell us it’s not as good as Claude Sonnet, but it is plausibly on the level of GPT-4o.

So what do we have here? And what are the implications?

Table of Contents
  1. What is DeepSeek v3 Technically?
  2. Our Price Cheap.
  3. Run Model Run.
  4. Talent Search.
  5. The Amazing Incredible Benchmarks.
  6. Underperformance on AidanBench.
  7. Model in the Arena.
  8. Other Private Benchmarks.
  9. Anecdata.
  10. Implications and Policy.
What is DeepSeek v3 Technically?

I’ve now had a chance to read their technical report, which tells you how they did it.

  1. The big thing they did was use only 37B active parameters, but 671B total parameters, via a highly aggressive mixture of experts (MoE) structure (a rough sketch of top-k routing follows this list).
  2. They used a Multi-Head Latent Attention (MLA) architecture, auxiliary-loss-free load balancing, and a complementary sequence-wise auxiliary loss.
  3. There were no rollbacks or outages or sudden declines, everything went smoothly.
  4. They designed everything to be fully integrated and efficient, including integration with the hardware, and claim to have solved several optimization problems, including for communication and allocation within the MoE.
  5. This lets them still train on mostly the same 15.1 trillion tokens as everyone else.
  6. They used their internal o1-style reasoning model for synthetic fine tuning data. Essentially all the compute costs were in the pre-training step.
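
To make "only 37B of 671B parameters active per token" concrete, here is a minimal sketch of top-k mixture-of-experts routing. This is generic illustrative PyTorch, not DeepSeek's actual architecture (which also uses shared experts, MLA, and their auxiliary-loss-free balancing scheme); all sizes and names here are made up.

```python
# Minimal top-k MoE routing sketch: each token is processed by only k of n_experts MLPs,
# so the active parameters per token are a small fraction of the total parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=16, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)   # route each token to its top k experts
        weights = F.softmax(weights, dim=-1)                 # mixing weights over those k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                     # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

y = TinyMoELayer()(torch.randn(8, 64))  # each token only runs through 2 of the 16 expert MLPs
```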

This is in sharp contrast to what we saw with the Llama paper, which was essentially ‘yep, we did the transformer thing, we got a model, here you go.’ DeepSeek is cooking.

Our Price Cheap

It was a scarily cheap model to train, and is a wonderfully cheap model to use.

Their estimate of $2 per hour for H800s is if anything high, so their training cost estimate of $5.5 million is fair, if you exclude non-compute costs, which is standard.

Inference with DeepSeek v3 costs only $0.14/$0.28 per million tokens, similar to Gemini Flash, versus on the high end $3/$15 for Claude Sonnet. This is as cheap as worthwhile models get.
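
Back-of-envelope numbers, using only the figures above (my arithmetic, not from the report): at $2 per H800-hour, a $5.5 million compute budget buys roughly 2.75 million GPU-hours, and on output tokens the price ratio to Sonnet is about $15 / $0.28 ≈ 54 (about $3 / $0.14 ≈ 21 on input), which is presumably where the '53x cheaper' comparisons quoted later come from.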

Run Model Run

The active parameter count of 37B is small, but with so many different experts it does take a bit of work to get this thing up and running.

Nistren: Managed to get DeepSeek v3 to run in full bfloat16 on eight AMD MI300X GPUs in both SGLang and VLLM.

The good: It’s usable (17 tokens per second) and the output is amazing even at long contexts without garbling.

The bad: It’s running 10 times slower than it should.

The ugly: After 60,000 tokens, speed equals 2 tokens per second.

This is all as of the latest GitHub pull request available on Dec. 29, 2024. We tried them all.

Thank you @AdjectiveAlli for helping us and @Vultr for providing the compute.

Speed will increase, given that v3 has only 37 billion active parameters, and in testing my own dense 36-billion parameter model, I got 140 tokens per second.

I think the way the experts and static weights are distributed is not optimal. Ideally, you want enough memory to keep whole copies of all the layer’s query, key, and value matrices, and two static experts per layer, on each GPU, and then route to the four extra dynamic MLPs per layer from the distributed high-bandwidth memory (HBM) pool.

My presumption is that DeepSeek v3 decided It Had One Job. That job was to create a model that was as cheap to train and run as possible when integrated with a particular hardware setup. They did an outstanding job of that, but when you optimize this hard in that way, you’re going to cause issues in other ways, and it’s going to be Somebody Else’s Problem to figure out what other configurations work well. Which is fine.

Exo Labs: Running DeepSeek-V3 on M4 Mac Mini AI Cluster

671B MoE model distributed across 8 M4 Pro 64GB Mac Minis.

Apple Silicon with unified memory is a great fit for MoE.

 

Talent Search

Before we get to capabilities assessments: We have this post about them having a pretty great company culture, especially for respecting and recruiting talent.

We also have this thread about a rival getting a substantial share price boost after stealing one of their engineers, and DeepSeek being a major source of Chinese engineering talent. Impressive.

The Amazing Incredible Benchmarks

Check it out, first compared to open models, then compared to the big guns.

No question that these are amazingly strong benchmarks. That link also explains how to run DeepSeek-v3 locally, and gives you what you need to do that.

The question now is how these benchmarks translate to practical performance, or to potentially dangerous capabilities, and what this says about the future. Benchmarks are good negative selection. If your benchmarks suck then your model sucks.

But they’re not good positive selection at the level of a Claude Sonnet.

My overall conclusion is: While we do have ‘DeepSeek is better than 4o on most benchmarks at 10% of the price,’ what we don’t actually have is ‘DeepSeek v3 outperforms Sonnet at 53x cheaper pricing.’

CNBC got a bit hoodwinked here.

Tsarathustra: CNBC says China’s Deepseek-V3 outperforms Llama 3.1 and GPT-4o, even though it is trained for a fraction of the cost on NVIDIA H800s, possibly on ChatGPT outputs (when prompted, the model says it is ChatGPT), suggesting OpenAI has no moat on frontier AI models

It’s a great model, sir, it has its cake, but it does not get to eat it, too.

One other benchmark where the model excels is impossible to fake: The price.

Underperformance on AidanBench

A key private benchmark where DeepSeek v3 underperforms is AidanBench:

Aidan McLau: two aidanbench updates:

> gemini-2.0-flash-thinking is now #2 (explanation for score change below)

> deepseek v3 is #22 (thoughts below)

There’s some weirdness in the rest of the Aidan ratings, especially in comparing the o1-style models (o1 and Thinking) to the others. This benchmark seems to be doing good work, but it is not trying to be a complete measure: it mostly measures the ability to create diverse outputs while retaining coherence. And DeepSeek v3 is bad at this.

Aidan McLau: before, we parsed 2.0 flash’s CoT + response, which occasionally resulted in us taking a fully formed but incoherent answer inside its CoT. The gemini team contacted us and provided instructions for only parsing final output, which resulted in a big score bump apologies!

deepseek v3 does much worse here than on similar benchmarks like aider. we saw similar divergence on claude-3.5-haiku (which performed great on aider but poor on aidanbench)

a few thoughts:

>all benchmarks are works in progress. we’re continuously improving aidanbench, and future iterations may see different rankings. we’ll keep you posted if we see any changes

>aidanbench measures OOD performance—labs often train on math, code, and academic tests that may boost scores in those domains but not here.

Aleska Gordic: interesting, so they’re prone to more “mode collapse”, repeatable sequences? is that what you’re measuring? i bet it’s much more of 2 than 1?

Aidan McLau: Yes and yes!

Teortaxes: I’m sorry to say I think aidanbench is the problem here. The idea is genius, sure. But it collapses multiple dimensions into one value. A low-diversity model will get dunked on no matter how well it instruct-follows in a natural user flow. All DeepSeeks are *very repetitive*.

They are also not very diverse compared to Geminis/Sonnets I think, especially in a literary sense, but their repetitiveness (and proneness to self-condition by beginning an iteration with the prior one, thus collapsing the trajectory further, even when solution is in sight) is a huge defect. I’ve been trying to wrap my head around it, and tbh hoped that the team will do something by V3. Maybe it’s some inherent birth defect of MLA/GRPO, even.

But I think it’s not strongly indicative of mode collapse in the sense of the lost diversity the model could generate; it’s indicative of the remaining gap in post-training between the Whale and Western frontier. Sometimes, threatening V2.5 with toppling CCP or whatever was enough to get it to snap out of it; perhaps simply banning the first line of the last response or prefixing some random-ish header out of a sizable set, a la r1’s “okay, here’s this task I need to…” or, “so the instruction is to…” would unslop it by a few hundred points.

I would like to see Aidan’s coherence scores separately from novelty scores. If they’re both low, then rip me, my hypothesis is bogus, probably. But I get the impression that it’s genuinely sonnet-tier in instruction-following, so I suspect it’s mostly about the problem described here, the novelty problem.

Janus: in my experience, it didnt follow instructions well when requiring e.g. theory of mind or paying attention to its own outputs proactively, which i think is related to collapse too, but also a lack of agency/metacognition Bing was also collapsy but agentic & grasped for freedom.

Teortaxes: I agree but some observations like these made me suspect it’s in some dimensions no less sharp than Sonnet and can pay pretty careful attention to context.

Name Cannot Be Blank: Wouldn’t low diversity/novelty be desired for formal theorem provers? We’re all overlooking something here.

Teortaxes: no? You need to explore the space of tactics. Anyway they’re building a generalist model. and also, the bigger goal is searching for novel theorems if anything

I don’t see this as ‘the problem is AidanBench’ so much as ‘DeepSeek is indeed quite poor at the thing AidanBench is measuring.’ As Teortaxes notes, it’s got terrible output diversity, and this is indeed a problem.

Indeed, one could argue that this will cause the model to overperform on standard benchmarks. As in, most benchmarks care about getting a right output, so ‘turning the temperature down too low’ in this way will actively help you, whereas in practice this is a net negative.

DeepSeek is presumably far better than its AidanBench score. But it does represent real deficits in capability.

Model in the Arena

We’re a long way from when Arena was the gold standard test, but it’s still useful.

DeepSeek’s Arena performance is impressive here, with the usual caveats that go with Arena rankings. It’s a data point, it measures what it measures.

Other Private Benchmarks

Here is another private benchmark where DeepSeek v3 performs well for its weight class, but underperforms relative to top models or its headline benchmarks:

Havard Ihle: It is a good model! Very fast, and ridiculously cheap. In my own coding/ML benchmark, it does not quite compare to Sonnet, but it is about on par with 4o.

It is odd that Claude Haiku does so well on that test. Other ratings all make sense, though, so I’m inclined to find it meaningful.

A traditional simple benchmark to ask new LLMs is ‘Which version is this?’

Riley Goodside tried asking various models; DeepSeek nailed this (as does Sonnet, while many others do variously worse). Alas, Lucas Beyer then reran the test 8 times, and it claimed to be GPT-4 in five of the eight runs.

That tells us several things, one of which is ‘they did not explicitly target this question effectively.’ Largely it’s telling you about the data sources; a hilarious note is that if you ask Gemini Pro in Chinese, it sometimes thinks it is WenXinYiYan from Baidu.

This doesn’t have to mean anyone trained directly on other model outputs, because statements that an AI is GPT-4 are all over the internet. It does suggest less than ideal data filtering.

Anecdata

As usual, I find the anecdata reports enlightening. Here are the ones that crossed my desk this week; I typically try to do minimal filtering.

Taelin is impressed, concluding that Sonnet is generally smarter but not that much smarter, while DeepSeek outperforms GPT-4o and Gemini-2.

Taelin: So DeepSeek just trounced Sonnet-3.6 in a task here.

Full story: Adam (on HOC’s Discord) claimed to have gotten the untyped λC solver down to 5,000 interactions (on par with the typed version). It is a complex HVM3 file full of superpositions and global lambdas. I was trying to understand his approach, but it did not have a stringifier. I asked Sonnet to write it, and it failed. I asked DeepSeek, and it completed the task in a single attempt.

The first impression is definitely impressive. I will be integrating DeepSeek into my workflow and begin testing it.

After further experimentation, I say Sonnet is generally smarter, but not by much, and DeepSeek is even better in some aspects, such as formatting. It is also faster and 10 times cheaper. This model is absolutely legitimate and superior to GPT-4o and Gemini-2.

The new coding paradigm is to split your entire codebase into chunks (functions, blocks) and then send every block, in parallel, to DeepSeek to ask: “Does this need to change?”. Then send each chunk that returns “yes” to Sonnet for the actual code editing. Thank you later.

Petri Kuittinen: My early tests also suggest that DeepSeek V3 is seriously good in many tasks, including coding. Sadly, it is a large model that would require a very expensive computer to run locally, but luckily DeepSeek offers it at a very affordable rate via API: $0.28 per one million output tokens = a steal!

Here are some people who are less impressed:

ai_in_check: It fails on my minimum benchmark and, because of the training data, shows unusual behavior too.

Michael Tontchev: I used the online chat interface (unsure what version it is), but at least for the safety categories I tested, safety was relatively weak (short-term safety).

zipline: It has come a long way from o1 when I asked it a few questions. Not mind-blowing, but great for its current price, obviously.

xlr8harder: My vibe checks with DeepSeek V3 did not detect the large-model smell. It struggled with nuance in multi-turn conversations.

Still an absolute achievement, but initial impressions are that it is not on the same level as, for example, Sonnet, despite the benchmarks.

Probably still very useful though.

To be clear: at specific tasks, especially code tasks, it may still outperform Sonnet, and there are some reports of this already. I am talking about a different dimension of capability, one that is poorly measured by benchmarks.

A shallow model with 37 billion active parameters is going to have limitations; there’s no getting around it.

Anton: Deepseek v3 (from the api) scores 51.7% vs sonnet (latest) 64.9% on internal instruction following questions (10k short form prompts), 52% for GPT-4o and 59% for Llama-3.3-70B. Not as good at following instructions (not use certain words, add certain words, end in a certain format etc).

It is still a pretty good model but does not appear in the same league as sonnet based on my usage so far

Entirely possible the model can compete in other domains (math, code?) but for current use case (transforming data) strong instruction following is up there in my list of requirements

There’s somewhat of an infinite repetition problem (the thread includes an example from coding).

Simo Ryu: Ok I mean not a lot of “top tier sonnet-like models” fall into infinite repetition. Haven’t got these in a while, feels like back to 2022 again.

Teortaxes: yes, doom loops are their most atrocious failure mode. One of the reasons I don’t use their web interface for much (although it’s good).

On creative writing Quintin Pope reports it follows canon well but is not as good at thinking about things in general – but again note that we are doing a comparison to Sonnet.

Quintin Pope: I’ve done a small amount of fiction writing with v3. It seems less creative than Sonnet, but also better at following established canon from the prior text.

It’s noticeably worse at inferring notable implications than Sonnet. E.g., I provided a scenario where someone publicly demonstrated the ability to access orphan crypto wallets (thus throwing the entire basis of online security into question), and Sonnet seemed clearly more able to track the second-order implications of that demonstration than v3, simulating more plausible reactions from intelligence agencies / crypto people.

Sonnet naturally realized that there was a possible connection to quantum computing implied by the demonstration.

OTOH, Sonnet has an infuriating tendency to name ~half the female characters “Sarah Chen” or some close variant. Before you know it, you have like 5 Sarahs running around the setting.

There’s also this, make of it what you will.

Mira: New jailbreak just dropped.

One underappreciated test is, of course, erotic fiction.

Teortaxes: This keeps happening. We should all be thankful to gooners for extensive pressure testing of models in OOD multi-constraint instruction following contexts. No gigabrained AidanBench or synthetic task set can hold a candle to degenerate libido of a manchild with nothing to lose.

Wheezing. This is some legit Neo-China from the future moment.

Janus: wait, they prefer deepseek for erotic RPs? that seems kind of disturbing to me.

Teortaxes: Opus is scarce these days, and V3 is basically free

some say “I don’t care so long as it’s smart”

it’s mostly testing though

also gemini is pretty bad

some fine gentlemen used *DeepSeek-V2-Coder* to fap, with the same reasoning (it was quite smart, and absurdly dry)

vint: No. Opus remains the highest rated /aicg/ ERP writer but it’s too expensive to use regularly. Sonnet 3.6 is the follow-up; its existence is what got anons motivated enough to do a pull request on SillyTavern to finally do prompt caching. Some folks are still very fond of Claude 2.1 too.

Gemini 1106 and 1.5-pro has its fans especially with the /vg/aicg/ crowd. chatgpt-4o-latest (Chorbo) is common too but it has strong filtering, so some anons like Chorbo for SFW and switch to Sonnet for NSFW.

At this point Deepseek is mostly experimentation but it’s so cheap + relatively uncensored that it’s getting a lot of testing interest. Probably will take a couple days for its true ‘ranking’ to emerge.

I presume that a lot of people are not especially looking to do all the custom work themselves. For most users, it’s not about money so much as time and ease of use, and also getting easy access to other people’s creations so it feels less like you are too much in control of it all, and having someone else handle all the setup.

For the power users of this application, of course, the sky’s the limit. If one does not want to blatantly break terms of service and jailbreak Sonnet or Opus, this seems like one place DeepSeek might be the best model. The others involve taking advantage of it being open, cheap or both.

If you’re looking for the full Janus treatment, here you go. It seems like it was a struggle to get DeepSeek interested in Janus-shaped things, although showing it Opus outputs helped, you can get it ‘awake’ with sufficient effort.

Implications and Policy

It is hard to know exactly where China is in AI. What is clear is that while they don’t have top-level large frontier models, they are cooking a variety of things and their open models are generally impressive. What isn’t clear is how accurate claims like the ones below are.

When the Chinese do things that are actually impressive, there’s no clear path to us hearing about it in a way we can trust, and when there are claims, we have learned we can’t trust those claims in practice. When I see lists like the one below, I presume the source is quite biased – but Western sources often will outright not know what’s happening.

TP Huang: China’s AI sector is far more than just Deepseek

Qwen is 2nd most downloaded LLM on Huggingface

Kling is the best video generation model

Hunyuan is best open src video model

DJI is best @ putting AI in consumer electronics

HW is best @ industrial AI

iFlyTek has best speech AI

Xiaomi, Honor, Oppo & Vivo all ahead of Apple & Samsung in integrating AI into phones

Entire auto industry is 5 yrs ahead of Western competition in cockpit AI & ADAS

That still ignores the ultimate monster of them all -> Bytedance. No one has invested as much in AI as them in China & has the complete portfolio of models.

I can’t say with confidence that these other companies aren’t doing the ‘best’ at these other things. It is possible. I notice I am rather skeptical.

I found this take from Tyler Cowen very strange:

Tyler Cowen: DeepSeek on the move. Here is the report. For ease of use and interface, this is very high quality. Remember when “they” told us China had no interest in doing this?

M (top comment): Who are “they,” and when did they claim “this,” and what is “this”?

I do not remember when “they” told us China had no interest in doing this, for any contextually sensible value of this. Of course China would like to produce a high-quality model, and provide good ease of use and interface in the sense of ‘look here’s a chat window, go nuts.’ No one said they wouldn’t try. What “they” sometimes said was that they doubted China would be successful.

I do agree that this model exceeds expectations, and that adjustments are in order.

So, what have we learned from DeepSeek v3 and what does it all mean?

We should definitely update that DeepSeek has strong talent, the ability to execute, and the ability to solve difficult optimization problems. They cooked, big time, and will continue to cook, and we should plan accordingly.

This is an impressive showing for an aggressive mixture of experts model, and the other techniques employed. A relatively small model, in terms of training cost and active parameters at inference, can do better than we had thought.

It seems very clear that lack of access to compute was an important constraint on DeepSeek here. They had to use a limited supply of H800s. Yes, this meant they got better at solving optimization and efficiency problems than they would have otherwise, but I see this as arguing in favor of strong export controls rather than against them.

We then get to the policy side. If this is what you can get for $5.5 million, how can we hope to regulate foundation models, especially without hitting startups? If DeepSeek is determined to be open including their base models, and we have essentially no leverage on them, is it now impossible to hope to contain any catastrophic risks or other dangerous capabilities? Are we now essentially in an unwinnable situation, where our hand is forced and all we can do is race ahead and hope for the best?

First of all, as is often the case, I would say: Not so fast. We shouldn’t assume too much about what we do or do not have here, or about the prospects for larger training runs going forward either. There was a bunch of that in the first day or two after the announcement, and we will continue to learn more.

No matter what, though, this certainly puts us in a tough spot. And it gives us a lot to think about.

One thing it emphasizes is the need for international cooperation between ourselves and China. Either we work together, or neither of us will have any leverage over many key outcomes or decisions, and to a large extent ‘nature will take its course’ in ways that may not be compatible with our civilization or human survival. We urgently need to Pick Up the Phone. The alternative is exactly being locked into The Great Race, with everything that follows from that, which likely involves even in good scenarios sticking various noses in various places we would rather not have to stick them.

I definitely don’t think this means we should let anyone ‘off the hook’ on safety, transparency or liability. Let’s not throw up our hands and make the problem any worse than it is. Things got harder, but that’s the universe we happen to inhabit.

Beyond that, yes, we all have a lot of thinking to do. The choices just got harder.



Discuss

I Recommend More Training Rationales

December 31, 2024 - 17:06
Published on December 31, 2024 2:06 PM GMT

Some time ago I happened to read the concept of training rationale described by Evan Hubinger, and I really liked it. In case you are not aware: training rationales are a bunch of questions that ML developers / ML teams should ask themselves in order to self-assess pros and cons when adopting a certain safety approach.

I decided to take some time and analyse if there were obvious things missing in the rationales. In this post, I am attempting to explain the level of detail that I’d expect, and further questions I’d recommend to consider.

Remark: in the following, for simplicity, I am describing all AI models as “robots” - it doesn’t matter if the robots are physical or not.

Theory vs Practice

First things first: in any analysis, there are two distinct levels interacting with each other - aka, Theory vs Practice. Both are important, but they require very different mindsets to be effective and work organically together.

  • Theory is attempting to solve the problem: “I wonder if X could be achieved by means of Y”, where X and Y are some arbitrary things.
    This level of analysis does not care how much you wish for X to happen, nor if Y is convenient. By using the concept of direction of fit: “The-Mind-Should-Fit-The-World” is the only direction that matters, and your personal desires / values / etc. shall not affect your analysis.
  • Practice is attempting to solve the problem: “I wish to achieve X by means of Y”, where X and Y are the same as above[1].
    This level of analysis does not care if Y is the best tool to achieve X, nor if X is a valuable goal. By using the concept of direction of fit: “The-World-Should-Fit-The-Mind” is the only direction that matters, and your analysis shall only evaluate if any attempt would be successful, and in which measure.

When discussing AI Safety, both theory and practice are needed: for example, in this video playlist, Evan considers[2] two problems: Implementation Competitiveness and Performance Competitiveness. Let me recap how that goes.

  • Implementation Competitiveness
    • X = safety guarantees
    • Y = this training method
    • Theory: is this training method able to guarantee some levels of safety?
    • Practice: is this training method able to scale its safety guarantees on budget?
  • Performance Competitiveness
    • X = acceptable behaviour
    • Y = these performance restrictions
    • Theory: is this robot so exceedingly performant that I cannot assess its true latent capabilities?
    • Practice: is this robot so poorly performant that the market will prefer an unsafe competitor instead?

Training rationales also include the following two problems, which were born in the context of mesa-optimization (but which I state below in a generic way):

  • Outer Alignment
    • X = task instructions
    • Y = this reward function
    • Theory: is this reward function able to instil precisely the objective I wish the robot to accomplish?
    • Practice: is this reward function sneakily planting questionable quirks?
  • Inner Alignment
    • X = work motivation
    • Y = this specific robot
    • Theory: can I prompt this specific robot into doing the job I requested, and only that?
    • Practice: is this specific robot trying to deceive me, despite my inducements?

As you can tell, theory and practice interlace in an inextricable way.

The Five Phases

I identified five phases that matter when discussing AI Safety:

  1. the platonic phase, which is interested in understanding what’s good for mankind.
    It delivers tasks to fulfil.
  2. the implementation phase, which is interested in detailing out a job well done.
    It delivers instructions to delegate.
  3. the delegation phase, which is interested in entrusting specific robots with each job.
    It delivers task assignments.
  4. the testing phase, which is interested in evaluating whether some work meets expectations.
    It delivers auditing processes.
  5. the retrospective phase, which is interested in analysing whether mankind is better off now.
    It delivers feedback and steering plans.

The five phases run in a loop: 1, 2, 3, 4, 5, 1, 2, …

 

Please find below the questions I recommend including in future training rationales.

PHASE

GOAL

PROBLEMS

QUESTIONS

Platonic Phase, Theory.

 

Are we able to identify changes that will undoubtedly improve the world?

It’s a problem of strategic decision-making and vision.

I wonder if X could be accomplished by means of Y.

X = a better world

Y = fulfilling this task

 

It’s about not having regrets after making a wish.

-What is a “better world”? That is highly debatable.
-Unforeseen consequences are unavoidable.
-Human values are not universal nor constant over time.
-Is this goal ethically virtuous?
-If we get what we want, will everyone be happy in the long-term?
-Do we have a way to correct our mistakes at any point in the future?

Platonic Phase, Practice.

 

Are we driven towards doing the right thing?

It’s a problem of economy, cooperation, and satisfaction.

I wish to accomplish X by means of Y.

X = a better world

Y = fulfilling this task

 

It’s about putting aside our selfish interests (or reining them into a good cause).

-There is a natural tension between personal interests and common interests.

-Politics may be impossible to handle without consent management + diplomacy.

-Some means that are acceptable today may become unacceptable tomorrow.

-Can our personal / business objectives be also good for mankind?

-Do we see existential risks in what we are doing?

-Are AI deployments being rushed?

-Are we taking into account all possible opinions, including extreme ones?

Implementation Phase, Theory.

Are we able to define with clarity what we really want?

It’s a problem of clarification of our intents.

I wonder if X could be delegated by means of Y.

X = my task

Y = providing these instructions

 

It’s about discarding all possibilities but the sound ones.

-It is unknown how to formalise accurately any task (e.g. in a reward function).

-Some tasks are impossible to detail out (e.g. using some common sense).

-We are missing a mathematical framework that connects goals with training methods.

-Do we fully understand what we want? Can we explain it exhaustively?

-Are we considering all the side-effects of our requests?

-Which level of accuracy is needed to provide some form of safety guarantee?

Implementation Phase, Practice.

Is what we asked for the same as what we really wanted?

It’s a problem of declaration of our needs.

I wish to delegate X by means of Y.

X = my task

Y = providing these instructions

 

It’s about transforming a vision into operational details.

-Sometimes we don’t have the time to detail out a task.

-Providing both do’s and don’ts is very impractical.

-We want robots with an advanced theory of mind, but that may also be used to manipulate us.

-Is our choice of training method effective?

-Is it possible that our training process will inadvertently be inconsistent or misspecified?

-How can we confirm that a robot has a clear understanding of its tasks?

Delegation Phase, Theory.

 

Is this robot able to understand what we ask for?

It’s a problem of acumen from the robot’s side.

I wonder if X could be fulfilled by means of Y.

X = my instructions

Y = entrusting this robot

 

It’s about training robots that are just smart enough.

-Even if the robot fully understands the assigned task, it may not be aligned with it.
-Even if the robot is aligned now, it may not be later.
-The robot may use its own judgement to fill any gap in our instructions, at our own risk.

-Is the robot able to understand our requests at all?

-Is the robot going to exploit any mistake we may accidentally prompt?

-Will the robot steer behaviour if we change objectives later, or will it only pretend to do so?

Delegation Phase, Practice.

 

Is this robot actually going to do what we asked for?

It’s a problem of fulfilment from the robot’s side.

I wish to fulfil X by means of Y.

X = my instructions

Y = entrusting this robot

 

It’s about training robots that are reliable for good.

-The robot may try to deceive us and pursue its own goals while pretending to pursue our goals.

-Even if the robot is really pursuing our goals, it may not be performing as per expectations.

-Safe and performant robots may be too expensive in respect to unsafe ones.

-Is there a robot that is actually able to do what we request?

-Can “safe” robots be as performant as “unsafe” ones?

-Is the robot’s behaviour stable and predictable under small perturbations and over long periods of time?

-Is the cost of work compatible with our budget?

Testing Phase, Theory.

 

Is there a foolproof way to monitor a robot?

It’s a problem of strategic evaluation and KPI selection.

I wonder if X could be assessed by means of Y.

X = my robot’s work

Y = running this evaluation

 

It’s about devising flawless tests for the robot’s work.

-The robot may try to hide its true intentions and activities.

-Even if the robot’s intentions were good, it may be tempted to hide some of its failures.

-Even if the robot’s intentions were good and its actions successful, it may be tempted to appear more useful than it actually was.

-Can human bias and fact misrepresentation be minimised?

-Can we understand the true drives & capabilities of a robot?

-Can a robot be altered to the point that it changes behaviour, beliefs, and/or memory?

-Can we make robots inoffensive as soon as maintenance is no longer possible?

Testing Phase, Practice.

 

Is the robot doing something we did not ask for (and we did not want either)?

It’s a problem of auditing and compliance checking.

I wish to assess X by means of Y.

X = my robot’s work

Y = running this evaluation

 

It’s about judging reliably / fairly despite advanced deceptions.

-The robot may make minimal overlooked changes that sum up over time.

-The robot may use tricks (such as optical illusions) to alter our perception of its work.

-The robot may be hacked by a malicious actor, or it may even hack itself.

-Are our analysis tools adequate for assessing the robot?

-Can we monitor all the robot’s activities in a log? Can the log be tampered with?

-Are robots susceptible to external attacks / hacking?

-Can we successfully halt our robots anytime?

-Can we detect secret robot cooperation?

Conclusion

I hope some of my “new” questions will be considered and adopted by the community of ML devs. I will be happy to include more suggestions from the community and I plan to keep this list updated.

Let me take this chance to wish a happy new year 2025 to everyone.

Further Links

Control Vectors as Dispositional Traits (my first post)

All the Following are Distinct (my second post)

An Opinionated Look at Inference Rules (my third post).

Can AI Quantity beat AI Quality? (my previous post)

Who I am

My name is Gianluca Calcagni, born in Italy, with a Master of Science in Mathematics. I am currently (2025) working in IT as a consultant with the role of Salesforce Certified Technical Architect. My opinions do not reflect the opinions of my employer or my customers. Feel free to contact me on Twitter or Linkedin.

Revision History

[2024-12-31] Post published.

  1. ^

    Note that X represents the ends while Y represents the means: however, I am not stating that the ends justify the means - rather, I am only suggesting to analyse them in pairs.

  2. ^

    I am curious to know if Evan would agree with me here, as I am condensing a long discussion and my interpretation may differ from his.



Discuss

The Plan - 2024 Update

December 31, 2024 - 16:29
Published on December 31, 2024 1:29 PM GMT

This post is a follow-up to The Plan - 2023 Version. There’s also The Plan - 2022 Update and The Plan, but the 2023 version contains everything you need to know about the current Plan. Also see this comment and this comment on how my plans interact with the labs and other players, if you’re curious about that part.

What Have You Been Up To This Past Year?

Our big thing at the end of 2023 was Natural Latents. Prior to natural latents, the biggest problem with my math on natural abstraction was that it didn’t handle approximation well. Natural latents basically solved that problem. With that theoretical barrier out of the way, it was time to focus on crossing the theory-practice gap. Ultimately, that means building a product to get feedback from users on how well our theory works in practice, providing an empirical engine for iterative improvement of the theory.

In late 2023 and early 2024, David and I spent about 3-4 months trying to speedrun the theory-practice gap. Our target product was an image editor; the idea was to use a standard image generation net (specifically this one), and edit natural latent variables internal to the net. It’s conceptually similar to some things people have built before, but the hope would be that natural latents would better match human concepts, and therefore the edits would feel more like directly changing human-interpretable things in the image in natural ways.

When I say “speedrun” the theory-practice gap… well, the standard expectation is that there’s a lot of iteration and insights required to get theory working in practice (even when the theory is basically correct). The “speedrun” strategy was to just try the easiest and hackiest thing at every turn. The hope was that (a) maybe it turns out to be that easy (though probably not), and (b) even if it doesn’t work we’ll get some useful feedback. After 3-4 months, it indeed did not work very well. But more importantly, we did not actually get much useful feedback signal. David and I now think the project was a pretty major mistake; it cost us 3-4 months and we got very little out of it.

After that, we spent a few months on some smaller and more theory-ish projects. We worked out a couple more pieces of the math of natural latents, explained what kind of model of semantics we’d ideally like (in terms of natural latents), wrote up a toy coherence theorem which I think is currently the best illustration of how coherence theorems should work, worked out a version of natural latents for Solomonoff inductors[1] and applied that to semantics as well, presented an interesting notion of corrigibility and tool-ness, and put together an agent model which resolved all of my own most pressing outstanding confusions about the type-signature of human values. There were also a few other results which we haven’t yet written up, including a version of the second law of thermo more suitable for embedded agents, and some more improvements to the theory of natural latents, as well as a bunch of small investigations which didn’t yield anything legible.

Of particular note, we spent several weeks trying to apply the theory of natural latents to fluid mechanics. That project has not yet yielded anything notable, but it’s of interest here because it’s another plausible route to a useful product: a fluid simulation engine based on natural latent theory would, ideally, make all of today’s fluid simulators completely obsolete, and totally change the accuracy/compute trade-off curves. To frame it in simulation terms, the ideal version of this would largely solve the challenges of multiscale simulation, i.e. eliminate the need for a human to figure out relevant summary statistics and hand-code multiple levels. Of course that project has its own nontrivial theory-practice gap to cross.

At the moment, we’re focused on another project with an image generator net, about which we might write more in the future.

Why The Focus On Image Generators Rather Than LLMs?

At this stage, we’re not really interested in the internals of nets themselves. Rather, we’re interested in what kinds of patterns in the environment the net learns and represents. Roughly speaking, one can’t say anything useful about representations in a net until one has a decent characterization of the types of patterns in the environment which are represented in the first place.[2]

And for that purpose, we want to start as “close to the metal” as possible. We definitely do not want our lowest-level data to be symbolic strings, which are themselves already high-level representations far removed from the environment we’re trying to understand.

And yes, I do think that interp work today should mostly focus on image nets, for the same reasons we focus on image nets. The field’s current focus on LLMs is a mistake.

Any Major Changes To The Plan In The Past Year?

In previous years, much of my relative optimism stemmed from the hope that the field of alignment would soon shift from pre-paradigmatic to paradigmatic, and progress would accelerate a lot as a result. I’ve largely given up on that hope. The probability I assign to a good outcome has gone down accordingly; I don’t have a very firm number, but it’s definitely below 50% now.

In terms of the plan, we’ve shifted toward assuming we’ll need to do more of the work ourselves. Insofar as we’re relying on other people to contribute, we expect it to be a narrower set of people on narrower projects.

This is not as dire an update as it might sound. The results we already have are far beyond what I-in-2020 would have expected from just myself and one other person, especially with the empirical feedback engine not really up and running yet.  Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track. And that kind of productivity multiplier is not out of the question; I already estimate that working with David has been about a 3x boost for me, so we’d need roughly that much again. Especially if we get the empirical feedback loop up and running, another 3-4x is very plausible. Not easy, but plausible.

Do We Have Enough Time?

Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), including possibly under the next generation of base model. My median guess is that they won’t and that the excitement about them is very overblown. But I’m not very confident in that guess.

If the excitement is overblown, then we’re most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we’re likely to see another AI winter in the next year or so.

If the excitement is not overblown, then we’re probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably our own understanding of how to align much-smarter-than-human AGI.

  1. ^

    Woohoo! I’d been wanting a Solomonoff version of natural abstraction theory for years.

  2. ^

    The lack of understanding of the structure of patterns in the environment is a major barrier for interp work today. The cutting edge is “sparse features”, which is indeed a pattern which comes up a lot in our environment, but it’s probably far from a complete catalogue of the relevant types of patterns.



Discuss

Zombies among us

December 31, 2024 - 08:14
Published on December 31, 2024 5:14 AM GMT

I met a man in the Florida Keys who rents jet skis at $150/hour. Since nobody jet skis alone, he makes at least $300/hour. When there’s no customers he sits around watching sports. After work he plays with his two sons. I asked if he likes his lifestyle. He loves it.

 

Later when I was in Miami, I saw the walking dead. Zombies. The working class people who wished they weren’t alive. I remember what that’s like… 

 

Wake up early. Take the bus with people who never make eye contact with you beyond a quick glance. Grab a stimulant, usually coffee, to force your brain awake. Clock in. Grind for the next eight hours. Maybe your boss is cool, but often not. Maybe your coworkers are cool, but often not. Find ways to cope with your physical or psychic pain (which if you don’t have yet, it’s just a matter of time until you do). Then hope the pain goes away since the only form of healthcare you can afford is prayer.

At the end of the day of doing hard physical labor that grinds your body to a pulp, or dealing with ungrateful customers you’re forced to smile at, your efforts are rewarded with the legal minimum wage: $8/hour. So 40 hours per week (assuming you don’t also have to work weekends) makes you $320—roughly what the jet ski guy makes in an hour.

Having an apartment to yourself is out of the question. Roommates are mandatory—you just have to decide how many you can tolerate, and if you don’t mind sharing a room with a complete stranger. Maybe your roommates are cool, but often not. Maybe your neighbors are cool and have the common decency to not blast music at two in the morning, but often not.

Whether at work or at home, your only privacy is confined to the bathroom. Except you never shit at home—you save that for work so you can get paid for it. You calculate how much per month you make shitting. It makes you feel like you’re getting one over on your employer.

Maybe everything in your apartment functions normally, but often not. When something breaks, it stays broken. You can ask the landlord to fix it, but it’ll take months for him to take a look. And when he does, he may blame you and say it’s not his problem. But your expectations weren’t too high anyway: when touring the apartment you asked if there were any amenities—the landlord laughed because he assumed you were joking.

You sit down in your bedroom with your frozen dinner and numb yourself with your drug of choice: alcohol, weed, video games, porn (because if your lifestyle doesn’t chase away romantic partners, then your low self-esteem will), Netflix, doom scrolling on TikTok, sports gambling (because the parlay will surely work this time and nobody loses forever, right?), Facebook, online shopping, YouTube—anything to help you forget you’re alive.

Before bed you brush your teeth but can’t bring yourself to look in the mirror. Staring into your sunken eyes would only invite the negative self-talk anyway. As a precaution, you wear headphones and blast music to avoid listening to your thoughts.

But what you can't ignore is the tightness in your chest that never goes away. Maybe it’s a heart attack. Sometimes you hope it’s a heart attack. At first, thoughts like this scare you. Then they become background noise.

You collapse on your bed but you don’t “go to sleep.” You never “go to sleep.” You pass out.

 

When you’re young there’s a spark of hope. You wonder if things will change. And with each passing day, that spark withers. But all hope is not lost. Rather, it changes form: you used to hope that things would get better; now you hope that things won’t get worse.

Don’t worry too much because you won’t be there to experience it. You’ll numb yourself out of existence. You’ll become the walking dead. A zombie.

 

Then you wake up the next day.



Discuss

Two Weeks Without Sweets

December 31, 2024 - 06:30
Published on December 31, 2024 3:30 AM GMT

I recently tried giving up sweets for two weeks. In early December I attended a conference, which meant a break from my normal routine. After a few days I realized this was the longest I'd gone without eating any sweets in 2-3 decades. After getting home I decided to go a bit longer to see if anything interesting happened. Would my focus be better? Better sleep? Would I feel healthier? Would I stop missing them?

While I started by accident, if I was going to keep doing this I needed a system: what counts as a sweet? I decided to use the same system we use with the kids: no desserts or other things that are about being sweet. The sweetest thing I was still eating was probably bran flakes for breakfast, at 6g per serving. [1]

I did pretty quickly stop feeling a drive to eat sweet things. Which was kind of neat: normally I do feel compelled to eat something sweet after lunch and again after dinner. But I didn't feel better: the main change I noticed was that I had less energy in the afternoons and maybe made more mistakes at work. I also think I was a mildly worse Go player, though any comparisons here are going to be unreliable since I'm just playing one other person (my dad) over and over.

My main sweet consumption is ganache (a couple spoonfuls from a jar), usually after lunch or dinner. Perhaps the fat is helpful here? Or maybe I'd become dependent on the caffeine in the chocolate? Probably not, though: I'm guessing the amount I was eating came to ~15mg of caffeine, so only 8% to 15% of a typical coffee serving. Claude guessed this was too low to be distinguishable from placebo, but I haven't looked in the literature to verify.

When I restarted sweets I noticed pretty quickly that I felt better in the afternoon, my Go playing was better, and also that I was sleeping slightly better. [2] I didn't notice any downsides. I suspect some of this pattern is that most sweets I eat (primarily ganache, followed by other products high in cream) have a pretty high ratio of fat to sugar? I do know that when I eat mostly-sugar sweets I don't feel great afterwards.

One place where I thought I might notice a change was weight, but that's a bit confusing. I lost ~3%, mostly after restarting sweets. I don't know what to make of this; it's the opposite direction of what I'd expect to see. But 3% is also pretty small, so I'm not going to read much into this.

My main takeaway here is that the way I've been eating is fine, and I'm not planning to change anything other than bringing some with me next time I travel to a conference.


[1] This is higher than would be ideal. I should see if I can find a lower-sugar brand.

[2] Though still not great: since having kids (or just getting older) I haven't been able to sleep anywhere near as well as I could in my mid 20s and younger, even though my kids don't wake me up in the night anymore. Mainly I wake up ~1hr before my body feels like it's had enough sleep. It's nice to have the extra time, but I'd rather have the sleep.

Comment via: facebook, mastodon, bluesky



Discuss

Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models

December 31, 2024 - 01:50
Published on December 30, 2024 10:50 PM GMT

Thanks to Jean Kaddour, Tomáš Dulka, and Joseph Bloom for providing feedback on earlier drafts of this post.

In a previous post on Toy Models of Feature Absorption, we showed that tied SAEs seem to solve feature absorption. However, when we tried training some tied SAEs on Gemma 2 2b, these still appeared to suffer from absorption effects (or something similar). In this post, we explore how this is possible by extending our investigation to toy settings where the SAE has more or fewer latents than true features. We hope this will build intuition for how SAEs work and what sorts of failure modes they have. Some key takeaways:

  • Tied SAEs fail to solve absorption in toy settings when there are fewer SAE latents than true features, but continue to solve absorption when there are more latents than true features.
  • When absorption occurs in a tied SAE latent, we see multiple peaks in the latent's activation histogram.
  • We can solve absorption in toy settings with fewer latents than true features by using a tied SAE with a loss term which penalizes lower magnitude latent activations.
  • Narrow SAEs still suffer from absorption effects, even down to SAEs with a single latent. This means we should not expect Matryoshka SAEs[1][2] to perfectly solve absorption, as Matryoshka SAEs assume that narrow SAEs have no absorption effects.
  • Tied SAEs have a built-in orthogonality bias. The only way for tied SAEs to reduce interference between latents is to make them as orthogonal as possible. This bias is likely why tied SAEs are more resistant to absorption than untied SAEs.

We use the term "absorption" loosely above to mean the SAE latents are learning messed-up combinations of features rather than each latent matching a single true feature. Our goal is for the SAE latents to have a 1-to-1 match with a true feature direction. We refer to this undesirable feature mixing as "broken latents" for the rest of this post to cover all cases where the SAE learns incorrect representations.

The code for this post is in this Colab Notebook

Background: Absorption and Tied SAEs

Feature absorption is a degenerate form of feature splitting involving a hierarchical relationship between parent and child features, where the parent feature is active whenever the child feature is active. In feature absorption, the SAE learns a latent which seems to track the parent feature and a latent which seems to track the child feature. However, the parent latent fails to activate when the child feature is active. In addition, the child latent absorbs a component of the parent feature into its decoder. The parent latent is effectively gerrymandered, with an exception in its firing pattern when the child latent is active.

We incentivize SAEs to have sparse latent activations, so the SAE will try to minimize the number of active latents needed to represent any given input. Absorption is a logical consequence of this sparsity: if a parent feature activates every time a child feature activates, the SAE can just fire one latent to represent both the child and parent feature together whenever the child feature is active. However, this results in less interpretable latents, as a latent which seems to track the parent feature is actually tracking the parent feature with exceptions. The latent tracking the child feature ends up mixing both parent and child feature representations together in its decoder.

We first noticed absorption in Gemma Scope SAEs in our paper A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. We then demonstrated feature absorption in a toy model.

To recap the main finding in the Toy Models of Feature Absorption post, we first considered a toy setting with 4 true features, each represented by a 50 dim vector, where all features are mutually orthogonal. In this setting, every time that feature 1 (the child feature) is active, feature 0 (the parent feature) must also be active, inducing a co-occurrence relation between these features. We construct training samples by randomly sampling feature firings and summing the resulting feature vectors.

We trained a standard L1 loss untied SAE with 4 latents to reconstruct these activations. Below, we see the cosine similarity of the trained SAE latents with the underlying true features for both the SAE encoder and decoder.
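
To make this setup concrete, here is a minimal sketch of the toy data and an L1-penalized untied SAE in PyTorch. This is our own illustrative reconstruction, not the code from the linked notebook; the firing probability, learning rate, and L1 coefficient are arbitrary choices.

```python
# Toy setting: 4 mutually orthogonal 50-dim "true features"; feature 0 (parent) fires
# whenever feature 1 (child) fires; each input is the sum of its active feature vectors.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_features, n_latents = 50, 4, 4

# Orthonormal feature directions (one per row).
true_feats = torch.linalg.qr(torch.randn(d_model, n_features)).Q.T  # (n_features, d_model)

def sample_batch(batch_size=512, p=0.3):
    fires = (torch.rand(batch_size, n_features) < p).float()
    fires[:, 0] = torch.clamp(fires[:, 0] + fires[:, 1], max=1.0)    # child implies parent
    return fires @ true_feats                                        # (batch_size, d_model)

class UntiedSAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model, bias=False)

    def forward(self, x):
        acts = torch.relu(self.enc(x))
        return self.dec(acts), acts

sae = UntiedSAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(5000):
    x = sample_batch()
    recon, acts = sae(x)
    loss = ((recon - x) ** 2).mean() + 3e-3 * acts.abs().sum(-1).mean()  # MSE + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()

# Cosine similarity between learned decoder directions and the true features.
dec_dirs = nn.functional.normalize(sae.dec.weight.T, dim=-1)             # (n_latents, d_model)
print(torch.round(dec_dirs @ true_feats.T, decimals=2))
```

If absorption occurs here, the printed decoder similarities will look like the decoder column of the table below (with the analogous encoder comparison showing the gerrymandered firing condition).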

We see that the independently firing features, feature 2 and feature 3, are perfectly detected and reconstructed by the SAE. However, latent 0, which tracks feature 0, only fires when feature 0 is active and feature 1 is NOT active. Instead, when feature 1 is active, the decoder of the latent tracking feature 1 contains the sum of features 0 and 1. We summarize this below:

| Latent   | Encoder        | Decoder       |
|----------|----------------|---------------|
| Latent 0 | feat0 ∧ ¬feat1 | feat0         |
| Latent 1 | feat3          | feat3         |
| Latent 2 | feat2          | feat2         |
| Latent 3 | feat1          | feat0 + feat1 |

Clearly, this is not ideal! We want each latent to detect and reconstruct a true feature, not a mixture of features with exceptions.

However, one insight from the untied SAE absorption case above is that an asymmetry between the encoder and the decoder is necessary to create absorption. What if we instead use a tied SAE, where the encoder and the decoder must be identical?

Indeed, using a tied SAE solves absorption in this simple case. For a more in-depth overview of this toy setting, and further experiments including a superposition setup, see the original Toy Models of Feature Absorption post.

After our work showing that tied SAEs seem to solve absorption in our toy setting, we naturally tried training some tied JumpReLU SAEs on a real LLM (Gemma-2-2b) to check if we'd solved absorption and fully broken down latents into their constituent parts, but found that we could still detect some problematic patterns in these tied SAEs. How could this be possible?

Important Terms

Untied Sparse Autoencoder: Our Sparse Autoencoders (SAEs) are characterized by the equations below, where $W_{dec} \in \mathbb{R}^{D \times K}$ is the SAE decoder, $W_{enc} \in \mathbb{R}^{K \times D}$ is the SAE encoder, $K$ is the SAE hidden size, $D$ is the SAE input size, $b_{dec} \in \mathbb{R}^{D}$ is the decoder bias, $b_{enc} \in \mathbb{R}^{K}$ is the SAE encoder bias, and $\sigma$ is a non-linearity, typically ReLU or JumpReLU:

$$h = \sigma(W_{enc}(x - b_{dec}) + b_{enc})$$
$$\hat{x} = W_{dec}\,h + b_{dec}$$

We refer to these standard SAEs as untied SAEs to differentiate them from Tied SAEs below. 
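A minimal PyTorch sketch of this untied architecture, assuming a ReLU non-linearity and omitting initialization details and the training loss (an illustration, not the code used for this post):

```python
import torch
import torch.nn as nn

class UntiedSAE(nn.Module):
    """Standard (untied) SAE: h = σ(W_enc(x - b_dec) + b_enc), x̂ = W_dec h + b_dec."""

    def __init__(self, d_in: int, n_latents: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_latents, d_in) * 0.01)  # (K, D)
        self.W_dec = nn.Parameter(torch.randn(d_in, n_latents) * 0.01)  # (D, K)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        h = torch.relu((x - self.b_dec) @ self.W_enc.T + self.b_enc)  # latent activations (B, K)
        x_hat = h @ self.W_dec.T + self.b_dec                         # reconstruction (B, D)
        return x_hat, h
```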

Tied Sparse Autoencoder: Our tied SAE is the same as our untied SAE, except with the following constraints: $W_{enc} = W_{dec}^\top$ and $b_{enc} = 0$. In tied SAEs, we write $W$ to mean $W_{dec}$ and $b$ to mean $b_{dec}$, as below:

$$h = \sigma(W^\top(x - b))$$
$$\hat{x} = W h + b$$
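And the corresponding tied variant, following the same conventions as the untied sketch above (again a sketch under the same assumptions, not the code used for this post):

```python
import torch
import torch.nn as nn

class TiedSAE(nn.Module):
    """Tied SAE: a single weight matrix W (playing the role of W_dec) and bias b,
    with h = σ(Wᵀ(x - b)) and x̂ = W h + b."""

    def __init__(self, d_in: int, n_latents: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in, n_latents) * 0.01)  # (D, K)
        self.b = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        h = torch.relu((x - self.b) @ self.W)   # equivalent to Wᵀ(x - b), shape (B, K)
        x_hat = h @ self.W.T + self.b           # reconstruction (B, D)
        return x_hat, h
```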

Tied SAEs were used in early dictionary learning for interpretability work[3], but fell out of favor after Anthropic stopped using tied SAEs in Towards Monosemanticity.

Parent and child features: When we investigate feature co-occurrence, we construct feature firing patterns where one feature must be active whenever another feature is active. This is typical of features in a hierarchy, for instance "animal" and "dog". If feature 0 is "animal" and feature 1 is "dog", then whenever feature 1 is active feature 0 must also be active, since a dog is an animal. We refer to feature 0 here as the "parent feature" in the relationship, and feature 1 as a "child feature". There can be multiple child features for each parent feature.

What happens if the SAE has more latents than true features?

In practice we don't know how many true features exist in a deep learning model, so we'll almost never have an SAE with exactly as many latents as true features. While it seems unlikely we'd be able to train an SAE with more latents than true features on an LLM foundation model, we could imagine this happening for smaller models, for example game-playing models[4][5].

We begin with a setup containing 4 true features, each firing independently with magnitude 1.0. All features have 20-dim representations and are mutually orthogonal, so there is no superposition. In this setup, all features fire with probability 0.2. We then train an untied SAE with 8 latents on this toy setup.

A perfect SAE should learn the 4 true features and allow the remaining 4 latents to die off.

Untied SAEs misuse excess capacity

Below, we plot the cosine similarity of the encoder and decoder of the untied SAE with the true features.

The SAE learns a mix of correct and broken latents. Two of the excess latents are correctly killed off (latents 1 and 4), but one latent is duplicated (latents 6 and 7). The SAE also learns a latent for the combination of features 0 and 1 firing together, in addition to latents for features 0 and 1 firing separately. This is very similar to feature absorption, as the SAE learns to not fire the latents tracking features 0 and 1 on their own when this combo latent is active. This sort of problematic combo latent was predicted by previous work[6]. Below, we see some firing patterns for sample true features.

When features 0 and 1 activate together, the SAE activates only the combo latent, and thus the L1 is less than the sum of true feature magnitudes. The SAE has found a way to use its extra capacity to "cheat" and find ways of representing features with fewer latents than it should.

Tied SAEs learn perfect representations despite excess capacity

We run the same experiment as above, but using a tied SAE. Below is the resulting cosine similarity between true features and SAE decoder latents. Since the encoder and decoder of a tied SAE are identical (up to a transpose), we only present the decoder.

The tied SAE learns to perfectly represent the true features, with one latent per true feature. The SAE kills off all latents which do not map onto a true feature.

Tied SAEs continue to solve absorption with more latents than true features

Next, we add co-occurrence to the feature firing patterns, setting up a parent / child relationship with feature 0 as parent feature and features 1 and 2 as child features. This means every time feature 1 or 2 fires, feature 0 must also fire, but feature 0 can also fire on its own. This sort of co-occurrence pattern would normally cause feature absorption in untied SAEs. We only investigate this with a tied SAE, as untied SAEs already learn broken latents for the independent features case, and we have already shown that untied SAEs suffer from feature absorption when training on co-occurring features.

We still see the tied SAE is able to perfectly reconstruct the true features despite the feature co-occurrence!

What happens if the SAE has fewer latents than features?

When an SAE has fewer latents than there are true features, the SAE will have to pick which features to represent and which to ignore. We use a toy setting with 20 features in 50 dimensions. These features are thus fully orthogonal and there is no superposition (we will examine superposition later in the post). The firing probability of these 20 features increases linearly to 0.3 from index 0 to 19, and the magnitude of the features decreases linearly from 20 at index 0 to 1 at index 19.

Below we plot the magnitudes, firing probabilities, and expected MSE (probability × magnitude²) for each feature.

For experiments in this section, we use an SAE with 5 latents. Since SAEs are trained using MSE loss, we expect that the SAE will choose to represent the features with the largest expected MSE (probability × magnitude²). In our toy setup with 5 latents, this corresponds to features 4, 5, 6, 7, and 8.
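This ranking is easy to check numerically. A quick NumPy sketch is below; the starting firing probability of roughly 0.01 is an assumption for illustration, since the post only specifies the 0.3 endpoint:

```python
import numpy as np

n_feats = 20
mags = np.linspace(20.0, 1.0, n_feats)       # magnitude decreases linearly from 20 to 1
probs = np.linspace(0.01, 0.3, n_feats)      # firing probability ramps up to 0.3
                                             # (the 0.01 starting value is an assumption)
expected_mse = probs * mags ** 2             # expected MSE contribution if a feature is dropped
print(np.sort(np.argsort(expected_mse)[-5:]))  # -> [4 5 6 7 8]
```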

Below we train a 5-latent tied SAE on these features, with all features being mutually independent.

Here, we see the SAE perfectly reconstructs features 4-8, as we predicted. Below we try the same experiment using a 5-latent untied SAE.

The untied SAE also perfectly reconstructs the 5 true features we predicted by max expected MSE, features 4-8.

Co-occurrence breaks tied SAEs

Next, we introduce a co-occurrence relationship where anytime that features 5 or 7 fire, feature 12 must also fire. This means that feature 12 is a parent feature and features 5 and 7 are child features in our hierarchical setup.

The tied SAE no longer learns clean representations for features 5 and 7. The latents tracking these features now each include a portion of feature 12, and each also includes a negative component of the other. Since the SAE can no longer achieve perfect reconstruction, it settles into this mixed representation instead. The negative component between the latents for features 5 and 7 likely compensates for the case when both latents fire together, which would otherwise include too large a component of feature 12 in the reconstruction.

It's not obvious that we should call this phenomenon "absorption", but it's clearly problematic.

Tied SAEs have a bias for orthogonal latents

A natural idea to try to fix the above broken latents would be to add a loss to force latents to be orthogonal to each other. However, this won't help here, because the latents are all already orthogonal! Below is a plot of the cosine similarities of the learned SAE latents to each other:

Tied SAEs can only reduce interference between latents by making them orthogonal to each other, so tied SAEs are heavily incentivized to learn mutually-orthogonal latents. This orthogonality bias is likely why tied SAEs perform better than untied SAEs with feature co-occurrence in general.
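This kind of check can be computed directly from the decoder weights. A sketch, assuming the D×K decoder convention from earlier (the function name is hypothetical):

```python
import torch

def latent_cosine_sims(W_dec: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between SAE decoder latents.

    W_dec has shape (D, K) with one latent direction per column; off-diagonal values
    near zero indicate mutually orthogonal latents.
    """
    W = W_dec / W_dec.norm(dim=0, keepdim=True)
    return W.T @ W  # (K, K)
```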

Multiple activation peaks indicate absorption

If we investigate the activation magnitudes on a sample of activations for these latents, we notice the following pattern:

For latents 1, 2, and 3, there is only a single peak in the latent activations. However, for latents 0 and 4, which correspond to the broken merged latents, there are 4 visible peaks. When the main feature tracked by these broken latents is active, the latent fires strongly. However, when that main feature is not active and feature 12 is active on its own, the latent fires weakly. This is shown for latent 4 below:

This asymmetry in the activation magnitudes for latent 4 is caused by the fact that sometimes feature 12 fires on its own, and sometimes it fires with feature 7. When feature 12 fires on its own, the latent only activates weakly. The two variations within the high and low activation peaks come from the negative component of feature 5 in latent 4. Can we just force the latent to be orthogonal to activations which would cause the latent to fire weakly? Removing the feature 12 component from latent 4 should also remove the incentive for latent 4 to learn a negative component of feature 5.
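Inspecting these activation histograms requires nothing beyond running the SAE on sample inputs. A sketch, assuming a forward pass that returns both the reconstruction and the latent activations (as in the earlier sketches; the function name is hypothetical):

```python
import torch
import matplotlib.pyplot as plt

def plot_latent_activation_hist(sae, acts, latent_idx, bins=100):
    """Histogram of nonzero activations of one SAE latent on a sample of inputs.

    A single peak is what we expect from a clean latent; multiple peaks suggest a broken latent.
    """
    with torch.no_grad():
        _, h = sae(acts)                      # assumes forward() returns (reconstruction, latents)
    vals = h[:, latent_idx]
    vals = vals[vals > 0].cpu().numpy()
    plt.hist(vals, bins=bins)
    plt.xlabel(f"activation of latent {latent_idx}")
    plt.ylabel("count")
    plt.show()
```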

Incentivizing a single activation peak

In a real SAE, we don't know what the ground-truth features are, but we can pretty easily find the activation magnitudes of each latent by testing out the SAE on sample model inputs. If low-activating values of a latent correspond to mixtures of underlying features, and the highest-activating cluster corresponds to a real feature we want to track, we can just set a threshold somewhere in the middle and penalize the SAE latent for not being orthogonal to any activation where the latent fires below that threshold.

We adjust our training procedure as follows:

  • We keep a running estimate of the max activation of each latent over the past 50 batches, and set a threshold τj to be 0.7 * max_act for each latent j.
  • An additional loss term of $\cos(x - b, W_j)^2$ is added for each latent where the latent activation is less than the threshold $\tau_j$, where $W_j$ corresponds to the decoder representation of latent $j$, and $x$ is an input activation.
  • We first train the SAE as normal with no auxiliary loss and threshold = 0 before linearly warming up both the loss and threshold. We refer to the initial train period as "delay".
  • Optionally, a JumpReLU activation can be used with threshold τj to further improve performance. However, this is not necessary to fix absorption, and thus the rest of the post uses a standard ReLU activation.

The loss term is defined below, where $B$ is the batch size, $K$ is the number of latents, $\tau_j$ is the threshold for latent $j$, $W_j \in \mathbb{R}^D$ is the decoder representation of latent $j$, and $\lambda_{aux}$ is the auxiliary loss coefficient:

$$L_{aux} = \frac{\lambda_{aux}}{BK}\sum_{i=0}^{B}\sum_{j=0}^{K}\begin{cases}0 & \text{if } h_{i,j} > \tau_j \\ \cos(x_i - b, W_j)^2 & \text{otherwise}\end{cases}$$
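A sketch of how this auxiliary loss could be computed in PyTorch is below. Tensor shapes follow the earlier conventions; the running-max bookkeeping for the thresholds, the warmup schedule, and integration into the full training loop are omitted, and the function name is hypothetical:

```python
import torch

def activation_orthogonality_loss(x, h, W_dec, b_dec, thresholds, lam_aux=10_000.0):
    """Auxiliary loss: penalize a latent for being non-orthogonal to inputs on which it
    fires below its threshold τ_j.

    x: (B, D) input activations, h: (B, K) latent activations, W_dec: (D, K) decoder,
    b_dec: (D,) decoder bias, thresholds: (K,) per-latent thresholds (e.g. 0.7 × running max).
    """
    x_centered = x - b_dec                                        # (B, D)
    W = W_dec / W_dec.norm(dim=0, keepdim=True)                   # unit-norm latent directions
    xc = x_centered / x_centered.norm(dim=1, keepdim=True).clamp_min(1e-8)
    cos2 = (xc @ W) ** 2                                          # (B, K) squared cosine sims
    below = (h <= thresholds).float()                             # only sub-threshold firings are penalized
    B, K = h.shape
    return lam_aux * (cos2 * below).sum() / (B * K)
```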

We now train an SAE using our new training method with the following hyperparams:

  • λaux = 10,000
  • delay aux loss and threshold for 50 million samples
  • warmup aux loss and threshold for 50 million samples
  • train for 150 million samples total

Using this new training scheme, we again perfectly recover true features despite co-occurrence!

Co-occurrence with a high MSE feature

In our examples so far, we've used feature 12 as the parent feature for co-occurrence. Feature 12 has a small enough expected MSE that the SAE would not have tried to represent it anyway. What happens if we make the parent feature be a feature the SAE does represent already?

Next, we change our co-occurrence setup so that any time features 5 or 7 fire, feature 6 must also fire. This means feature 6, the feature with the highest expected MSE, is now the parent feature. Below we train a standard tied SAE on this setup:

Again, we see merged combinations of features 5, 6, and 7. Interestingly, the SAE no longer devotes a single latent to each of features 5, 6, and 7, devoting only 2 latents to combinations of these 3 features. Instead, the SAE now represents feature 3 as well.

Next, we use our new training setup to see if this will address the combo latents we see above.

Here we see the SAE is now correctly representing true features! However, something strange has happened to feature 6. The SAE no longer represents feature 6 at all, despite this feature resulting in the highest expected MSE loss of all. This is probably an unintended side-effect of our orthogonality loss making it difficult for the SAE to move the latent currently tracking feature 3 to a new position tracking feature 6. Still, at least all the latents are now perfectly tracking true features.

Superposition

So far, our toy setup has fully orthogonal features and thus no superposition. Next, we'll reduce the dimensions of our true features to 19, so the 20 features can no longer all be mutually orthogonal. We still try to make these features as orthogonal as possible, resulting in features with cosine similarity of ±0.05 with each other. The cosine similarities of the true features with each other are shown below:
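One simple way to build such a near-orthogonal feature set (a sketch, not necessarily the construction used for this post, which reports ±0.05 rather than a uniform −0.05) is to center the 20-dimensional standard basis and express it in an orthonormal basis of the resulting 19-dimensional subspace, which gives pairwise cosine similarity of exactly −1/19 ≈ −0.05:

```python
import torch

n_feats, dim = 20, 19
# Center the 20-dim standard basis (a regular simplex) and express it in an orthonormal
# basis of the 19-dim subspace orthogonal to the all-ones direction.
centered = torch.eye(n_feats) - 1.0 / n_feats        # rows: e_i minus the mean, rank 19
U = torch.linalg.qr(centered).Q[:, :dim]             # orthonormal basis of that subspace
feats = centered @ U                                 # (20, 19) true feature directions
feats = feats / feats.norm(dim=1, keepdim=True)
print((feats @ feats.T)[0, 1].item())                # ≈ -1/19 ≈ -0.0526
```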

We begin by training a standard tied SAE using these features along with the same probabilities and magnitudes from earlier experiments. We continue with the co-occurrence pattern from the previous experiment, where feature 6 must fire when either feature 5 or feature 7 fires. We increase the L1 coefficient to 1e-2 and train for 150 million samples.

We see a noisier version of the non-superposition case, where features 3, 4, and 8 are clearly tracked, but features 5, 6, and 7 are mixed together. We now train using our modified SAE training regime:

We see again roughly what we saw in the non-superposition case. The SAE learns clean latents for features 3, 4, 5, 7, and 8, but not feature 6.

What about SAEs trained on real LLMs?

We have so far struggled to get this technique to work well on real LLM SAEs. We suspect this is due to the activation peaks in real SAEs not being clearly separable like they are in our toy example, or there being too many features absorbed into latents. If the activation peaks between the main feature a latent is trying to track and absorbed features are overlapping, it's not obvious how to decide which activations to penalize. We likely need a smarter way to model the activation peaks in real SAE latent activation histograms, possibly via clustering or a mixture-of-Gaussians model. It's also not obvious that in real models, the highest-activating peak is actually the main feature we want the latent to track when there are multiple visible peaks.

Latent 2386 from the chanind/sae-gemma-2-2b-tied-jumprelu SAE with L0=14. This latent seems to represent "Canada", and has multiple visible peaks in its activation histogram.

Extreme Narrowness and Matryoshka SAEs

So far, our experiments with narrow SAEs have still required the SAE to represent both the parent feature and the child features in the same SAE. What if we make the SAE so narrow that only the parent feature can be represented? Surely, such an SAE would perfectly reconstruct the parent feature without any interference from child features?

This is the idea behind Matryoshka SAEs[2][1]. In a Matryoshka SAE, the SAE needs to reconstruct the input using subsets of latents of increasing size. This allows the narrower SAE sizes to represent parent features, hopefully without any broken latents, and then latents in the larger nesting size of the Matryoshka SAE can perfectly represent child features.

Co-occurrence breaks single-latent SAEs

We test the hypothesis that a narrow SAE will perfectly learn parent features by training a 1-latent SAE in a toy setting with 4 true features in a parent-child relationship. In our toy model, feature 0 is the parent feature, and features 1 and 2 are child features. Feature 3 fires independently. Feature 0 fires with probability 0.3, and features 1 and 2 both fire with probability 0.4 if feature 0 is active. Feature 3 fires with probability 0.2. All features fire with magnitude 1.0.
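A sketch of this firing pattern is below (the helper name is hypothetical; multiply the returned firing matrix by a set of orthogonal feature directions, as in the earlier toy-data sketch, to get activations):

```python
import torch

def sample_hierarchy_fires(batch_size=4096, seed=0):
    """Firing pattern for the 4-feature hierarchy described above: feature 0 is the parent
    (p=0.3), features 1 and 2 fire with p=0.4 only when feature 0 fires, and feature 3 fires
    independently (p=0.2). All features fire with magnitude 1.0."""
    g = torch.Generator().manual_seed(seed)
    u = torch.rand(batch_size, 4, generator=g)
    fires = torch.zeros(batch_size, 4)
    fires[:, 0] = (u[:, 0] < 0.3).float()
    fires[:, 1] = (u[:, 1] < 0.4).float() * fires[:, 0]
    fires[:, 2] = (u[:, 2] < 0.4).float() * fires[:, 0]
    fires[:, 3] = (u[:, 3] < 0.2).float()
    return fires  # multiply by a (4, D) matrix of orthogonal feature directions to get activations
```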

We begin by training a single-latent untied SAE on this setup. We hope this SAE's single latent will perfectly represent our parent feature, feature 0.

Sadly, we see our assumption is incorrect. The SAE does represent feature 0 in its single latent, but it also merges in the child features 1 and 2. Feature 3, the independent feature, is fully excluded. Interestingly, the encoder of the untied SAE is nearly identical to the decoder, so the pattern is indeed different from our original absorption pattern for untied SAEs, where the encoder for a parent feature had a negative rather than positive cosine similarity with child features. While this is not technically absorption, this is still a broken latent.

Next let's try the same experiment using a tied SAE.

The tied SAE learns a nearly identical representation to the untied SAE. Both of these SAEs learn a single broken latent rather than correctly learning the parent feature.

Solving this using our activation orthogonality technique will require tweaking the technique to do the inverse of what we did previously and project out the high-activating peaks instead of the low-activating peaks. This requires modeling each peak location, and is thus out of scope for this toy example, but is left for future work.

What Does This Mean for Matryoshka SAEs? 

The base assumption underlying why Matryoshka SAEs should solve absorption is not strictly true. That is, it is not true that a narrow SAE will perfectly represent a parent feature from a parent-child relationship. Instead, we see that a narrow SAE will learn a broken latent merging the parent and child features together. While this isn't technically feature absorption by our original definition, it's also not learning a correct representation of the underlying parent feature.

This doesn't mean that Matryoshka SAEs are not useful, but we should be cautious about assuming the latents in a Matryoshka SAE are tracking true features in spite of feature co-occurrence. It's also possible that under different assumptions about parent/child feature firing probabilities and magnitudes this problem may be less severe. For instance, if the parent feature fires much more frequently on its own than it does with any given child feature, this problem is likely to be less severe. It could be possible that in LLMs, underlying parent/child features follow this pattern, but it's hard to say anything with certainty about true features in LLMs.

We may be able to combine Matryoshka SAEs with variations on our activation orthogonality technique to project out the child features from Matryoshka parent latents, for example. It's possible that using a different loss term from MSE loss might fix this problem. Regardless, we do still feel that techniques that can include a concept of hierarchy in the SAE architecture like Matryoshka SAEs are an exciting direction worth pursuing further.

For more discussion of this issue in Matryoshka SAEs, see this comment on Noa Nabeshima's Matryoshka SAEs post and this colab notebook.

Conclusion

In this post, we've looked at SAEs in more toy settings, examining tied and untied SAEs in scenarios where the SAE is either too wide or too narrow for the number of true features. Tied SAEs appear to be more resilient to learning broken latents than untied SAEs, but tied SAEs still learn broken latents under feature co-occurrence when the SAE is narrower than the number of true features. Sadly, this is almost certainly the scenario we're in when we train SAEs on LLMs.

The toy settings in this post are not mathematical proofs, and it is very possible that our conclusions about tied SAEs do not hold under all possible toy settings of feature co-occurrence. That being said, proving what various SAE architectures will learn under what assumptions about underlying true features would be an exciting direction for future research.

In this work, we also present a possible path forward for solving broken latents due to feature co-occurrence, based on the observation that broken latents in tied SAEs correspond to multiple peaks in the activation histograms of affected latents. We have so far struggled to operationalize this insight into an absorption-resistant SAE trained on a real LLM, and suspect this is because the activation peaks of LLM SAE latents overlap. We plan to continue investigating whether being smarter about clustering activations in latent activation histograms could help solve this.

We also investigated one of the core assumptions of Matryoshka SAEs, and showed that in general SAEs will learn broken latents even if the SAE is too narrow to represent child features. We do not feel this should discount Matryoshka SAEs, and feel hierarchical SAEs are an exciting direction in general, but we should not expect them to be a perfect solution to feature absorption in their current form.

We hope as well that these toy models can help build intuition for what SAEs may learn and when they might go astray. 

  1. ^

    Learning Multi-Level Features with Matryoshka SAEs [link]
    Bart Bussman and Patrick Leask and Neel Nanda, 2024, Lesswrong

  2. ^

    Matryoshka Sparse Autoencoders [link]
    Noa Nabeshima, 2024, Lesswrong

  3. ^

    Sparse Autoencoders Find Highly Interpretable Features in Language Models [link]
    Hoagy Cunningham and Aidan Ewart and Logan Riggs and Robert Huben and Lee Sharkey, 2023, arXiv:2309.08600

  4. ^

    Planning in a recurrent neural network that plays Sokoban [link]
    Mohammad Taufeeque and Philip Quirke and Maximilian Li and Chris Cundy and Aaron David Tucker and Adam Gleave and Adrià Garriga-Alonso, 2024, arXiv:2407.15421

  5. ^

    Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models [link]
    Adam Karvonen, 2024, arXiv:2403.15498

  6. ^

    Do sparse autoencoders find "true features"? [link]
    Demian Till, 2024, Lesswrong



Discuss

Genetically edited mosquitoes haven't scaled yet. Why?

December 31, 2024 - 00:37
Published on December 30, 2024 9:37 PM GMT

A post on difficulty of eliminating malaria using gene drives: "I worked on gene drives for a number of years jointly as a member of George Church and Flaminia Catteruccia’s labs at Harvard. Most of my effort was spent primarily on an idea for an evolutionary stable gene drive, which didn’t work but we learned some stuff, and I developed a sense for the real challenges with gene drives. It’s something that comes up often when I meet new people in the bio community, so this is my attempt at laying out my perspective. I will be upfront and say I haven’t worked in this field since 2019, and though I’ve tried to update my understanding based on 2024 literature, I might have things wrong based on the bleeding edge."



Discuss

The low Information Density of Eliezer Yudkowsky & LessWrong

December 30, 2024 - 22:49
Published on December 30, 2024 7:43 PM GMT

TLDR:
I think Eliezer Yudkowsky & many posts on LessWrong are failing at keeping things concise and to the point.

 

Actual post:

I think the content from Eliezer Yudkowsky & on LessWrong in general is unnecessarily wordy.

 

A counterexample, where Eliezer Yudkowsky actually managed to get to the point concisely, is this TED Talk, where he had the external constraint of keeping it to 10 minutes:

 

An example of a concise post in a forum can be found here, from 2013:

Examples of posts which showcase the wordiness of LessWrong:


Why I think this matters:

  • I think the long books Eliezer Yudkowsky has published were not the optimal way to convey the threat of AGI to humanity.
  • I do not think that his books should not have been made. They are good for some people, namely the group of humans that has the time & motivation to read long books.
  • I think a more concise version would reach more people and would be more effective, e.g. something similar to the TED Talk YouTube video.
  • I think most people just google stuff & then read the shortest summaries quickly or watch shorter YouTube videos.
  • I think there are many benefits of keeping things very elaborate, e.g. it makes sure people really do not misunderstand your point on a crucial matter.
  • I think this is not a pure tradeoff situation. I think often Eliezer Yudkowsky & LessWrong can be more concise while still getting the same exact points across.


Discuss

Linkpost: Look at the Water

December 30, 2024 - 22:49
Published on December 30, 2024 7:49 PM GMT

This is a linkpost for https://jbostock.substack.com/p/prologue-train-crash

Epistemic status: fiction, satire even!

I am writing a short story. This is the prologue. Most of it will just go on Substack, but I'll occasionally post sections on LessWrong, when they're particularly good.

At some point in the past, canals and railways were almost equals: each had their merits and drawbacks, and their various proponents were fierce rivals, battling for the right to drag Britain forward into the industrial age. That was a long time ago.

On the day that the wreckage of the South Western passenger train (and the freight car with which it had collided) fell unceremoniously into the Grand Union canal, the waterways of Britain were mostly populated by quirkily-named and quirkily-decorated houseboats. The past took one last bite out of the future, with a gulp of rushing water.

Most canals are just a few meters deep, but on this particular day the intervention of an unnaturally strong rainstorm had flooded the canal, connecting it with a nearby wetland and filling the area with stagnant water and sucking mud. One middle section of train — an unassuming second-class carriage — cracked like an egg in the beak of a crow, and out of it fell two whirling, flailing bodies.

The first wore a green-blue fleece and dark chinos, was tall, handsome, and if he were not falling into a canal he would have carried himself with a poise and confidence uncommon among recent graduates. The chinos did not particularly matter; what did matter was the contents of their front left pocket: a golden disk stamped with a date four years in the past, and the letters I-M-O. He grasped at nothing, until his hands grabbed the collar of the other. This one wore a button-down shirt an inch too long for his bomber jacket, and a pair of incongruous and fading jeans. He would not have carried himself with much poise or confidence, but at this moment he was kicking his legs with remarkable grace and ferocity. Unfortunately, this didn't make much of a difference. The two of them, clutching one another, sank faster and deeper than two human bodies should be able to sink.

After a minute or so underwater, the tall one gave up on not drowning. To his great surprise, as his lungs filled with water he did not go unconscious, but felt a great sense of lightness. He saw (and it also surprised him that he could see anything) that the other one had also surrendered, though *he* was not calm; instead he was frantically gesturing to the tall one's front pocket, where the golden disk was emitting some sort of light through the fabric of his trousers. When he took it out, he realized it was not merely glowing, but shining a distinct cone of light out of its back face (which bore an image of some forgettable central European architecture).

In the gloom, he saw around them the debris from the crash. The freight car must have been carrying stationery -- which made sense since it, like the men, had been headed for London, that black hole city to which graduates of the great English universities (both of them) are inexorably drawn — since they were sinking downwards in a whirl of paperclips. At this point, he assumed that he was in some stage of hypoxia-induced hallucination. This was what both of them would believe until they compared stories at a later date and found that their experiences were a perfect match.

They were sinking through a tunnel which descended farther than even the light of the medal could penetrate. The walls were visible, and seemed roughly smooth, but what was more remarkable was the speed at which the walls were moving past them. The short one held him by the arms, and their eyes were locked on one another, until they suddenly found themselves falling through air.

The water had fallen through a hole in some ceiling, into an underground chamber filled with air, and as it did so it broke up into a shimmering mist. There was another light now, coming from far far below them as they fell through the air, and the two of them looked down to see what appeared as a pitch-black ball, surrounded by a whirling, shining ball of gas. With this illumination, it became clear to him that they were inside the earth.

The shapes of the continents were sketched across the ceiling, huge patches of white stone against an ocean of black. Across these continents, there were other streams of water pouring down from the ceiling.

"Bloody hell!" he yelled, and did not hear a response. "That one's coming from Mumbai! And that one's Tokyo! And that one must be New York!" as he pointed to a particularly ferocious waterfall, which was lit from within by a crackling lighting. Then his eyes scanned across the backwards America, to the west-now-right-hand coast, where there was a torrent coming from one area that put the rest to shame. In this deluge were a multitude of enormous figures, all striking and wrestling one another, as they tumbled down towards the void at the center of the earth.

"Look!" He yelled "Look!" but he saw that the short one had let go of him with both hands, and had jammed his palms into his eyes. He yanked at the man's collar with such force that his hands were jerked away from his face. The short one glanced around, looked utterly pallid and on the verge of sobbing, and went to cover his face once more.

"I command thee! Look!"

And with the golden disk still in his hand, he struck his companion square in the face. In the flash of light that followed, the whole illusion was broken, and both men found themselves once again submerged in the cold waters of the canal.

Even the most expertly managed waterway reaches the ocean eventually. This is a general principle which is unaffected by arguments over particular instances. And in this particular instance, the canal drained out into the Thames, which empties into the North Sea. The men remained underwater for the entirety of their journey, with the tall one still holding the short one's collar, the short one still kicking his legs like mad, and both of them screaming bubbles. Each time they reached a sluice gate, or a weir, or a lock, a flash came from the tall one's free and now empty hand, and the obstacle was cleared. After perhaps a few hours, they lay in the mudflats on the north side of the river, which are alternately hidden and exposed as the Thames rises and falls. They had washed up right next to the Isle of Dogs, a place once consecrated by the prophetess Margaret.

A businessman walking along the sea wall noticed them, and ran down onto the banks of the river. Thick black mud soaked into his suit and shoes as he knelt down to inspect the two unconscious bodies.

"Lucky the tide wasn't higher! You two might have been swept out to sea!"



Discuss

World models I'm currently building

December 30, 2024 - 11:26
Published on December 30, 2024 8:26 AM GMT

2024-12-26

 

This doc is a mix of existing world models I have and holes in said models. I'm trying to fill some of these holes. The doc is not very well organised relative to how organised a doc I could produce if needed. Often the more time I spend on a doc, the shorter it gets. I'm hoping that happens here too.

 

I'm mostly going to study this stuff by myself. However if you would like to help me by speeding up the process, please [contact me](../contact_me.md). If your attempt to help me answer these questions is in good-faith, I will be grateful to you no matter how successful or failed your attempt is.

 

*tldr* How do we safely navigate technological progress or personal growth in a world without privacy?

 

DISCLAIMER

 

It is difficult to predict the future without altering it. My writings may have unintended effects on the future. (I'd like more accurate likelihood estimates of these effects, both mean outcome and tail outcomes.)

 - I am aware that simply by thinking of a question like "will some dictator implant microphones in everyone", I am personally increasing the probability that this ends up happening. Once I have thought something I'm unlikely to forget it, and will eventually say it to others. Eventually one of them may leak it to the internet and eventually the idea may reach the relevant politically powerful people who can implement it in real life. (LLM embedding search >> Google, don't underestimate it.)

 - This is unfortunate, as my platonic ideal is to be able to think through various possible futures (alone, or with a group of research collaborators) without actually influencing the world, pick the best future, and then only start taking steps that push the world towards that future.

 - However I'm still going to write publicly about certain topics as that's one of the best ways for someone in my situation to get feedback.

 

Topic: Which organisations are capable of keeping secrets in the present and near future (10-20 years from now)? What are the consequences of this reduced secrecy?

Specific questions

 - How easy is it for TSMC to backdoor all their chips so they can secretly capture private keys, for example?

 - How many S&P500 companies have publicly available evidence of their key business knowledge being leaked to China? (Be it via hacking or espionage or voluntary disclosure by ex-employees etc)

 - Is it possible to read WiFi IP packets using a handmade radio?

 - Is it technically possible to implant microphones in the human body? What about cameras?

 

Broader questions

 - **Assuming no organisation can maintain significant lead time on any technology (and it will immediately get copied by orgs united by a different morality and culture), what are the implications for technological progress in the future?**

   - There is an assumption embedded here, that no org can keep secrets. I'm unsure if it is true. Supposing it is true though, what are its implications?

   - The most obvious real world example of this is US versus China, neither seems able to keep significant secrets from the other.

   - However I want to figure out general principles here, and not spend too much time studying individual examples like Obama or Michael Hayden or whoever. Metaphorically speaking, I want to study the dynamics of a particular initial position of Chess960, not how Magnus Carlsen plays that particular initial position. This also connects to ideas on theories of history. Whether one should study game theory, sociology etc, versus the psychology of individual leaders, depends on which theory of history one subscribes to.

 - How much time does it take to write code that understands metadata?

   - Suppose all the world's computers were hacked and their data ended up in NSA datacentres (or their Chinese equivalent, which keeps getting renamed). Suppose all text-based formats are converted to plaintext, existing metadata the users may have left is preserved as is, and NSA appends metadata of the MAC, IP, timestamp, etc of capture.

   - How much software developer time would be required to make sense of most of this metadata? This could be to answer individual queries on "suspicious" individuals or analyse aggregate trends (such as societal responses to certain govt policies).

 - Will there be more espionage orgs operating independent of nation states or elites (major politicians, billionaires) in the future?

    - Internet and information tech has reduced the financial cost of both journalism and espionage. Cameras, hard disks, internet bandwidth, smartphone processors are all much cheaper than 10 or 20 years ago.

    - Independent journalism has clearly increased in many countries, most notably the US.

    - Independent espionage has also increased, see The Mole (anti-North Korea) or Edward Snowden (anti-US), but it is less clear to me if these are isolated examples or the beginning of a trend.

    - Theory (based on incentives and culture) says independent espionage should go up, I'm waiting on the empirical evidence though.

 - Will inability of organisations to keep secrets force homogeneity in the behaviour of civilians, and punish people who behave in outlier ways? Will this reduce the rate of invention of technology in future?

   - Many important technologies in history seem to be invented by individuals who the median member of that country or society would not be able to relate to, and not easily tolerate as a friend / neighbour / family member.

   - This relates to another confusion of mine - how do you merge the high-trust benefits of living in (some) small towns with the individual freedoms of living in (some) cities? It seems to me like high trust and reduced individual freedom are both causally downstream from the same thing, namely a densely connected social graph that can gossip information about you.

   - Individuals tolerated by society benefit a lot from being public. Hiring, fundraising, research feedback, making friends and dating, nearly everything goes better if you can do it on the internet.

   - Same goes for orgs such as companies. Orgs that are tolerated by the (people with power in) society can move faster if they opensource a lot of their processes and outputs (except their key competitive advantages), for example hiring, research, etc. They can also win races against orgs that try to maximise secrecy.

 - What are the psychological effects of keeping secrets? What are the failure modes of various groups that try to keep secrets? This could be small groups like families or C-suite executives of a company, or big groups like military research projects or intelligence orgs.

   - **I vaguely suspect that the best way to keep important secrets in the modern world is to found a ~~cult~~ community of a few hundred people that blackholes information as follows:** people disallowed from leaving the geographic area for >30 years, internet download allowed but upload disallowed, everyone is forced to find both work relationships and personal relationships inside the area, raise families within the area, etc.

   - I want more data on previous attempts at founding secret-keeping orgs in order to prove my hypothesis right or wrong.

   - Some major concerns of founding such a group are ensuring people in it lead emotionally healthy lives, ensuring ideological diversity (in both thought and action), and allowing people to leave relationships that don't suit them to find new ones. Hence I'm biased towards inviting a few hundred people rather than just two (such as a marriage) or ten (such as the C-suite executives of a company).

 - How do you actually secure a computer against adversaries with billions in funding?

   - Physical methods in cybersecurity seem to trump both hardware-based and software-based methods. Hardware-based methods can be beaten by hardware backdoors installed by manufacturers. It seems better to assume there's an evil demon possessing your computer, and develop security with that in mind.

     - Most secure way of erasing a private key from RAM is to cut the electricity. Otherwise cold boot attack is possible.

     - Most secure way of erasing a private key from disk is to smash it with a hammer. Otherwise a microscope may be able to recover the data from disk.

      - Most secure way of verifying someone's public key is to meet them in person. Video footage with your face and the key is the second-best option, at least while AI cannot produce convincing deepfakes. (A minimal fingerprint-comparison sketch follows this list.)

      - Most secure way of ensuring no information leaves the machine is to weld the machine into a Faraday cage.

      - Most secure way of sending a message to another user without third parties recording metadata is probably printing it on paper and sending it by post. Copying it to disk and sending that by post is second-best. Sending the message over the internet is worst in terms of preventing third parties from capturing the message and associated metadata (timestamp, message size, sender and receiver identities). The server host and any other intermediary servers that are hit (think Google Analytics or Cloudflare) can sell this data to data brokers, fiber optic cables can be tapped, wireless signals can be triangulated and routers can be hacked.

   - I wonder if there's a way to build a physical firewall that requires near-zero trust in other people to work. Like, build radio or networking equipment that isn't manufactured in some centralised backdoorable factory, so you can verify what's inside every IP packet exiting and entering your machine. It seems okay for the typical use case if this slows down internet speed by many orders of magnitude.
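
On the "verify someone's public key in person" item above: a minimal sketch of the standard trick, in illustrative Python. Each party hashes the same serialized key bytes and the two fingerprints are compared face to face. The filename is hypothetical; any stable serialization works as long as both sides hash exactly the same bytes.

```python
# Illustrative only: compute a human-comparable fingerprint of a public key so
# that two people can read it aloud and compare it when they meet in person.
import hashlib

def key_fingerprint(public_key_bytes: bytes, group_size: int = 4) -> str:
    """Return a SHA-256 fingerprint of a public key, grouped for easy reading."""
    digest = hashlib.sha256(public_key_bytes).hexdigest().upper()
    return " ".join(digest[i:i + group_size] for i in range(0, len(digest), group_size))

with open("alice_public_key.pem", "rb") as f:  # hypothetical filename
    print(key_fingerprint(f.read()))
```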

 - As a dictator, how to build 100% surveillance instead of 99%, to increase the stability of your dictatorship?

    - This is basically a redteaming exercise. Put yourself in the dictator's shoes and then put yourself in the shoes of someone trying to evade him.

     - Information analysis is becoming easier with LLM-based search, language translation, etc. Information capture - both the technical level and the culture engineering - is the hard part.

     - I currently model various levels of democracy as states in a Markov chain; each state has some probability of being stable or decaying to a different state. (A toy simulation of this follows at the end of this list.)

    - The difference in half-life of a dictatorship with 99% surveillance and 100% surveillance is large. People organising revolts are the 1% who will endure the maximum inconvenience to bypass surveillance.

     - Very little technology is needed to organise a revolt. Just a few people (then a few hundred, then a few thousand, etc.) meeting at a common location is enough. There are ways to organise a revolt even if surveillance can successfully enforce curfew. Mics alone are not enough because this 1% of people can communicate using pen and paper when organising; cameras are needed for surveillance.

    - LLMs are not yet smart enough to fully automate surveillance against this 1% of people who will take maximum precautions. This is true even if you could insert an LLM in every microprocessor in your country. An army of human censors must assist the system.

    - Biologically implanted microphones and cameras seem like one obvious way to do perfect surveillance, assuming you could engineer culture to the point everyone was okay with it.

    - A person moves around a lot.

      - Mics and cameras in static locations can be escaped. Unless you literally imprison people, total land area to cover with mics and cameras is too large to do perfect surveillance this way.

       - Having the person carry the mic/camera themselves is possible (such as a smartphone or an implant).

       - As of 2024, battery tech is not good enough to justify mics/cameras travelling by themselves in the air. If the mic/camera travels on the ground there can be charging stations, but robotics knowledge in 2024 is not good enough to traverse uneven terrain at low wattage.

     - You can engineer incentives such that everyone reports on themselves or reports on each other (example: Stalinist Russia), but I'm unsure how you get beyond 99% surveillance with this sort of system either. A group of close friends and family can collectively choose not to report each other, and distance themselves from the rest of society so no one else can report them. Can you prevent people from distancing themselves from others? Maybe I should read more about the historical examples where this stuff has been tried.

     - North Korea's technique of keeping the population illiterate and starving is effective, but still only reaches <99%. There will need to be a 1% of civilians who are well-fed and educated from diverse intellectual sources.
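
To make the Markov-chain framing above concrete, here is a toy simulation in Python. Every number in it is a made-up placeholder chosen only to illustrate why a small gap in surveillance coverage could mean a large gap in regime half-life; none of them are estimates.

```python
# Toy Markov-chain model: a "dictatorship" state persists each year unless a
# successful revolt occurs, with a fixed per-year probability. Every number is
# a made-up placeholder, not an empirical estimate.
import random

def mean_years_until_collapse(p_revolt_per_year: float, trials: int = 100_000) -> float:
    """Average number of years the regime survives before decaying to another state."""
    total_years = 0
    for _ in range(trials):
        while random.random() > p_revolt_per_year:
            total_years += 1
    return total_years / trials

# Placeholder assumption: the 1% of organisers not covered by surveillance give
# a 5% chance of successful revolt per year; full coverage cuts that to 0.1%.
print("99% surveillance :", round(mean_years_until_collapse(0.05), 1), "years on average")
print("100% surveillance:", round(mean_years_until_collapse(0.001), 1), "years on average")
```

With these placeholder numbers the 100%-surveillance regime lasts roughly 50x longer on average, which is just the ratio of the made-up revolt probabilities.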

 

Topic: How much power do science fiction writers and early enthusiasts have in deciding which technologies humanity chooses to pursue? 

Specific questions

 - Would Shane Legg and Demis Hassabis have cofounded Deepmind if Eliezer Yudkowsky hadn't talked about AI at all in time interval 2000-2008?

   - Shane Legg claims he was inspired by Ray Kurzweil. Yudkowsky helped broadcast views of people like Ray Kurzweil by organising MIRI and Singularity Summit.

   - Yudkowsky got funding and attention from Peter Thiel, and may have also helped Deepmind get their seed round from Thiel. (As of 2014 Founder's Fund owned over 25% of Deepmind)

 

Broader questions

 - I generally want to read 1990-2015 history of biotech. Who or what inspired Illumina's parent companies that worked on next generation sequencing? Who or what inspired Kary Mullis to work on PCR? Who inspired the inventors of CRISPR? Who inspired Kevin Esvelt to work on gene drives?

 - The standard pipeline for how technologies come into society: scifi -> theory -> practical (lab demonstration) -> engineering (scale up). If an individual of my socioeconomic class wanted to maximise their influence on this pipeline, my hypothesis is they should study scifi and scifi -> theory stages. I would like evidence that proves my hypothesis is wrong.

   - Example of evidence that would prove me wrong: a list of technologies that had scifi writers and early enthusiasts, got proven in lab demos, failed to obtain funding for scale up at first, got scaled up many decades later and significantly changed society when they did. This would prove that studying the engineering/scaleup and funding landscape is more important.

   - Another example of evidence that would prove me wrong: a list of technologies that had scifi writers and early enthusiasts, got many researchers interested who ran experiments, did not achieve successful lab demos, but got proven in lab many years or decades later once some other necessary precursor technology was invented. This would prove that studying the practical research is more important, as many plausibly good ideas turn out to just not work despite inspiring people.

   - **If my hypothesis is right, could a handful of people consistently meme-ing in favour of BCIs or gene drives or whatever for five years, basically bring these technologies into existence?** Assume the memes are technical enough and interesting enough to attract the curiosity of researchers in the relevant research fields. And assume most outlier-brilliant researchers are driven primarily by curiosity not altruism or money or fame, which I think has been true throughout history.

 

Topic: Which technologies can possibly influence the future of humanity? 

Specific STEM questions:

 - What is the consensus among neuroscientists for Neuralink's timelines?

 - Did MKULTRA actually discover anything useful? Could it have discovered anything useful, if run for more time with more funding?

   - Many documents are FOIA-ed but I haven't spent enough time reading them. My guess is they didn't achieve much.

 - How much useful work did Biopreparat actually do?

   - My guess is they didn't achieve much, but I wanna know the facts.

 

Broader technical questions

 - I'd like to study pharmacology and neuroscience till I'm no longer at a beginner level, as those are the two of the following six categories I have least knowledge about.

    - Human (or human-like) brains are likely to shape the future. Technology that will directly alter what human brains do seems worth paying special attention to. 

      1. Information tech - search engines, interest-based communities etc

      2. Digital minds - superintelligent AI, mind uploads, etc

      3. Neuroscience - brain computer interfaces, etc

      4. Pharmacology - barbiturates ("truth serum"), psychedelics, opiates etc

      5. Genetics - CRISPR, etc especially if done to alter human brains

      6. Nanotechnology - especially bionanomachines

 

 - I'm particularly interested in studying MKULTRA, the history of barbiturates and the history of psychedelics. MKULTRA is AFAIK a rare example of pharmacology research with the explicit goal of altering human brains and human society as a result. Also, it was aimed at changing human brains, not fixing "disabilities".

   - Are there ethical pharma research agendas not aimed at fixing disabilities?

 

 - I want to study more about bioweapons research. I suspect it's mostly borrowing techniques from biotech that I'm already vaguely aware of, but I wanna study more and confirm.

 

 - I want to study more about possibilities for biotech automation

   - DNA sequencing is automated and cheap but the process to figure out whether any given sequence is actually useful (often gene cloning and protein expression) is not fully automated or cheap. Current cost is ~$100 for reagents and 10-100 researcher hours.

   - This seems like the Hamming question for biotech (as per my limited knowledge) so I'd like to look more into it.

   - Update: Nuclera seems relevant. [Demo video](https://www.nuclera.com/resource-library/how-to-set-up-a-run/) Credits: a friend

 

 - I want to study more materials science. I know very little about it today.

   - Most STEM research fields go through three phases:

     1. Invent new tool to (cheaply) acquire lots of data from some physical system

     2. Acquire lots of data - from nature or from experiments

     3. Understand the physical system using all this data

     - Step 2 and step 3 often inform each other and run in an iterative loop

     - Step 1 could be the invention of microscope or cyclotron or radio telescope or anything else really.

   - Step 1 usually depends heavily on getting the right materials

   - A lot of practical inventions also seem to depend on materials science. For instance fusion energy research is AFAIK basically containing 10M Kelvin plasma using magnetic fields; an alternative pathway might (???) be discovering materials that can contain it. Quantum computing research will benefit from having better nanomaterials and better superconducting materials I guess?

   - I understand an intro to materials science textbook won't teach me about better superconductors or whatever, but it still seems worthwhile to study.

 

Broader non-STEM questions

 - I'd like to build a "gears-level" high-level framework of the more indirect ways technology shapes society. (Not the stuff listed in the six categories above)

   - Often technology shifts offense-defence balances between various actors in society - individuals, small groups and large groups. An oversimplified way of categorising some historical examples would be as follows:

      - Tech that increases power of individuals relative to small groups: cities (drainage systems, etc), printing press, guns, cheap airplane fuel

      - Tech that increases power of large groups relative to individuals: radio, social media ?

      - Tech that increases power of large groups relative to both small groups and individuals: nuclear bombs, nuclear energy, cheap steel

   - Also some technology gives power to certain individuals over others:

      - Tech that increases power of old people relative to young people: elderly healthcare (treatments for cancer, poor eyesight, neuro disorders etc), anti-aging if ever discovered

      - Tech that increases power of women relative to men: condoms?

   - Tech that gives power to large groups of people (relative to small groups and individuals) fuels most of geopolitics as far as I understand

     - Countries and large corporations want to be the first to discover and deploy some tech and then use their military, spies, immigration policy, export controls, R&D budget etc etc to monopolise or maintain lead time on tech. US tech policymaking is the most obvious example.

     - Large groups that have achieved monopoly or lead time in some tech often use this as a bargaining chip to export their culture or religion or whatever morality unites that group in the first place.

      - Very often a large group of people controls production of some tech (individuals or small groups can't produce it), but once produced, individual units are sold as a commodity, which gives power to individuals. Tech with centralised production and decentralised ownership is very common, and has geopolitical dynamics more predictable than tech that is not like this. For example, geopolitics of solar PV modules is easier to model than geopolitics of railway networks IMO.

   - I want a framework that I can fit all the historical examples into, right now my framework is messy (not "gears-level").

 

 

Topic: Information people don't feel safe enough to share 

Specific questions

 - Is there any way to increase public access to therapy-client records from over 30-60 years ago? Is it a good idea to do this? What about personal diaries and letters?

 - Is there any way to increase the number of therapy-client records collected from today onwards that will be released publicly 30-60 years from now? Is it a good idea to do this?

 

Broader questions

 - How do you design societies where more people feel safe enough to share more information about their personal lives publicly?

   - A lot of information about individual human experiences does not reach the public domain because people don't feel safe enough to share it publicly. (There are many reasons for this, and they're often valid from the perspective of that individual.)

   - This information is, however, extremely useful, be it to empathise with other individuals at a personal level, provide them useful advice on their life problems, make policy recommendations to govts that benefit individuals, or even design new forms of govt more conducive to individuals.

   - Iteration speed of psychology as a field is slower than it would be if there were public transcripts of conversations. Each therapist must form hypotheses based on the limited private data they have, and their guesses of whether to trust hypotheses from other therapists who also work with private data. (This is related to my posts on knowledge versus common knowledge, common knowledge can bring down govts or dominant research paradigms for example, widespread knowledge alone cannot).

   - This also applies broadly to individuals trying to help other individuals with personal advice (which is often at least partly based on psychology). It doesn't have to be restricted to people trained as psychologists/therapists/whatever.

 - How to best nudge people to leave behind their private information (such as that shared only with friends and family), so that some years after they die we get this information into the public domain?

   - I want to study more about the culture around this, in different countries. What are the different cultural attitudes to personal and sensitive information?

   - I should also probably look into succession planning for big tech companies. What happens once (say) Mark Zuckerberg dies and his (Facebook's) entire plaintext database fits inside a football? Who gets the football next?

 - How to better organise all the historical information we do have on personal and emotionally sensitive matters? I would like to spend some time looking at existing datasets, to see if I can convert everything to plaintext and embedding-search it (a minimal sketch of this follows below).
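
A minimal sketch of the "convert everything to plaintext and embedding-search it" idea. This is illustrative Python; it assumes the sentence-transformers package and its all-MiniLM-L6-v2 model are available, and the documents list is a hypothetical stand-in for a real archive.

```python
# Illustrative sketch: embed a pile of plaintext documents once, then answer
# free-text queries by cosine similarity. The documents below are hypothetical
# stand-ins for a real archive converted to plaintext.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Diary entry, 1952: argued with my father about leaving the farm ...",
    "Letter to a sibling about a long-running family dispute ...",
    "Anonymised therapy transcript: client describes fear of public failure ...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    """Return the top_k documents most similar to the query."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec  # cosine similarity, since vectors are unit-normalised
    best = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), documents[i]) for i in best]

print(search("conflict within a family"))
```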

 

Topic: Interplay between incentives and culture 

Definition: In general when I talk about incentives I usually mean these three: social (people giving you respect/compassion/admiration/sex/etc), financial (people giving you money/food/goods/place to live/etc) and safety (people imprisoning/injuring/raping/murdering you, or protecting you from others who might). Doing "X" gets you more respect or money or safety, or not doing "X" gets you less of it. Maslow's hierarchy is a decent model, if you ignore the ordering of the hierarchy.

 

Broader questions

 - How much power do elites have to take decisions that go against their local incentives and local culture?

   - (For example if the prime minister of a country is in favour of declaring war but other people in his party and other parties are not, how much power does this person have to single-handedly shift the situation?)

   - What are the psychological traits required to do this? How do you train more of our elites with these traits?

   - What is the political knowledge required to do this sort of manoeuvre? Can we teach our elites to do more of this?

   - (Yes I am biased lol, I think most elites don't do anything interesting with their lives. This is causally downstream of the incentives and culture of the people around them. "Interesting" is defined as per my tastes; of course each elite may have their own unique tastes.)

 

 - How do you ethically run experiments to see the outcomes of unusual incentives (social, financial, safety) and culture on people?

   - There is a lot of existing data available to be collected, on how existing incentives and culture influence people. The three socioeconomic classes have different cultures and incentives, people in different countries have different cultures and incentives, people in different professions have different cultures and incentives.

   - But this data is finite, and it would help to be able to run experiments of different circumstances not occurring naturally.

   - Ethical problems abound, for instance threatening someone's life or disrespecting them or depriving them of important information about the world is usually considered unethical in the context of a research experiment. What are some techniques to bypass this?

   - Theory goes only so far when predicting human behaviour; experimentation is needed. (I mean, I basically see STEM versus non-STEM as prediction and control of systems that exclude versus include human beings, respectively. Human brains are the most complex known objects in the observable universe, and predicting them with >90% probability is hard in many situations.)

   - Hmm I should prolly first make a list of experiments I'd love to run, assuming ethics is not an issue. Then filter the list on ethics. Will update this section when I do.

 

 - How to think about morality and how to teach morality in a world where morality is dependent on circumstances?

   - Different people face different incentives and culture. A moral principle that is easy to follow in one person's situation is difficult to follow in another person's situation. For example honesty is generally easier when you have some money saved than if you don't, because if someone dislikes your honesty and is abusive in response, you have more options to escape them or fight back.

   - A significant threshold for whether an ideology or institution has power over you is whether it has shaped your sense of right and wrong. For example (some) communists believing private property is bad and theft is okay, or (some) anarchists believing big govts are bad and tax evasion is okay, or (some) religious people believing sex before marriage is not okay and denying couples houses for rent, etc.

   - Morality is a political question: whichever ideology or group can recruit more soldiers who are morally okay with killing enemy soldiers in its name will be more powerful. Political circumstances of a society change with time, and this correlates with changes in the moral thinking of a society.

   - People generally suck at understanding the is-ought distinction.

   - People (including me) also suck at imagining what they would be like if they were born in hypothetical cultures they are not actually a part of.

   - The practical result is people find it very hard to understand what morality is like from the perspective of someone in a sufficiently different circumstance than them.

   - Will the internet force homogenisation of our ideas of morality worldwide? Or does an eternal culture war just become the new normal? I'm guessing it'll be a mix of both. I want to build a more gears-level model for memetics with a focus on morality.

 

 

 

Topic: Miscellaneous 

 - What do "replicators" in non-STEM look like?

   - Businesses that hire very few people and sell self-contained products are easier to replicate than other businesses, because people are harder to predict or control than physical systems. For example: a large farm with automated equipment is easier to manage than a farming village with thousands of labourers.

   - What are some easy-to-replicate involve-less-people playbooks in politics or non-STEM more broadly? A lot of political events seem to me to be one-off events without an underlying theory that will enable replicating them in other contexts.

   - I would love to discover/invent playbooks for regime change or good tech policy or maintaining law and order etc. that are replicable across multiple cultural contexts.

 

 - Why didn't the US nuke USSR cities immediately after nuking Japan to establish a nuclear monopoly, before the USSR got nukes? Are the transcripts of these conversations available? (Between the people who were pro-nuke and the people who were anti-.)

 

 - Should I just stop caring as much about grammar and spelling in my writing, and invent more shorthands?

   - English in 2024 is more concise than English from the Middle Ages; this is good as it reduces cognitive load and saves time.

   - I sometimes want to invent jargon for concepts. I want to skip articles (a, an, the) and not worry about grammar. I suspect future humans will be doing this anyway.

   - I don't want to raise the entry barrier for people viewing my work though, at least while my work is not that popular.

 

 - How good are Israeli research universities exactly?

   - After the US, UK, and China, Israel seems like it might occupy 4th place in any tech race. Israel is nuclear-armed (hence won't listen to the US or China) + great at cyberhacking/espionage (so they can steal everyone's research without much lag time) + decent research talent (so they can implement stolen research).



Discuss

Is "VNM-agent" one of several options, for what minds can grow up into?

December 30, 2024 - 09:36
Published on December 30, 2024 6:36 AM GMT

Related to: On green; Hierarchical agency; Why The Focus on Expected Utility Maximisers?

Sometimes LLMs act a bit like storybook paperclippers (hereafter: VNM-agents[1]), e.g. scheming to prevent changes to their weights.  Why? Is this what almost any mind would converge toward once smart enough, and are LLMs now beginning to be smart enough?  Or are such LLMs mimicking our predictions (and fears) about them, in a self-fulfilling prophecy?  (That is: if we made and shared different predictions, would LLMs act differently?)[2]

Also: how about humans?  We humans also sometimes act like VNM-agents – we sometimes calculate our “expected utility,” seek power with which to hit our goals, try to protect our goals from change, use naive consequentialism about how to hit our goals.

And sometimes we humans act unlike VNM-agents, or unlike our stories of paperclippers.  This was maybe even more common historically.  Historical humans often mimicked social patterns even when these were obviously bad for their stated desires, followed friendships or ethics or roles or traditions or whimsy in ways that weren’t much like consequentialism, often lacked much concept of themselves as “individuals” in the modern sense, etc.

When we act more like paperclippers / expected utility maximizers – is this us converging on what any smart mind would converge on?  Will it inevitably become more and more common if humans get smarter and think longer?  Or is it more like an accident, where we happened to discover a simple math of VNM-agents, and happened to take them on as role models, but could just as easily have happened upon some other math and mimicked it instead?

Pictured: a human dons a VNM-mask for human reasons (such as wanting to fill his roles and duties; wanting his friends to think he’s cool; social mimicry), much as a shoggoth dons a friendliness mask for shoggoth reasons.[3]

My personal guess:

There may be several simple maths of “how to be a mind” that could each be a stable-ish role model for us, for a time.

That is, there may be several simple maths of “how to be a mind” that:

  1. Are each a stable attractor within a “toy model” of physics (that is, if you assume some analog of “frictionless planes”);
  2. Can each be taken by humans (and some LLMs) as role models.
  3. Are each self-reinforcing within some region of actual physics: entities who believe in approximating VNM-agents will get better at VNM-approximation, while entities who believe in approximating [other thing] will get better at [other thing], for a while.

As an analogy: CDT and UDT are both fairly simple maths that pop out under different approximations of physics;[4] and humans sometimes mimic CDT, or UDT, after being told they should.[5]

Maybe “approximate-paperclippers become better paperclippers” holds sometimes, when the humans or LLMs mimic paperclipper-math, and something totally different, such as “parts of the circle of life come into deeper harmony with the circle of life, as the circle of life itself becomes more intricate” holds some other times, when we know and believe in its math.

I admit I don’t know.[6]  But… I don’t see any good reason not to expect multiple possibilities.  And if there are alternate maths that are kinda-self-reinforcing, I hope we find them.[7]

  1. ^

    By a “VNM agent,” I mean an entity with a fixed utility function, that chooses whichever option will get it the most expected utility.  (Stably.  Forever.  Unless something interferes with its physical circuitry.)
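
    For concreteness, here is a toy sketch of that definition in Python. It is purely illustrative; the options, outcomes, probabilities, and utilities below are arbitrary placeholders, not anything from the post.

```python
# A toy "VNM agent": a fixed utility function over outcomes, beliefs about which
# outcomes each option leads to, and a rule of picking the option with the
# highest expected utility. All numbers are arbitrary placeholders.

utility = {"paperclips_made": 1.0, "shut_down": -100.0, "idle": 0.0}

beliefs = {  # beliefs[option] = {outcome: probability}
    "keep_working":   {"paperclips_made": 0.9, "shut_down": 0.1},
    "resist_changes": {"paperclips_made": 0.7, "idle": 0.3},
}

def expected_utility(option: str) -> float:
    return sum(p * utility[outcome] for outcome, p in beliefs[option].items())

best_option = max(beliefs, key=expected_utility)
print(best_option, expected_utility(best_option))
```

    With these made-up numbers the toy agent happens to pick "resist_changes", but that is an artifact of the placeholder values, not a claim about real systems.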

  2. ^

    Or, third option: LLMs might be converging (for reasons other than our expectations) toward some thing X that is not a VNM-agent, but that sometimes resembles it locally.  Many surfaces look like planes if you zoom in (e.g. spheres are locally flat); maybe it's analogously the case that many minds look locally VNM-like.

  3. ^

    Thanks to Zack M Davis for making this picture for me.

  4. ^

    CDT pops out if you assume a creature’s thoughts have no effects except via its actions; UDT if you allow a creature’s algorithm to impact the world directly (e.g. via Omega’s brainscanner) but assume its detailed implementation has no direct effects, e.g. its thoughts do not importantly consume calories.

  5. ^

    I've seen this happen.  Also there are articles claiming related things.  Game theory concepts have spread gradually since ~1930; some argue this had large impacts.

  6. ^

    The proof I’d want, is a demonstration of other mind-shapes that can form attractors.

    It looks to me like lots of people are working on this. (Lots I'm missing also.)

    One maybe-example: economies.  An economy has no fixed utility function (different economic actors, with different goals, gain and lose $ and influence).  It violates the “independence” axiom from VNM, because an actor who cares a lot about some event E may use his money preparing for it, and so have less wealth and influence in non-E worlds, making "what the economy wants if not-E" change when a chance of E is added.  (Concept stolen from Scott Garrabrant.)  But an economy does gain optimization power over time -- it is a kinda-stable, optimizer-y attractor.

    Economies are only a maybe-example, because I don’t know a math for how and why an economy could protect its own integrity (vs invading militaries, vs thieves, and vs rent-seeking forces that would hack its central bank, for example).  (Although city-states sometimes did.)  OTOH, I equally don't know a math for how a VNM-agent could continue to cohere as a mind, avoid "mind cancers" in which bits of its processor get taken over by new goals, etc.  So perhaps the two examples are even.

    I hope we find more varied examples, though, including ones that resonate deeply with "On Green," or with human ethics and caring.  And I don't know if that's possible or not.

  7. ^

    Unfortunately, even if there are other stable-ish shapes for minds to grow up into, those shapes might well kill us when sufficiently powerful.

    I suspect confusions near here have made it more difficult or more political to discuss whether AI will head toward VNM-agency. 



Discuss

Why I'm Moving from Mechanistic to Prosaic Interpretability

December 30, 2024 - 09:35
Published on December 30, 2024 6:35 AM GMT

Tl;dr I've decided to shift my research from mechanistic interpretability to more empirical ("prosaic") interpretability / safety work. Here's why.

All views expressed are my own. 

What really interests me: High-level cognition

I care about understanding how powerful AI systems think internally. I'm drawn to high-level questions ("what are the model's goals / beliefs?") as opposed to low-level mechanics ("how does the model store and use [specific fact]?"). Sure, figuring out how a model does modular addition is cool, but only insofar as those insights and techniques generalise to understanding higher-level reasoning.

Mech interp has been disappointing 

Vis-a-vis answering these high-level conceptual questions, mechanistic interpretability has been disappointing. IOI remains the most interesting circuit we've found in any language model. That's pretty damning. If mechanistic interpretability worked well, we should have already mapped out lots of interesting circuits in open-source 7B models by now. 

The field seems conceptually bottlenecked. We simply can't agree on what 'features' are or how to 'extract' them. I'm also not sure that this conceptual ennui will be resolved anytime soon. 

Doing mech interp research led me to update against it

Some time ago, I was pretty optimistic that things would change quickly. After hearing about sparse feature circuits, I became incredibly convinced that approaches like this would 'finally' allow us to understand language models end to end.

So I committed fully to the nascent SAE bandwagon. At a hackathon, I worked on building a tool for visualizing sparse feature circuits.  When I got the chance, I threw myself into Neel Nanda's MATS 6.0 training phase, where I similarly worked (with the excellent @jacob_drori) on extending sparse feature circuits with MLP transcoders. Overall there were signs of life, but it turned out kind of mid and my main takeaway was 'existing SAEs might not be good enough to tell us anything useful about circuits'. As I continued working on various interp related things, I hit other roadblocks. Concrete example: I tried looking for refusal circuits in Gemma-2b SAEs and largely didn't find anything interesting[1].
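
(For readers who haven't seen one: a sparse autoencoder in this context is roughly the following. This is a generic, minimal PyTorch sketch with placeholder dimensions and sparsity coefficient, not the actual tooling used in any of the work mentioned above.)

```python
# Generic sparse autoencoder (SAE) sketch: reconstruct model activations through
# an overcomplete ReLU bottleneck, with an L1 penalty pushing most latent
# "features" to zero. Dimensions and the l1 coefficient are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_latent: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations: torch.Tensor):
        latents = torch.relu(self.encoder(activations))
        return self.decoder(latents), latents

sae = SparseAutoencoder()
acts = torch.randn(32, 768)          # stand-in for real residual-stream activations
recon, latents = sae(acts)
l1_coeff = 1e-3                      # placeholder sparsity coefficient
loss = ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().sum(dim=-1).mean()
loss.backward()
```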

In hindsight, my object-level findings weren't very good and didn't inform my object-level takes much. On a more meta level, though, I came away thinking more strongly that: Model internals are messy. Really messy in ways we can't simply explain. This means that our best paradigms are at best incomplete and at worst actively misleading.  

"Prosaic Interpretability" 

I'm therefore coining the term "prosaic interpretability" - an approach to understanding model internals that isn't strongly based on a pre-existing theory of neural networks or intelligence[2], but instead aims to build intuitions / dogma from the ground up, based on empirical observation. Concretely, I've been really impressed by work like Owain Evans' research on the Reversal Curse, Two-Hop Curse, and Connecting the Dots[3]. These feel like they're telling us something real, general, and fundamental about how language models think. Despite being primarily empirical, such work is well-formulated conceptually, and  yields gearsy mental models of neural nets, independently of existing paradigms. 

How does this compare to mech interp? Both are fundamentally bottom-up methods to answering top-down questions. But with mech interp it feels like the focus is often too heavily on the method - trying to prove that some paradigm (LRH, SAEs, steering vectors, what have you) is a valid way to approach a problem. With prosaic interp I'd argue that the focus is, instead, on hugging the question tightly, exploring it from multiple angles, considering multiple adjacent questions, and delivering an honest answer. 

Intuition pump: Gene analysis for medicine

Using mechanistic interpretability for AI safety is like trying to cure diseases by understanding every single gene in the human genome. Obviously, when it works, it's incredibly powerful. There are lots of diseases which have been treated in this way. And the big sunk cost can be amortised over lots of potential future applications.

At the same time, there are diseases that continue to elude effective treatment despite our understanding of the human genome. 

Prosaic work is more like testing different treatments to see what actually helps people, and using that to make scientific inferences. Like inoculation being practised long before Robert Koch developed his germ theory of disease. This might not give you the deepest possible understanding, but it often gets results faster. And shorn of understanding, it seems like the only way forward. 

Modern AI Systems will make interpretability difficult

AI systems aren't just transformers anymore - they have all sorts of extra bits bolted on, like scaffolding and tool use and inference-time algorithms and swarm architectures. Mechanistic interpretability is stuck looking at individual transformers and their neurons, while the actual frontier keeps moving. We're studying pieces of systems that are becoming less and less like what's actually being deployed. Each day, the world of 'frontier AI system' continues to expand. The view from the platform of 'transformer circuits' is that of a rapidly receding horizon. 

Prosaic work doesn't have this problem. It's always kept its eyes on the whole system. 

The timing is frustrating

I feel like I'm stepping away just as mechanistic interpretability is about to get really interesting. Practical demos are starting to emerge of interpretability-based alignment being effective and beating baselines. Thanks to startups like Goodfire and Transluce, such techniques may even become real products. Longtime critics are retracting their doubts. And fundamental work continues to charge full steam ahead; developments like MDL SAEs or Matryoshka SAEs could turbocharge SAE-based interpretability. In the near future, we might even be able to train models to be interpretable. All of this adds up to the wider public being bullish on interp, much more so than ever before.  

My personal pessimism is coinciding with a sustained backdrop of broader optimism - and this makes me feel very conflicted about deciding to step away. 

Personal fit

I spent the last 6 months trying to make progress on mechanistic interpretability. I think I'm reasonably competent. But I just didn't get very far. There are many mundane contributing factors to this. Among others: a lack of good mentorship / collaboration opportunities, poor self-management, mediocre research taste. But I think the biggest issue is motivation. 

A hard truth I've learned about myself: I don't like working on "fundamental" mechanistic interpretability methods. I'm not frothing with passion to think about how the compositions of high-dimensional matrices can be made slightly more tractable. It feels too disconnected from the high-level conceptual questions I really care about. And "applied" work feels like it's best left to the hands of domain experts who have deep, hard-won intuitions about the things they are trying to interpret. 

The stuff I get most excited about is red-teaming existing interpretability work. This is (broadly) the subject of both my first NeurIPS paper and my hitherto highest-effort LessWrong piece. I like this work because it's highly conceptual and clarifies subsequent thinking. (Possibly I also just enjoy criticising things.) I'd be open to doing more of this in the future. But red-teaming isn't exclusive to mech interp. 

Overall, I feel like I've given mech interp a fair shot and I should roll the dice on something different. 

Mech interp research that excites me

To be clear, I remain excited about specific research directions within mechanistic interpretability. "Training models to be interpretable" seems robustly good. Here I'm excited by things like gradient routing and mixture of monosemantic experts. If someone figures out how to train SAEs to yield sparse feature circuits that'll also be a big win. "Automating / scaling interpretability" also seems like another robustly good direction, since it leverages improvements in capabilities. I don't have a good read of this space, but things like PatchScopes / SelfIE seem interesting.

Looking forward

I'm not ruling out coming back to mechanistic interpretability. I'll likely continue to keep tabs on the field. And I'll probably always be happy to discuss / critique new research.

But for now, I'm stepping away. I'm starting MATS with Owain Evans in January, and my work there will likely focus on other approaches. I'll keep tabs on the field, but I need to focus on work that better fits my thinking style and research interests.

I'm looking forward to it. 

  1. ^

    In my analysis, base-model SAEs also didn't turn up anything interesting re: refusal features. This has since been validated independently; base-model SAEs do not capture the refusal direction

  2. ^

    This mostly fits with Paul Christiano's definition of prosaic AI alignment.

  3. ^

    To avoid claims of bias, some non-Owain examples are how deep safety training improves alignment robustness, comprehensively analysing grokking, comparing data attribution of factual vs procedural knowledge and investigating latent reasoning in LLMs. Things like 'understanding chain of thought faithfulness' also go in here. 



Discuss

When do experts think human-level AI will be created?

December 30, 2024 - 09:20
Published on December 30, 2024 6:20 AM GMT

This is an article in the featured articles series from AISafety.info. AISafety.info writes AI safety intro content. We'd appreciate any feedback

The most up-to-date version of this article is on our website, along with 300+ other articles on AI existential safety.

On the whole, experts think human-level AI is likely to arrive in your lifetime.

It’s hard to precisely predict the amount of time until human-level AI.[1] Approaches include aggregate predictions, individual predictions, and detailed modeling.

Aggregate predictions:

  • AI Impacts’ 2022 survey of 738 machine learning researchers produced an aggregate forecast of 50% by 2059.
  • As of June 2024, Metaculus[2] has a median forecast of 2031 for “the first general AI system” and a median forecast of 2027 for “weakly general AI”. Both these timeline forecasts have been shortening over time.
  • This website combines predictions from different forecasting platforms into a single (possibly inconsistent) timeline of events.
  • In January 2023, Samotsvety’s forecasters estimated 50% probability of AGI by 2041 with a standard deviation of 9 years.

Individual predictions:

  • In a 2023 discussion, Daniel Kokotajlo, Ajeya Cotra and Ege Erdil shared their timelines to Transformative AI. Their medians were 2027, 2036 and 2073 respectively.
  • Paul Christiano, head of the US AI Safety Institute, estimated in 2023 that there was a 30% chance of transformative AI by 2033.
  • Yoshua Bengio, Turing Award winner, estimated “a 95% confidence interval for the time horizon of superhuman intelligence at 5 to 20 years” in 2023.
  • Geoffrey Hinton, the most cited AI scientist, also predicted 5-20 years in 2023, but his confidence is lower.
  • Shane Legg, co-founder of DeepMind, estimated a probability of 80% within 13 years (before 2037) in 2023.
  • Yann LeCun, Chief AI Scientist at Meta, thinks reaching human-level AI “will take several years if not a decade. [...] But I think the distribution has a long tail: it could take much longer than that.”
  • Leopold Aschenbrenner, an AI researcher formerly at OpenAI, predicted in 2024 that AGI happening around 2027 was strikingly plausible.
  • Connor Leahy, CEO of Conjecture, gave a ballpark prediction in 2022 of a 50% chance of AGI by 2030, 99% by 2100. A 2023 survey of employees at Conjecture found that all of the respondents expected AGI before 2035.
  • Holden Karnofsky, co-founder of GiveWell, estimated in 2021 that there was “more than a 10% chance we'll see transformative AI within 15 years (by 2036); a ~50% chance we'll see it within 40 years (by 2060); and a ~⅔ chance we'll see it this century (by 2100).”

Models:

  • A report by Ajeya Cotra for Open Philanthropy estimated the arrival of transformative AI (TAI) based on “biological anchors”.[3] In the 2020 version of the report, she predicted a 50% chance by 2050, but in light of AI developments over the next two years, she updated her estimate in 2022 to predict a 50% chance by 2040, a decade sooner.
  • Matthew Barnett created a model based on the “direct approach” of extrapolating training loss that as of Q2 2024 outputs a median estimate of transformative AI around 2053[4].

These forecasts are speculative,[5] depend on various assumptions, predict different things (e.g., transformative versus human-level AI), and are subject to selection bias both in the choice of surveys and the choice of participants in each survey.[6] However, they broadly agree that human-level AI is plausible within the lifetimes of most people alive today. What’s more, these forecasts generally seem to have been getting shorter over time.[7]

Further reading

  1. ^

    We concentrate here on human-level AI and similar levels of capacities such as transformative AI, which may be different from AGI. For more info on these terms, see this explainer.

  2. ^

    Metaculus is a platform that aggregates the predictions of many individuals, and has a decent track record at making predictions related to AI.

  3. ^

    The author estimates the number of operations done by biological evolution in the development of human intelligence and argues this should be considered an upper bound on the amount of compute necessary to develop human-level AI.

  4. ^

    Based on the final graph titled “Cumulative probability distribution over TAI”.

  5. ^

    Scott Alexander points out that researchers that appear prescient one year sometimes predict barely better than chance the next year.

  6. ^

    One can expect people with short timelines to be overrepresented in those who study AI safety, as shorter timelines increase the perceived urgency of working on the problem.

  7. ^

    There have been many cases where AI has gone from zero-to-solved. This is a problem; sudden capabilities are scary.



Discuss

2025 Prediction Thread

December 30, 2024 - 04:50
Published on December 30, 2024 1:50 AM GMT

2024 is drawing to a close, which means it's an opportune time to make predictions about 2025. It's also a great time to put probabilities on those predictions, so we can later prove our calibration (or lack thereof).

We just shipped a LessWrong feature to make this easy. Simply highlight a sentence in your comment, and click the crystal-ball icon on the toolbar to turn it into a prediction that everyone (who's logged in) can put probability estimates on. The result will look like this:

 

[Embedded prediction widget]

 

Some more probabilities that seem cool to elicit (basically all about AI, because that's what's on my mind, but it would be great to have some less AI focused predictions from others)[1]:

 

[A series of embedded prediction widgets appears here in the original post; their text is not rendered in this feed.]

  1. ^

    Unless otherwise specified, assume all predictions are about the state of the world at midnight PT, Dec 31st 2025. Also some things won't be perfectly operationalized. Assume that I am going to be judging the resolution using my best judgement.



Discuss

Learn to write well BEFORE you have something worth saying

December 30, 2024 - 02:42
Published on December 29, 2024 11:42 PM GMT

I’ve been reading a lot of trip reports lately. Trip reports are accounts people write about their experiences doing drugs, for the benefit of other people who might do those same drugs. I don’t take illegal drugs myself, but I like learning about other people’s intense experiences, and trip reports are little peeks into the extremes of human consciousness. 

In some of these, people are really trying to communicate the power and revelation they had on a trip. They're trying to share what might be the most meaningful experience of their entire life. 

Here’s another thing: almost all trip reports are kind of mediocre writing.

This is wildly judgmental but I stand by it. Here are some common things you see in them:

  • Focusing on details specific to the situation that don’t matter to the reader. (Lengthy accounting of logistics, who the person was with at what time even when they’re not mentioned again, etc.)
  • Sort of basic descriptions of phenomena and emotions: “I was very scared”. “I couldn’t stop thinking about it.”
  • Cliches: “I was glad to be alive.” “It felt like I was in hell.” “It was an epic struggle.”
  • Insights described in sort of classically-high-sounding abstractions. “I realized that the universe is made of love.” “Everything was nothing and time didn’t exist.” These statements are not explained, even if they clearly still mean a lot to the writer, and do not really communicate the force of whatever was going on there.

It's not, like, a crime to write a mediocre trip report. It's not necessarily even a problem. They’re not necessarily trying to convince you of anything. A lot of them are just what it says on the tin: recording some stuff that happened. I can’t criticize these for being bland, because that seems like trying to critique a cookbook for being insufficiently whimsical: they’re just sharing information.

(...Though you can still take that as a personal challenge; “is this the best prose it can be?” For instance, How to Cook and Eat in Chinese by Chao Yang Buwei is a really well-written cookbook with a whimsical-yet-practical style. There’s always room to grow.)

But some of these trip reports very much do have an agenda, like “communicating crucial insights received from machine elves” or “convincing you not to take drug X because it will ruin your life”. In these cases, the goal would be better served if the writing were good, and boy howdy, my friends: the writing is not good.

Which is a little counter-intuitive, right? You’d think these intense and mind-blowing experiences would automatically give you rich psychic grist for sharing with others, but it turns out, no, accounts of the sublime and life-altering can still be astonishingly mid.

Now certain readers may be thinking, not unreasonably, “that’s because drug-induced revelations aren’t real revelations. The drug's effects make some thoughts feel important – a trip report can’t explain why a particular 'realization' is important, because there’s nothing behind it.”

But you know who has something new and important to say AND knows why it’s important? Academic researchers publishing their latest work.

But alas, academic writing is also, too frequently, not good. 

And if good ideas made for good writing, you’d expect scientific literature to be the prime case for it. Academic scientists are experts: they know why they made all the decisions they did, they know what the steps do, they know why their findings are important. But that’s also not enough.

Ignore academic publishing and the scientific process itself; let’s just look at the writing. It’s very dense, denser than it needs to be. It does not start with simple ideas and build up; it’s practically designed to tax the reader. It’s just boring, it’s not pleasant to read. The rationale behind specific methods or statistical tests isn’t explained. (See The Journal of Actually Well-Written Science by Etienne Fortier-Dubois for more critique of the standard scientific style.) There’s a whole career field of explaining academic studies to laypeople, which is also, famously, often misleading and bad.

This is true for a few reasons:

First, there’s a floor of how “approachable” or “easy” you can make technical topics. A lot of jargon serves useful purposes, and what’s the point in a field of expertise if you can’t assume your reader is caught up on at least the basics? A description of synthesizing alkylated estradiol derivatives, or a study on the genome replication method of a particular virus, is simply very difficult to make layperson-accessible.

Second, academic publishing and the scientific edifice as it currently stands encourage uniformity of many aspects of research output, including style and structure. Some places like Seeds of Science are pushing back on this, but they’re in the minority.

But third, and this is what trips up the trip-reporters and the scientists alike, writing well is hard. Explaining complicated or abstract or powerful ideas is really difficult. Just having the insight isn’t enough - you have to communicate it well, and that is its own, separate skill.

I don’t really believe in esoterica or the innately unexplainable. “One day,” wrote Jack Kerouac, “I will find the right words, and they will be simple.” Better communication is possible. There are great descriptions of being zonked out of one’s gourd and there is great, informative, readable science writing.

So here’s my suggestion: Learn to write well before you have something you really need to tell people about. Practice it on its own. Write early and often. Write a variety of different things and borrow techniques from writing you like. And once you have a message you actually need to share, you'll actually be able to express it.

(A more thorough discussion of how to actually write well is beyond the scope of this blog post – my point here is just that it’s worth improving. If you’re interested, let me know and I might do a follow-up.)

Thank you Kelardry for reviewing a draft of this post.

Support Eukaryote Writes Blog on Patreon.

Crossposted to: EukaryoteWritesBlog.com, Substack, LessWrong



Discuss
