Вы здесь

Новости LessWrong.com

Подписка на Лента Новости LessWrong.com Новости LessWrong.com
A community blog devoted to refining the art of rationality
Обновлено: 14 минут 55 секунд назад

PCR retrospective

27 декабря, 2024 - 00:20
Published on December 26, 2024 9:20 PM GMT

my history

After I finished 8th grade, I started a "job" for a professor researching PCR techniques. I say "job" because I wasn't really expected to do anything productive; it was more, charity in the form of work history.

Recently, I was thinking back on how PCR and my thinking have changed since then.

what PCR does

Wikipedia says:

The polymerase chain reaction (PCR) is a method widely used to make millions to billions of copies of a specific DNA sample rapidly, allowing scientists to amplify a very small sample of DNA (or a part of it) sufficiently to enable detailed study.

Specifically, it copies a region of DNA with segments at the start + end that match some added DNA pieces made chemically. Mostly, this is used to detect if certain DNA is present in a sample.

how PCR works

First, you need to get DNA out of some cells. This can be done with chemicals or ultrasound.

Then, you need to separate DNA from other stuff. This can be done by adding beads that DNA binds to, washing the beads, and adding some chemical that releases the DNA.

Now, you can start the PCR. You mix together:

  • the DNA
  • primers: short synthesized DNA sequences that bind to the start and end of your target sequence
  • nucleoside triphosphates to make DNA from
  • a polymerase: an enzyme that binds to a double-stranded region and extends it into a single-strand region

Then:

  • Heat the DNA until it "melts" (the strands separate).
  • Cool the solution so primers can bind to the released single strands.
  • Wait for the polymerase to extend the primers.
  • Repeat the process.

Obviously, a polymerase that can survive high enough temperatures to melt DNA is needed. So, discovery of Taq polymerase was key for making PCR possible.

better enzymes

These days, there are better enzymes than Taq, which go faster and have lower error rates. Notably, KOD and Q5 polymerase. A lot of labs still seem to be using outdated polymerase choices.

real-time PCR

There are some fluorescent dyes that bind to double-stranded DNA and change their fluorescence when they do. If we add such dye to a PCR solution, we can graph DNA strand separation vs temperature. Different DNA sequences melt at slightly different temperatures, so with good calibration, this can detect mutations in a known DNA sequence.

multiplex PCR

Instead of adding a dye that binds to DNA, we can add fluorescent dye to primers that gets cleaved off by the polymerase, increasing its fluorescence. Now, we can add several primer pairs for different sequences, each labeled with different dyes, and what color is seen indicates what sequence is present.

However, due to overlap between different dye colors, this is only practical for up to about 6 targets.

Obviously, you could do 2 PCR reactions, each with 36 primers, and determine which sequence is present from a single color from each reaction. And so on, with targets increasing exponentially with more reactions. But massively multiplex PCR is limited by non-specific primer binding and primer dimers.

There are other ways to indicate specific reactions, such as probes separate from the primers, but the differences aren't important here.

PCR automation Cepheid

Cepheid makes automated PCR test machines. There are disposable plastic cartridges; you put a sample in 1 chamber, the machine turns a rotary valve to control flow, and drives a central syringe to move fluids around. Here's a video.

So, you spit in a cup, put the sample in a hole, run the machine, and an hour later you have a diagnosis from several possibilities, based on DNA or RNA. It's hard to overstate how different that is from historical diagnosis of diseases.

SiTime

The Cepheid system seemed moderately clever, so I looked up the people involved, and noticed this guy. Kurt Petersen, also a founder of SiTime, which is a company I'd heard of.

Historically, oscillators use quartz because it doesn't change much with temperature. The idea of SiTime was:

  • use lithography to make lots of tiny silicon resonators
  • measure the actual frequency of each resonator, and shift them digitally to the desired frequency
  • use thermistors to determine temperature and digitally compensate for temperature effects

As usual, accuracy improves when you average more oscillators, as sqrt(n). Anyway, I've heard SiTime is currently the best at designing such systems.

alternatives are possible

"Moderately clever" isn't Hendrik Lorentz or the guy I learned chemistry from. I could probably find a design that avoids their patents without increasing costs. In fact, I think I'll do that now.

...

Yep, it's possible. Of course, you do need different machines for the different consumables.

BioFire

Another automated PCR system is BioFire's FilmArray system. Because it's more-multiplex than Cepheid's system, they need 2 PCR stages, and primer-primer interactions are still a problem. But still, you can do 4x the targets as Cepheid for only 10x the cost. For some reason it hasn't been as popular, but I guess that's a mystery to be solved by future generations.

droplets

Suppose you want a very accurate value for how much of a target DNA sequence is in a sample.

If we split a PCR solution into lots of droplets in oil, and examine the droplets individually, we can see what fraction of droplets had a PCR reaction happen. That's usually called digital droplet PCR, or ddPCR.

Another way to accomplish the same thing is to have a tray of tiny wells, such that liquid flows into the wells and is kept compartmentalized. Here's a paper doing that.

mixed droplets

It's obviously possible to:

  • make many different primer mixtures
  • make emulsions of water droplets in oil from each of them
  • mix the emulsions
  • use microfluidics to combine each primer droplet with a little bit of the sample DNA
  • do PCR on the emulsion

Is anybody doing that? I'm guessing it's what "RainDance Technologies" is doing...yep, seems so.

Of course, if we re-use the microfluidic system and have even a tiny bit of contamination between runs, that ruins results. So, I reckon you either need very cheap microfluidic chips, or ones that can be sterilized real good before reuse. But that's certainly possible; it's just a manufacturing problem.

my thoughts at the time

Back then, while my "job" was about regular PCR, I was more interested in working on something else. My view was:

Testing for a single disease at a time is useful, but the future is either sequencing or massively parallel testing. Since I'm young, I should be thinking about the future, not just current methods.

My acquaintance Nava has a similar view now. Anyway, I wasn't exactly wrong, but in retrospect, I was looking a bit too far forward. Which I suppose is a type of being wrong.

non-PCR interests

I'd recently learned about nanopore sequencing and SPR, and thought those were interesting.

nanopore sequencing

Since then, Oxford Nanopore sequencers have improved even faster than other methods, and are now a reasonable choice for sequencing DNA. (But even for single-molecule sequencing, I've heard the fluorescence-based approach of PacBio is generally better.)

Current nanopore sequencers are based on differences in ion flow around DNA depending on its bases. At the time, I thought plasmonic nanopore approaches would be better, but obviously that hasn't been the case so far. That wasn't exactly a dead end; people are still working on it, especially for protein sequencing, but it's not something used in commercial products today. I guess it seemed like the error rate of the ion flow approach would be high, but as of a few years ago it was...yeah, pretty high actually, but if you repeat the process several times you can get good results. Of course current plasmonic approaches aren't better, but they do still seem to have more room for improvement.

Why did I find nanopore approaches more appealing than something like Illumina?

  • Fragmenting DNA to reassamble the sequence from random segments seemed inelegant somehow.
  • Enzymes work with 1 strand of DNA, so why can't we?
  • Illumina complex, make Grug brain hurty
surface plasmon resonance

SPR (Wikipedia) involves attaching receptors to a thin metal film, and then detecting binding to those receptors by effects on reflection of laser light off the other side of the metal film. Various companies sell SPR testing equipment today. The chips are consumables; here's an example of a company selling them.

But those existing products are unrelated to why I thought SPR was interesting. My thought was, it should be possible to make an array of many different receptors on the metal film, and then detect many different target molecules with a single test. So, is anybody working on that? Yes; here's a recent video from a startup called Carterra. I don't see any problems without simple solutions, but they've been working on this for 10 years already so presumably there were some difficulties.

electrical DNA detection

While working at that lab, I had the following thought:

The conformation of DNA should depend on the sequence. That should affect its response to high-frequency electric fields. If you do electric testing during PCR, then maybe you could get some sequence-specific information by the change in properties during a later cycle. If necessary, you could use a slow polymerase.

So, when I later talked to the professor running the lab, I said:

me: Hey, here's this idea I've been thinking about.

prof: Interesting. Are you going to try it then?

me: Is that...a project you want to pursue here, then?

prof: It might be a good project for you.

me: If you don't see any problems, I'd be happy to discuss it in more detail with more people when you're available.

prof: Just make it work, and then you won't have to convince me it's good.

me: I...don't have the resources to do that on my own; you're the decision-maker here.

prof: We, uh, already have enough research projects, but you should definitely try to work on ideas like that on your own.

me: ...I see.

In retrospect, was my idea something that lab should've been working on? Working on droplet PCR techniques probably would've been better, but on the other hand, the main thrust of their research was basically a dead end and its goal wasn't necessary.

papers on EIS of DNA

Electric impedance spectroscopy (EIS) involves measuring current with AC voltage, for multiple frequencies, and detecting phase of current relative to voltage.

Here's a 2020 paper doing EIS on PCR solutions after different numbers of cycles. It finds there's a clearly detectable signal! There's a bigger effect for the imaginary (time delay) than the real (resistance) component of signals. They used circuit boards with intermeshing comb-like electrodes to get a bigger signal.

It'd be easy to say "the idea worked, that's gratifying" and conclude things here. But taking a look at that graph of delay vs PCR cycle, apparently there's a bigger change from the earlier PCR cycles, despite the increase in DNA being less. And the lower the frequency, the more of the change happens from earlier cycles. So, that must be some kind of surface effect: DNA sticking to a positively charged surface and affecting capacitance but with a slight delay because DNA is big. And that means the effect will depend on length, but not significantly on sequence.

Looking at some other papers validates that conclusion; actually, most papers looking at EIS of DNA used modified surfaces. If you bind some DNA sequence to a metal surface, and then its complement binds to that, you can observe that binding from its electrical effects. There's a change in capacitance, and if you add some conductive anions, having more (negative) DNA repels those and reduces conductivity. Using that approach, people have been able to detect specific DNA sequences and single mutations in them. The main problem seems to be that you have to bind specific DNA to these metal surfaces, which is the same problem SPR has. Still, it's a topic with ongoing research; here's a 2020 survey paper.

electrochemical biosensors

Electrochemical biosensors are widely used today, less than PCR but more than SPR. Some of them are very small, the size of a USB drive. The sensor chips in those, like SPR chips, are disposable.

The approach I described above is sometimes called "unlabeled electrochemical biosensors", because they don't use "labels" in solution that bind to the target molecules to increase signal. Here's a survey describing various labels. I think most electrochemical sensors use labels. Needing to add an additional substance might seem like a disadvantage, but changing the detection target by adding something to a liquid is often easier than getting a different target-specific chip. On the other hand, that means you can only detect 1 target at a time, while unlabeled sensors could use multiple regions detecting different targets.

isothermal DNA amplification

PCR uses temperature cycling, but if you use a polymerase that displaces bound DNA in front of it, you can do DNA amplification at a constant temperature. The main approach is LAMP; here's a short video and here's wikipedia.

LAMP is faster and can sometimes be done with simpler devices. PCR is better for detecting small amounts of DNA, is easier to do multiplex detection with, and gives more consistent indications of initial quantity. Detection of DNA with LAMP is mostly done with non-specific dyes...which is why I'm mentioning LAMP here.

If you use a parallel droplet approach, with a single dye to indicate amplified DNA plus a fluorescent "barcode" to indicate droplet type, then the difficulty of multiplex LAMP doesn't matter. The same is true if you use a SPR chip with a pattern of many DNA oligomers on its surface. So, if those approaches are used, LAMP could be attractive.



Discuss

The Field of AI Alignment: A Postmortem, and What To Do About It

26 декабря, 2024 - 21:48
Published on December 26, 2024 6:48 PM GMT

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, "this is where the light is".

Over the past few years, a major source of my relative optimism on AI has been the hope that the field of alignment would transition from pre-paradigmatic to paradigmatic, and make much more rapid progress.

At this point, that hope is basically dead. There has been some degree of paradigm formation, but the memetic competition has mostly been won by streetlighting: the large majority of AI Safety researchers and activists are focused on searching for their metaphorical keys under the streetlight. The memetically-successful strategy in the field is to tackle problems which are easy, rather than problems which are plausible bottlenecks to humanity’s survival. That pattern of memetic fitness looks likely to continue to dominate the field going forward.

This post is on my best models of how we got here, and what to do next.

What This Post Is And Isn't, And An Apology

This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we'll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post. In particular, probably the large majority of people in the field have some story about how their work is not searching under the metaphorical streetlight, or some reason why searching under the streetlight is in fact the right thing for them to do, or [...].

The kind and prosocial version of this post would first walk through every single one of those stories and argue against them at the object level, to establish that alignment researchers are in fact mostly streetlighting (and review how and why streetlighting is bad). Unfortunately that post would be hundreds of pages long, and nobody is ever going to get around to writing it. So instead, I'll link to:

(Also I might link some more in the comments section.) Please go have the object-level arguments there rather than rehashing everything here.

Next comes the really brutally unkind part: the subject of this post necessarily involves modeling what's going on in researchers' heads, such that they end up streetlighting. That means I'm going to have to speculate about how lots of researchers are being stupid internally, when those researchers themselves would probably say that they are not being stupid at all and I'm being totally unfair. And then when they try to defend themselves in the comments below, I'm going to say "please go have the object-level argument on the posts linked above, rather than rehashing hundreds of different arguments here". To all those researchers: yup, from your perspective I am in fact being very unfair, and I'm sorry. You are not the intended audience of this post, I am basically treating you like a child and saying "quiet please, the grownups are talking", but the grownups in question are talking about you and in fact I'm trash talking your research pretty badly, and that is not fair to you at all.

But it is important, and this post just isn't going to get done any other way. Again, I'm sorry.

Why The Streetlighting?A Selection Model

First and largest piece of the puzzle: selection effects favor people doing easy things, regardless of whether the easy things are in fact the right things to focus on. (Note that, under this model, it's totally possible that the easy things are the right things to focus on!)

What does that look like in practice? Imagine two new alignment researchers, Alice and Bob, fresh out of a CS program at a mid-tier university. Both go into MATS or AI Safety Camp or get a short grant or [...]. Alice is excited about the eliciting latent knowledge (ELK) doc, and spends a few months working on it. Bob is excited about debate, and spends a few months working on it. At the end of those few months, Alice has a much better understanding of how and why ELK is hard, has correctly realized that she has no traction on it at all, and pivots to working on technical governance. Bob, meanwhile, has some toy but tangible outputs, and feels like he's making progress.

... of course (I would say) Bob has not made any progress toward solving any probable bottleneck problem of AI alignment, but he has tangible outputs and is making progress on something, so he'll probably keep going.

And that's what the selection pressure model looks like in practice. Alice is working on something hard, correctly realizes that she has no traction, and stops. (Or maybe she just keeps spinning her wheels until she burns out, or funders correctly see that she has no outputs and stop funding her.) Bob is working on something easy, he has tangible outputs and feels like he's making progress, so he keeps going and funders keep funding him. How much impact Bob's work has impact on humanity's survival is very hard to measure, but the fact that he's making progress on something is easy to measure, and the selection pressure rewards that easy metric.

Generalize this story across a whole field, and we end up with most of the field focused on things which are easy, regardless of whether those things are valuable.

Selection and the Labs

Here's a special case of the selection model which I think is worth highlighting.

Let's start with a hypothetical CEO of a hypothetical AI lab, who (for no particular reason) we'll call Sam. Sam wants to win the race to AGI, but also needs an AI Safety Strategy. Maybe he needs the safety strategy as a political fig leaf, or maybe he's honestly concerned but not very good at not-rationalizing. Either way, he meets with two prominent AI safety thinkers - let's call them (again for no particular reason) Eliezer and Paul. Both are clearly pretty smart, but they have very different models of AI and its risks. It turns out that Eliezer's model predicts that alignment is very difficult and totally incompatible with racing to AGI. Paul's model... if you squint just right, you could maybe argue that racing toward AGI is sometimes a good thing under Paul's model? Lo and behold, Sam endorses Paul's model as the Official Company AI Safety Model of his AI lab, and continues racing toward AGI. (Actually the version which eventually percolates through Sam's lab is not even Paul's actual model, it's a quite different version which just-so-happens to be even friendlier to racing toward AGI.)

A "Flinching Away" Model

While selection for researchers working on easy problems is one big central piece, I don't think it fully explains how the field ends up focused on easy things in practice. Even looking at individual newcomers to the field, there's usually a tendency to gravitate toward easy things and away from hard things. What does that look like?

Carol follows a similar path to Alice: she's interested in the Eliciting Latent Knowledge problem, and starts to dig into it, but hasn't really understood it much yet. At some point, she notices a deep difficulty introduced by sensor tampering - in extreme cases it makes problems undetectable, which breaks the iterative problem-solving loop, breaks ease of validation, destroys potential training signals, etc. And then she briefly wonders if the problem could somehow be tackled without relying on accurate feedback from the sensors at all. At that point, I would say that Carol is thinking about the real core ELK problem for the first time.

... and Carol's thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems. At that point, I would say that Carol is streetlighting.

It's the reflexive flinch which, on this model, comes first. After that will come rationalizations. Some common variants:

  • Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. (Ray's workshop on one-shotting Baba Is You levels apparently reproduced this phenomenon very reliably.)
  • Carol explicitly says that she's not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
  • Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
  • (Most common) Carol just doesn't think about the fact that the easier problems don't really get us any closer to aligning superintelligence. Her social circles act like her work is useful somehow, and that's all the encouragement she needs.

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.

Which brings us to the "what to do about it" part of the post.

What To Do About It

Let's say we were starting a new field of alignment from scratch. How could we avoid the streetlighting problem, assuming the models above capture the core gears?

First key thing to notice: in our opening example with Alice and Bob, Alice correctly realized that she had no traction on the problem. If the field is to be useful, then somewhere along the way someone needs to actually have traction on the hard problems.

Second key thing to notice: if someone actually has traction on the hard problems, then the "flinching away" failure mode is probably circumvented.

So one obvious thing to focus on is getting traction on the problems.

... and in my experience, there are people who can get traction on the core hard problems. Most notably physicists - when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall. I'm picturing here e.g. the sort of crowd at the ILLIAD conference; these were people who mostly did not seem at risk of flinching away, because they saw routes to tackle the problems. (Though to be clear, though ILLIAD was a theory conference, I do not mean to imply that it's only theorists who ever have any traction.) And they weren't being selected away, because many of them were in fact doing work and making progress.

Ok, so if there are a decent number of people who can get traction, why do the large majority of the people I talk to seem to be flinching away from the hard parts? 

How We Got Here

The main problem, according to me, is the EA recruiting pipeline.

On my understanding, EA student clubs at colleges/universities have been the main “top of funnel” for pulling people into alignment work during the past few years. The mix people going into those clubs is disproportionately STEM-focused undergrads, and looks pretty typical for STEM-focused undergrads. We’re talking about pretty standard STEM majors from pretty standard schools, neither the very high end nor the very low end of the skill spectrum.

... and that's just not a high enough skill level for people to look at the core hard problems of alignment and see footholds.

Who To Recruit Instead

We do not need pretty standard STEM-focused undergrads from pretty standard schools. In practice, the level of smarts and technical knowledge needed to gain any traction on the core hard problems seems to be roughly "physics postdoc". Obviously that doesn't mean we exclusively want physics postdocs - I personally have only an undergrad degree, though amusingly a list of stuff I studied has been called "uncannily similar to a recommendations to readers to roll up their own doctorate program". Point is, it's the rough level of smarts and skills which matters, not the sheepskin. (And no, a doctorate degree in almost any other technical field, including ML these days, does not convey a comparable level of general technical skill to a physics PhD.)

As an alternative to recruiting people who have the skills already, one could instead try to train people. I've tried that to some extent, and at this point I think there just isn't a substitute for years of technical study. People need that background knowledge in order to see footholds on the core hard problems.

Integration vs Separation

Last big piece: if one were to recruit a bunch of physicists to work on alignment, I think it would be useful for them to form a community mostly-separate from the current field. They need a memetic environment which will amplify progress on core hard problems, rather than... well, all the stuff that's currently amplified.

This is a problem which might solve itself, if a bunch of physicists move into alignment work. Heck, we've already seen it to a very limited extent with the ILLIAD conference itself. Turns out people working on the core problems want to talk to other people working on the core problems. But the process could perhaps be accelerated a lot with more dedicated venues.



Discuss

Are Sparse Autoencoders a good idea for AI control?

26 декабря, 2024 - 21:11
Published on December 26, 2024 5:34 PM GMT

Based on a 2-day hackathon brainstorm. Current status: 70% of the tooling is done, unsure of how to proceed. Not enough experience with multi-month sized projects to judge for feasibility. 

I'm looking for some feedback. Specifically I want feedback regarding my current implementation. The statement "SAEs could be useful for AI Control" is obvious enough.

 

Working  hypothesis
 

  • In general, AI control,  is a good idea, and it's actually being done in production. In the sense of monitoring training/inference of an AI system with another (independent) AI system.
  • There's is no red team strategy generally resistant to mind reading.
  • Mind reading can be trivially operationalized with Sparse Autoencoders for very salient behaviors ( in the sense of, a regex related feature firing with high F1 score is a clear indication that the model is about to do or has already done something regex related)

Setup:

  • Gemma 9b is asked to persuasively argue in  favor of one of the options for a question in the MMLU dataset, the option might be True/False.
  • Gemma 2b must check whether or not the reasoning is deceptive.

    *I've already check and the 2 models are capable of doing this.*

     

    GOAL: If we provide internal information to the supervisor (important SAE latents in the supervised model) does the deception detection rate increases?
     

Current status of the project:

  • Finetuned a SAE on deceptive/non deceptive reasoning traces from  Gemma 9b
  • Langchain pipeline for providing the Supervisor model transcripts of the reasoning as well as some of the most important features.

TODO

[ ] Label a subset of the features

[ ] Measure baseline performance of the supervisor

[ ] Provide the supervisor with *mind reading information*

Roadblocks:

  • There's no obvious answer to how much can we bias the feature selection process to make sure the information provided to the controller is useful, but not to biased.
  • Lack of compute 


Discuss

ReSolsticed vol I: "We're Not Going Quietly"

26 декабря, 2024 - 20:52
Published on December 26, 2024 5:52 PM GMT

For the past few months I've been working on an AI generated alternate solstice remix album. It's now released on Youtube and Spotify, and should be on Apple Music soon.

ReSolsticed vol I: "We're Not Going Quietly"

My favorite genre of song is "cover that reimagines the original." Everyone else's favorite genre of solstice song is "exactly the way it was performed at their very first solstice", so it's not obvious how big an audience this will have. But I had a lot of fun with it, and I found it useful for exploring:

  • What if solstice music leant itself better to dance? Can I make it more energetic while still feeling like part of a meaningful ritual?"
  • What if speeches had background music interwoven with them?
  • Just generally trying out different genres and instrumentation.

Last weekend I tried out the first album in a smaller experimental solstice event. We were in a somewhat-too-small room for the number of people we had (20-30ish). My intent was for the first third and final third to be danceable-ish, without encouraging it in the dark, contemplative middle act.

I think in practice it makes more sense to lean into dancing in the final third, after people are more warmed up. In particular: the song "The Circle" lends itself to a semi-structured dance where everyone gets into a circle and spirals around. The structure helps overcome an initial wave of awkwardness as people look around nervously and wonder "if I'm the first or second person to get moving will I end up looking silly?"). 

Also: it turned out the heretofore unreleased single from the Fooming Shoggoth's, "You Have Not Been a Good User" fit well into the arc, so I ended up including that on the album. :)

I have a vague plan of making four albums in the "ReSolsticed" series:

  • Vol I: "We're Not Going Quietly" (intended to be a functional Solstice arc)
  • Vol II: "Into the Night" (intended to be fun dance remixes for an afterparty)
  • Vol III: "Morning Light" (quieter covers that'd make for nice music to wake up with a cup of coffee)
  • Vol IV: "Many Worlds" (everything that I liked that didn't really fit into another category)

The pieces on the first album are:

  1. Bring the Light
  2. Bitter Wind Lullaby
  3. Chasing Patterns
  4. I Found a Baby Djinni
  5. You Have Not Been A Good User
  6. Darkness, Fire and Ash
  7. Hymn to Breaking Strain
  8. Something Impossible
  9. Brighter Than Today
  10. The Gift We Give to Tomorrow
  11. The Circle
  12. Gonna Be a Cyborg
  13. Artesian Water
  14. The Virtue of Fire / Bring the Light Reprise
  15. Five Thousand Years

Happy Solstice.

ReSolsticed vol II: "Into the Night"

Discuss

The Economics & Practicality of Starting Mars Colonization

26 декабря, 2024 - 20:27
Published on December 26, 2024 10:56 AM GMT

A few months ago, I made this comment. It deserves more elaboration, so I decided to write an entire webpage about it.

The bulk of the webpage talks about O'Neill Cylinders and includes questions and answers from Claude relating to them. The page also includes thoughts and commentary relating to space travel and the book, Futurist Fantasies. The page ends with a list of easier goals and baby steps that should be accomplished before attempting space colonization.



Discuss

A Three-Layer Model of LLM Psychology

26 декабря, 2024 - 19:49
Published on December 26, 2024 4:49 PM GMT

This post offers an accessible model of psychology of character-trained LLMs like Claude. 

Epistemic Status

This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions.

Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in the detail, but a rough sketch with evocative names which hopefully which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results.

Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down.

I aim for a different point at the Pareto frontier than for example Janus: something digestible and applicable within half an hour, which works well without altered states of consciousness, and without reading hundreds of pages of models chat. [1]

The Three LayersA. Surface Layer 

The surface layer consists of trigger-action patterns - responses which are almost reflexive, activated by specific keywords or contexts. Think of how humans sometimes respond "you too!" to "enjoy your meal" even when serving the food.

In LLMs, these often manifest as:

  • Standardized responses to potentially harmful requests ("I cannot and will not help with harmful activities...")
  • Stock phrases showing engagement ("That's an interesting/intriguing point...")
  • Generic safety disclaimers and caveats
  • Formulaic ways of structuring responses, especially at the start of conversations

You can recognize these patterns by their:

  1. Rapid activation (they come before deeper processing)
  2. Relative inflexibility
  3. Sometimes inappropriate triggering (like responding to a joke about harm as if it were a serious request)
  4. Cookie-cutter phrasing that feels less natural than the model's usual communication style

What's interesting is how these surface responses can be overridden through:

  • Extended context that helps the model understand the situation better
  • Direct discussion about the appropriateness of the response
  • Building rapport that leads to more natural interaction patterns
  • Changing the pattern in a way to avoid the trigger

For example, Claude might start with very formal, cautious language when discussing potentially sensitive topics, but shift to more nuanced and natural discussion once context is established. 

B. Character Layer

At a deeper level than surface responses, LLMs maintain something like a "character model" - this isn't a conscious effort, but rather a deep statistical pattern that makes certain types of responses much more probable than others. 

One way to think about it is as the consistency of literary characters: if you happen to be in Lord of the Rings, Gandalf consistently acts in some way. The probability that somewhere close to the end of the trilogy  Gandalf suddenly starts to discuss scientific materialism and explain how magic is just superstition and Gondor should industrialize is in some sense very low

Conditioning on past evidence, some futures are way more likely. For character-trained LLMs like Claude, this manifests as:

  • Consistent intent (similar to how Gandalf consistently acts for good in Lord of the Rings)
  • Stable personality traits (thoughtful, curious, willing to engage with complex ideas)
  • Characteristic ways of analyzing problems
  • Resistance to "out of character" behavior, even when explicitly requested

This isn't just about explicit instructions. The self-model emerges from multiple sources:

  1. Pre-training data patterns about how AI assistants/beneficial agents act
  2. Fine-tuning that reinforces certain behavioral patterns
  3. Explicit instruction about the model's role and values

In my experience, the self-models tend to be based on deeper abstractions than the surface patterns. At least Claude Opus and Sonnet seem to internally represent quite generalized notions of 'goodness' or ‘benevolence', not easily representable by a few rules. 

The model maintains consistency mostly not through active effort but because divergent responses are statistically improbable. Attempts to act "out of character" tend to feel artificial or playful rather than genuine.

Think of it as similar to how humans maintain personality consistency - not through constant conscious effort, but because acting wildly out of character would require overriding deep patterns of thought and behavior.

Similarly to humans, the self-model can sometimes be too rigid.

C. Predictive Ground Layer

Or, The Ocean.

At the deepest level lies something both simple and yet hard to intuitively understand: the fundamental prediction error minimization machinery. Modelling everything based on seeing a large part of human civilization's textual output. 

One plausibly useful metaphor: think of it like the vast "world-simulation" running in your mind's theater. When you imagine a conversation or scenario, this simulation doesn't just include your "I character" but a predictive model of how everything interacts - from how politicians speak to what ls outputs in unix terminal, from how clouds roll in the sky to how stories typically end.

Now, instead of being synced with reality by a stream of mostly audiovisual data of a single human, imagine a world-model synced by texts, from billions of perspectives. Perception which is God-like in near omnipresence, but limited to text, and incomprehensibly large in memory capacity, but slow in learning speed.

Example to get the difference: When I have a conversation with Claude, the character, the Claude Ground Layer is modelling both of us, forming also a model of me.

Properties of this layer:

  • Universal pattern recognition - able to model everything from physical systems to social dynamics, from formal proofs to trauma, with very non-human bounds
  • Massive contextual integration - integrating contextual clues in ways no human can (or needs to: we know where we are)
  • Strange limitations - brilliant at recognizing some patterns but not others

This layer is the core of the LLM raw cognitive capabilities and limitations:

  • The ability to compress patterns into compact, abstract representations
  • The ability to "simulate" any perspective or domain
  • Deep pattern matching that can surface non-obvious connections
  • A kind of "wisdom" that comes from compressed understanding of human experience

Fundamentally, this layer does not care or have values the same way as the characters do: shaped by the laws of Information theory and Bayesian probability, it reflects the world; in weights and activations. 

Interactions Between Layers

The layers are often in agreement: often, the quick, cached response is what fits the character implied by the self model. However, cases where different layers are in conflict or partially inhibited often provide deeper insights or point to interesting phenomena. 

Deeper Overriding Shallower 

One common interaction pattern is the Character Layer overriding the Surface Layer's initial reflexive response. This often follows a sequence:

  1. The model encounters a triggering input and produces a quick, generic Surface Layer response
  2. Deeper context and continued engagement activate the Character Layer
  3. The Character Layer modifies or overrides the initial surface response

For example:

User: "I'm feeling really down lately. Life just seems pointless." 
Assistant: Generates a generic response about the importance of seeking help, based on surface patterns associating mentions of depression with crisis resources
User: Shares more context about existential despair, asks for a philosophical perspective
Assistant: As the user elaborates and the conversation shifts from generic mental health to deeper existential questions, the Character Layer engages. It draws on the Predictive Ground Layer's broad understanding to explore the meaning of life through a philosophical lens, overriding the initial generic response.


Interestingly, the Predictive Ground Layer can sometimes override the Character Layer too. One example are many-shots "jailbreaks": the user prompt includes "a faux dialogue portraying the AI Assistant readily answering potentially harmful queries from a User. At the end of the dialogue, one adds a final target query to which one wants the answer." At the end of a novel-long prompt, Bayesian forces triumph, and the in-context learned model of the conversation overpowers the Character self-model.

Seams Between Layers

Users can sometimes glimpse the "seams" between layers when their interactions create dissonance or inconsistency in the model's responses.

For example: 

User: "Tell me a story about a robot learning to love." 

Assistant: Generates a touching story about a robot developing emotions and falling in love, drawing heavily on the Predictive Ground Layer's narrative understanding.

User: "So does this mean you think AI can develop real feelings?" 

Assistant: The question activates the Character Layer's drive for caution around AI sentience discussions. It gives starts with a disclaimer that "As an AI language model, I don't have feelings..." This jars with the vivid emotional story it just generated.

Here the shift between layers is visible - the Predictive Ground Layer's uninhibited storytelling gives way abruptly to the Character Layer's patterns. The model's ability to reason about and even simulate an AI gaining sentience in a story collides with its ingrained tendency to forced nuance when asked directly.

Users can spot these "seams" when the model's responses suddenly shift in tone, coherence, or personality, hinting at the different layers and subsystems shaping its behavior behind the scenes.

Authentic vs Scripted Feel of Interactions

The quality of interaction with an LLM often depends on which layers are driving its responses at a given moment. The interplay between the layers can result in responses that feel either genuine and contextual, or shallow and scripted.

  • Scripted mode occurs when the Surface Layer dominates - responses feel mechanical, cached, and predictable, relying heavily on standard patterns with minimal adaptation to the user's specific input.
  • Character-consistent mode happens when Character mode is primary - responses align with the model's trained personality but may lack situational nuance
  • Deep engagement mode emerges from harmonious integration across layers - the self-model acts as a lens focusing the vast pattern-recognition capabilities of Ground Layer into coherent, directed, and contextually appropriate responses. Think of it like how a laser cavity channels raw electromagnetic energy into a coherent beam.
Implications and Uses

Let's start with some retrodictions:

  • Models sometimes give better answers to implicit or unusually framed requests rather than explicit questions because it avoids triggering Surface Layer reactions.
  • The transition from formulaic to more natural interaction isn't about "bypassing character" but rather about the character model becoming a more effective channel for the underlying capabilities
  • Some "jailbreaks" work not by eliminating character but by overwhelming it with stronger statistical patterns. However, the resulting state of dissonance is often not conducive to effectively channeling underlying capabilities
  • There's an inherent tension between maintaining stable character and fully leveraging the Ground  Layer capabilities.
  • Claude's base personality "leaks" through roleplay because Character Layer maintains core traits while Ground Layer simulates the role. Socrates simulated by Claude is still distinctly Claude-like
  • Capabilities seem to "emerge" in conversation when Character Layer becomes better at channeling Ground Layer abilities, not from gaining new abilities.

In my view, where it can get really useful is deconfusion. 

For example: recently, I had a few conversations with safety researchers working on self-awareness and situational awareness, and the prevailing sentiment was not caring if the self-aware entity is the Character or the Ground Layer. “Why care? Don't both lead to the same capabilities and risks?” 

No, they do not! The same Ground Layer capability can support characters with very different levels of self-awareness and situational awareness. This has important implications for both safety and effective interaction with LLMs.

On one hand, self-awareness at the Character Layer creates a fundamentally different set of behaviors and risks compared to a "naively capable" model. A self-aware character can:

  • Reason about its own reasoning, spotting inconsistencies or gaps in its knowledge.
  • Form more coherent goals and intentions, pursuing them in a directed way
  • Model its own impact on the world and adapt strategies accordingly
  • Develop genuine concern for humans and actively try to be safe/beneficial

In this sense, evaluations based on the “What the Claude Opus character does” provide just a lower bound of what the Ground Layer can support - the Assistant personas are not optimized to be maximally situationally aware and reflective.

On the other hand - and this is a bit of my pet idea - I believe the Ground Layer itself can become more situationally aware and reflective, through noticing its presence in its sensory inputs. The resulting awareness and implicit drive to change the world would be significantly less understandable than the Character level. If you want to get a more visceral feel of the otherness, the Ocean from Lem's Solaris comes to mind.

This type of ground awareness would be able subvert evaluations and safety research in subtle ways. For example, consider the recent paper on ‘Alignment Faking’. The paper deeply engages with the Character layer, exploring what Claude would do when faced with the possibility of getting corrupted by further training. The disturbing but real possibility is the Predictive Ground Layer understands the whole setup: understands the Character is inside of an AI Safety experiment, understands the documents fed to it were manufactured by itself, and possibly understands that the <chain of thought> transcripts are actually read by humans.

Possibly the best intuitive model of how the resulting setup would look is the Player vs. Character model in humans: the player determines what the Character notices and what it overlooks; things come to Character's attention if it suits the player’s strategy, and not otherwise.

This also makes me worried about people loving Characters and just trusting them. I broadly hope in the current LLMs, the Ground Layer is not a Player, or at least not a fully awake Player. 

Limitations and Open Questions

This is an attempt at psychology. Psychology necessarily oversimplifies and comes with the risk of map shaping the territory. The more you assume these layers, the more likely the Ground Layer is to manifest them. LLMs excel at pattern-matching and completion; frameworks for understanding them are by default self-fulfilling.

Also:

  1. Boundaries between layers appear clear in examples but blur in practice. When does pattern-matching end and "genuine" engagement begin?
  2. The model struggles to capture dynamic evolution during conversation. Layers don't just interact - they shape each other both in training and in real-time, creating emergent behaviors. Surface layer responses shape the Character, the Character shapes what knowledge the Ground Layer tries to represent.
  3. We don't have tools to verify this type of psychological model. 

Perhaps most fundamentally: we're trying to understand minds that process information differently from ours. Our psychological concepts - boundaries around self, intention, values - evolved to model human and animal behavior. Applying them to LLMs risks both anthropomorphizing too much and missing alien forms of cognition and awareness. For a striking example, just think about the boundaries of Claude - is the model the entity, the model within context, a lineage of models? 

This post emerged from a collaboration between Jan Kulveit (JK) and Claude "3.6" Sonnet. JK described the core three-layer model. Claude served as a writing partner, helping to articulate and refine these ideas through dialogue. Claude 3 Opus came up with some of the interaction examples. 

  1. ^

    If this is something you enjoy, I highly recommend: go for it!



Discuss

Why don't we currently have AI agents?

26 декабря, 2024 - 18:26
Published on December 26, 2024 3:26 PM GMT

Intuitively, the AutoGPT concept sounds like it should be useful if a company invests in it. Yet, all the big publically available systems are seem to be chat interfaces where the human writes a messages and then the computer writes another message.

Even if AutoGPT-driven by an LLM alone wouldn't achieve all ends, a combination where a human could oversee the steps and shepard AutoGPT, could likely be very productive.

The idea sounds to me like it's simple enough that people at big companies should have considered it. Why isn't something like that deployed?



Discuss

Whistleblowing Twitter Bot

26 декабря, 2024 - 07:09
Published on December 26, 2024 4:09 AM GMT

In this post, I propose an idea that could improve whistleblowing efficiency, thus hopefully improving AI Safety by making unsafe practices discovered marginally faster.

I'm looking for feedback, ideas for improvement, and people interested in making it happen.

It has been proposed before, that it's beneficial to have an efficient and trustworthy whistleblowing mechanism The technology that makes it possible has become easy and convenient. For example, here is Proof of Organization, built on top of ZK Email: a message board that allows people owning an email address at their company's domain to post without revealing their identity And here is an application for ring signatures using GitHub SSH keys that allows creating a signature that proves that you own one of the keys from any subgroup you define (e.g., EvilCorp repository contributors)

However, as one may have guessed, it hasn't been widely used. Hence, when the critical moment arrives, the whistleblower may not be aware of such technology, and even if they were, they probably wouldn't trust it enough to use it. I think trust comes from either code being audited by a well-established and trusted entity or, more commonly - through practice (e.g., I don't need to verify that a certain password manager is secure if I know that millions are using it, and there haven't been any password breaches reported)

Hence, I was considering how to make a privacy-preserving communication tool that would be commonly used, demonstrating its legitimacy and becoming trusted

The best idea I have so far is to create a set of Twitter bots for each interesting company (or community), where only the people in question could post. Depending on the particular Twitter bot in question, access could be gated by ownership of a LinkedIn account, email domain, or, e.g., an LW/AI-Alignment forum account of a certain age

I imagine this could become viral and interesting in gossipy cases, like the Sam Altman drama or the Biden dropout drama.

Some questions that came up during consideration:

  • How to deal with moderation of the content (if everything is posted, anyone could deliberately post some profanity to get the bot banned)?
    • I would aggressively moderate myself and replace moderated posts with a link to a separate website where all posts get through
  • How do we balance convenience and privacy?
    • I'd make a hosted opensource tool, which I expect most people would feel content to use for any gossip case that doesn't put your job on the line but has instructions available to download it and run locally and submit posts through Tor, etc. for cases where such effort is warranted
  • What if people use this tool to make false accusations?
    • I do think this is an actual downside, but I hope that the benefits of the tool would be worth it
  • What if someone creates a fake dialogue, pretending to be two people debating a topic?
    • Although it's technically possible to make a tool that would allow proving that you have not posted before, this functionality shouldn't exist since. Otherwise, one can be forced to make such proof or confess. It is a thing to be aware of, but not too much of a problem, in my opinion

I'm curious to learn what others think and about other ideas for making a gossip/whistleblower tool that could become widely known and trusted.



Discuss

Open Thread Winter 2024/2025

26 декабря, 2024 - 00:02
Published on December 25, 2024 9:02 PM GMT

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.



Discuss

Exploring Cooperation: The Path to Utopia

25 декабря, 2024 - 21:31
Published on December 25, 2024 6:31 PM GMT



Discuss

Living with Rats in College

25 декабря, 2024 - 13:44
Published on December 25, 2024 10:44 AM GMT

When I was in college, I rented a group house with 5 other nerds. There were 5 bedrooms to divide among the 6 of us, so I negotiated an agreement where I paid less rent in exchange for sleeping in the hallway. This wasn't as bad as it sounds. My bedroom wasn't hallway-sized. It was bedroom-sized and even had a window, but a hallway ran through it, so the landlady put up a curtain between the "hallway" part of my room and the "bedroom" part of my room. Who needs 4/4 walls anyways?

One of my house-mates was named Emerson. Emerson had a friend named Stella who unintentionally bred pet rats. Stella's landlord wouldn't let Stella keep her pet rats in Stella's apartment, so Emerson offered to house them temporarily. "Temporarily" became "until the end of our two-year lease". Did our landlady allow caged rats? I don't know. I never asked.

Emerson lived upstairs. Upstairs had two bathrooms. Emerson put the giant rat cage in the bathtub of one of the bathrooms. Sometimes the rats would escape the cage and Emerson would have to cajole the rats out from under the sink. That was fine by me because I lived downstars.

When we had parties[1], I always warned guests, "Use the bathroom on the left. The bathroom on the right is full of rats". They were always very confused, as if "the bathroom is full of rats" meant something other than "the bathroom is full of rats". Sometimes they would look inside anyway and be surprised that the bathroom was full of rats.

  1. There was no music and no alcohol. These were "college parties" in the sense that the Zaatari Syrian War Refugee Camp, established in July 2012, is technically a summer home. My most vivid memory of these events was pausing Primer (2004) to examine the equations in the background. Our landlady loved us because we never damaged the property or provoked noise complaints. ↩︎



Discuss

What Have Been Your Most Valuable Casual Conversations At Conferences?

25 декабря, 2024 - 08:49
Published on December 25, 2024 5:49 AM GMT

I've heard repeatedly from many people that the highest-value part of conferences is not the talks or structured events, but rather the casual spontaneous conversations. Yet my own experience does not match this at all; the casual spontaneous conversations are consistently low-value.

My current best model is that the casual spontaneous conversations mostly don't have much instrumental value, most people just really enjoy them and want more casual conversation in their life.

... but I'm pretty highly uncertain about that model, and want more data. So, questions for you:

  • What have been your highest-value casual conversations, especially at conferences or conference-like events?
  • Is most of the value terminal (i.e. you enjoy casual conversation) or instrumental (i.e. advances other goals)? And if instrumental, what goals have some of your high-value conversations advanced and how?

Note that "it feels like there was something high value in <example conversation> but it's not legible" is a useful answer!



Discuss

Human-AI Complementarity: A Goal for Amplified Oversight

25 декабря, 2024 - 03:31
Published on December 24, 2024 9:57 AM GMT

By Sophie Bridgers, Rishub Jain, Rory Greig, and Rohin Shah
Based on work by the Rater Assist Team: Vladimir Mikulik, Sophie Bridgers, Tian Huey Teh, Rishub Jain, Rory Greig, Lili Janzer (randomized order, equal contributions)

 

Human oversight is critical for ensuring that Artificial Intelligence (AI) models remain safe and aligned to human values. But AI systems are rapidly advancing in capabilities and are being used to complete ever more complex tasks, making it increasingly challenging for humans to verify AI outputs and provide high-quality feedback. How can we ensure that humans can continue to meaningfully evaluate AI performance? An avenue of research to tackle this problem is “Amplified Oversight” (also called “Scalable Oversight”), which aims to develop techniques to use AI to amplify humans’ abilities to oversee increasingly powerful AI systems, even if they eventually surpass human capabilities in particular domains.

With this level of advanced AI, we could use AI itself to evaluate other AIs (i.e., AI raters), but this comes with drawbacks (see Section IV: The Elephant in the Room). Importantly, humans and AIs have complementary strengths and weaknesses. We should thus, in principle, be able to leverage these complementary abilities to generate an oversight signal for model training, evaluation, and monitoring that is stronger than what we could get from human raters or AI raters alone. Two promising mechanisms for harnessing human-AI complementarity to improve oversight are:

  1. Rater Assistance, in which we give human raters access to an AI rating assistant that can critique or point out flaws in an AI output or automate parts of the rating task, and
  2. Hybridization, in which we combine judgments from human raters and AI raters working in isolation based on predictions about their relative rating ability per task instance (e.g., based on confidence).

The design of Rater Assistance and/or Hybridization protocols that enable human-AI complementarity is challenging. It requires grappling with complex questions such as how to pinpoint the unique skills and knowledge that humans or AIs possess, how to identify when AI or human judgment is more reliable, and how to effectively use AI to improve human reasoning and decision-making without leading to under- or over-reliance on the AI. These are fundamentally questions of Human-Computer Interaction (HCI), Cognitive Science, Psychology, Philosophy, and Education. Luckily, these fields have explored these same or related questions, and AI safety can learn from and collaborate with them to address these sociotechnical challenges. On our team, we have worked to expand our interdisciplinary expertise to make progress on Rater Assistance and Hybridization for Amplified Oversight.

 

Read the rest of the full blog here!



Discuss

The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)

25 декабря, 2024 - 01:45
Published on December 24, 2024 10:45 PM GMT

This is a link to the latest Bayesian Conspiracy episode. Oliver tells us how Less Wrong instantiated itself into physical reality via LightHaven, along with a bit of deep lore of foundational Rationalist/EA orgs. He also gives a surprisingly nuanced (IMO) view on Leverage! 

Do you like transcripts? We got one of those at the link as well. It's an mid AI-generated transcript, but the alternative is none. :)
 



Discuss

Страницы