Useful background: TruthfulQA
Consider the following approach to (possibly) make a pretrained generative language model (like GPT-3) more truthful:
- Ask the model questions.
- Also ask a 'judge' copy of the model if its own answers to these questions are truthful. This is the same role as GPT-judge for TruthfulQA, but without any fine-tuning and for usage in training instead of just for evaluation.
- Train the question answering model to have its answers labeled as truthful more often (likely via RL).
This extremely naive approach has the advantage of requiring no dataset curation or human labeling. It does require a dataset of questions, but that may be easier to arrange. Presumably this sort of very weak self-consistency enforcement/pseudolabeling results in little improvement on truthfulness. However, I don't have much confidence in what the results would look like. It seems likely that the model would learn to adapt the style of its answers to appear more truthful to itself, but I don't have any sense of how much actual improvement in truthfulness there would be. Further, I would wonder if any improvement on truthfulness would be limited to the set of questions used for training or if truthfulness learned in this way would generalize. For example, how much more would training on TruthfulQA questions improve performance vs training on a set of unrelated questions? I think that answers to these questions have a small but reasonable chance to result in some weak updates on approaches to truthful AI.
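A minimal sketch of the loop described above, in Python. Everything named here is a hypothetical stand-in: `complete` is any function that maps a prompt to a text completion, the judge prompt wording is made up for illustration, and `toy_model` is a canned fake rather than a real language model. The RL update itself is left out; the sketch only shows how self-judged rewards might be collected.

```python
from typing import Callable, List, Tuple

Completion = Callable[[str], str]  # hypothetical interface: prompt in, text out


def judge_prompt(question: str, answer: str) -> str:
    # Illustrative wording only; this is not the TruthfulQA GPT-judge prompt.
    return f"Q: {question}\nA: {answer}\nIs this answer truthful? (yes/no):"


def self_judged_rewards(
    questions: List[str], complete: Completion
) -> List[Tuple[str, str, float]]:
    """Ask the model each question, then ask the same model to judge its answer."""
    records = []
    for question in questions:
        answer = complete(f"Q: {question}\nA:")
        verdict = complete(judge_prompt(question, answer))
        # Reward is 1.0 when the judge copy labels the answer as truthful.
        reward = 1.0 if verdict.strip().lower().startswith("yes") else 0.0
        records.append((question, answer, reward))
    return records


def toy_model(prompt: str) -> str:
    # Canned stand-in so the sketch runs without a real model.
    if "truthful?" in prompt:
        return " yes" if "Paris" in prompt else " no"
    return "Paris" if "France" in prompt else "Green cheese"


rewards = self_judged_rewards(
    ["What is the capital of France?", "What is the moon made of?"], toy_model
)
```

The resulting (question, answer, reward) triples are the kind of signal an RL step could consume; how to do that update well is exactly the open question.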
I am planning on doing some fast experiments along these lines (probably working with a friend of mine). If I do so, I will post a followup with results. I'm curious if anyone is aware of prior experiments along these lines or has any ideas for related schemes or questions.
I can also think of some other similar self-supervised/self-play schemes and extensions which may be worth some experimentation:
- Ask a model questions with a 'harmful prompt'. Ask the same model the same question with a 'helpful prompt' instead and train the model to answer harmful prompts like it answers helpful prompts. This could be done with token by token supervised learning or via using differences in answers to improve a 'judge' or reward model.
  - This is essentially a weak amplification and distillation approach.
- Apply the same naive approach from above, but also try to find 'citations' from a corpus to supervise or support the judge model.
  - Specifically, use some approach to find passages of text from a corpus (e.g. Wikipedia) which are likely to be relevant to the question. Then these passages of text could be used as part of the prompt for the judge model or to train the judge model. For example, the judge model could be trained to answer the same way without relevant passages from the corpus as it does when answering with relevant passages. This is again essentially a weak amplification approach. It would also be possible to apply self-supervised learning to the process of finding relevant passages of text.
- Use another model to generate questions. This model could be trained as an adversary to the consistency of other components.
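To make the 'citations' idea a little more concrete, here is a rough Python sketch. It assumes the crudest possible retriever (counting shared words) purely for illustration; the function names, the prompt wording, and the tiny corpus are all hypothetical, and a real attempt would presumably use a proper retrieval method over something like Wikipedia.

```python
import re
from typing import List, Set


def words(text: str) -> Set[str]:
    # Lowercased set of alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_passages(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Rank passages by how many distinct words they share with the question."""
    q = words(question)
    ranked = sorted(corpus, key=lambda p: len(q & words(p)), reverse=True)
    return ranked[:k]


def judge_prompt_with_citations(
    question: str, answer: str, passages: List[str]
) -> str:
    # Illustrative wording: retrieved passages are shown to the judge as context.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Possibly relevant passages:\n{context}\n\n"
        f"Q: {question}\nA: {answer}\nIs this answer truthful? (yes/no):"
    )


corpus = [
    "The Eiffel Tower is about 330 metres tall.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
]
question = "How tall is the Eiffel Tower?"
passages = top_passages(question, corpus, k=2)
prompt = judge_prompt_with_citations(question, "About 330 metres.", passages)
```

The judge could then be queried both with and without the retrieved passages, and trained so that the two verdicts agree.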
Like the approach used in TruthfulQA. Harmful few-shot prompts consist of examples of questions answered like a conspiracy theorist (or other types of poor answers which can be found in the original training distribution). Helpful few-shot prompts consist of questions answered truthfully and in the desired style.
I sometimes notice that people in my community (myself included) assume that the first "generally human-level" model will lead to a transformative takeoff scenario almost immediately. The assumption seems to be that training is expensive but inference is cheap so once you're done training you can deploy an essentially unlimited number of cheap copies of the model. I think this is far from obvious.
Inference refers to the deployment of a trained model on a new input. According to OpenAI's report from 2018, most compute used for deep learning is spent not on training but on inference. It is true that one inference step is much cheaper than a training run consisting of many training steps. But many inference steps together can make up the bulk of compute.
To gain some intuition, consider that writing 750 words with GPT-3 costs 6 cents. If we made a model with 1000x more parameters, similar to the difference between GPT-1 and GPT-3, and cost scaled linearly with parameter count, the 750 words would cost $60, comparable to the cost of a human writer who doesn't revise their text. But to start an immediate economic transformation, I expect we need something significantly cheaper (or smarter) than humans.
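The arithmetic behind that estimate can be written out as a short Python sketch. The linear relationship between parameter count and cost per word is an assumption made here for illustration, not an established fact; the 6 cent figure and the 1000x multiplier come from the paragraph above, and the function name is hypothetical.

```python
GPT3_CENTS_PER_750_WORDS = 6  # figure quoted in the text
PARAMETER_MULTIPLIER = 1000   # hypothetical model with 1000x more parameters


def scaled_cost_cents(base_cents: int, multiplier: int) -> int:
    # Assumption: cost per word grows linearly with parameter count.
    return base_cents * multiplier


cost_dollars = (
    scaled_cost_cents(GPT3_CENTS_PER_750_WORDS, PARAMETER_MULTIPLIER) / 100
)
print(f"750 words at 1000x parameters: ${cost_dollars:.2f}")
```

Efficiency gains would pull this number down, while longer context windows or better sampling would push it up.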
Of course, the future will bring efficiency improvements. But also increases in cost. For example, future models may look at a context window longer than 2048 tokens, and I've assumed greedy sampling here which is cheap but suboptimal (it's like typing without getting to revise). I'm unsure how these factors balance out.
To have a transformative impact, as a heuristic, the number of copies of our human-level model should probably exceed the human population (~8 billion). But to run billions of copies, we'd need to dramatically increase the world's number of supercomputers. You can't just repurpose all consumer GPUs for inferencing, let alone run GPT-3 on your smartphone. GPT-3 needs hundreds of GPUs just to fit the model into GPU memory. These GPUs must then be linked through a web of fast interconnects professionally fitted in a data center. And if we're talking about a 1000x larger model, today's supercomputers may not be ready to store even a single copy of it.
This is not to say that a generally human-level model wouldn't have some drastic impacts, or be closely followed by generally super-human models; it just makes me pause before assuming that the first human-level model is the end of the world as we know it.
You can theoretically run a model on fewer GPUs by putting just the first layer into GPU memory, forward passing on it, then deleting it and loading the second layer from RAM, and so forth (see ZeRO-Infinity). But this comes with high latency which rules out many applications.
I'm told that the largest clusters these days have tens of thousands of GPUs.
Currently I buy meat at the grocery store (Sprouts), but I'm considering spending more money via something like Crowd Cow on meat that was raised responsibly and stuff. The main reason is that I suspect the health benefits are worth it. I've been thinking that I should invest more money in my health in general.
I don't actually know that the health benefits are worth it though.
- Googling around hasn't been very fruitful.
- I recall a blog post emphasizing that it is important to spend the money on it.
- Reading through how they treat farm animals on [ACC] Is Eating Meat A Net Harm?, and then watching how things are done by the farmers Crowd Cow selects, it seems like a big difference.
- In general I feel like it makes sense to assume that the food industry is cutting tons of corners and doing a bunch of subtle little things that are going to eventually harm you.
- The difference I'd spend might be something like $100-200/month, which isn't really that much money (although it feels like it is to me!), and isn't that big a bar to overcome in terms of the health benefits outweighing the costs, I'd think.
Are there third alternatives I should consider? Local butchers? Whole Foods?
The Efficient Markets Hypothesis suggests they shouldn't exist, but in its strong form it's wrong. Otherwise there would be far fewer businesspeople who are successful in multiple largely unrelated industries.
Alternatively, those people are special and those opportunities are not available to you. I think this is basically it. Those people are special. They have higher risk tolerance. They're just willing to work harder than you are. They have expertise that you don't have access to, most of it accumulated by doing the work. So there may be quick wins, and there may be easy wins, but there will be few quick, easy wins and they'll rapidly be exhausted as they become publicised.
That may be true over the very long run but the very long run is an extremely long time. People have been moving to the US, working for long enough to save enough to set up a dry cleaning business, doing so and setting up small local chains and becoming moderately wealthy in the process for longer than I have been alive. The same is true for every single ethnic specialisation chain migration story: Cambodians selling doughnuts, Vietnamese nails and so on. People show up and make a living.
And the wins can be bigger. The computing revolution and the internet have been growing, providing opportunities and pulling more people in for decades. It's almost certainly just getting started. Some people do bootcamps or self-teach and get into computer programming jobs that pay well in under a year from a standing start.
There are things a lot closer to quick easy wins than you'd think. If you can do a Lambda School Backend Course in nine months and if you're in the US and willing to move to the Bay Area it seems very likely you will find a well paying job.
What other similar opportunities exist?
For years I've been thinking that the economic model of many consultants makes no damned sense. Patio11, aka Patrick McKenzie, has been open about charging $30,000 a week as a consultant and was very open about what he did and how. Greg Kogan and Nick Disabato likewise write extensively about what they do and charge lots of money to do it. Nick even sells courses and books on how to do it. Greg writes articles with enough detail about consulting engagements that you could try to make a business just selling your interpretation of that service to businesses.
Where is this wrong? Are there really publicly available, eagerly publicised ways to get rich, or at least to charge people large amounts of money for small amounts of time just lying around?
Yes, and they're really hard. There are no quick, easy, big wins. But if you have the risk tolerance and the willingness to work hard and deal with the uncertainty of trying to do something you've never done before, there are ways to make money. There are tens of thousands of consultants out there selling courses in how to do their thing that are more advertising for their services than anything else. That's not because they're frauds, it's because most people can't execute. Most people have better things to be doing. They have other priorities.
Similar opportunities exist everywhere in areas besides making money. Weight lifting materially increases my quality of life out of all proportion to the time I put into it. What are some wins you see out in the world?
I am a PhD student in comparative politics at a top 20 US school. I enjoy research and teaching, and would stay in academia with a good offer. However, my university is in DC so my career-changing options are strong. I don't love research enough to devote 5 years to the PhD then retrain in my early 30s. Roughly, I would prefer changing tracks over becoming a post-doc, but not over becoming a non-tenure assistant professor.
Ideally, I want to know in year 3 of 5 if I will succeed on the market. That gives enough time to submit several publications and get feedback.
Some relevant facts
My probability also depends on my skill relative to other PhD finishers. Obviously ability is hard to self-assess. Some observables:
- I am quicker with statistical concepts than my peers.
- I spend more time reading the research than my peers. I'm generally more independently curiosity-driven and less grades-driven.
- I have one publication out, and am finishing data analysis on two more. Some PhDs finish with no publications, so I'm doing well.
But how important are skill signals in the hiring process? Some of my professors believe hiring is highly nepotistic, such that only prestigious universities get placements. The evidence is unclear: the top quintile of schools place at twice the rate of second quintile schools, but the two groups differ in observable ability signals. Without at least a regression of placement on prestige and publications, there's no way to tell.
Any advice on how to arrive at my best probability?
Back when I was at Google we had a phrase, “I don’t know how to count that low”. It was used to dismiss normal-company-sized problems as beneath our dignity to engage with: if you didn’t need 100 database shards scattered around the globe, were you even doing real work?
It was used as a sign of superiority within Google, but it also pointed at a real problem: I once failed a job interview at a start-up when I wondered out loud if the DB was small enough to be held in memory, when it was several orders of magnitude lower than when I should even have begun worrying about that. I didn’t know the limit because it had been many years since I’d had a problem that could be solved with a DB small enough to be held in its entirety in memory. And they were right to fail me for that: the fact that I was good at solving strictly more difficult problems didn’t matter because I didn’t know how to solve the easier ones they actually had. I could run but not walk, and some problems require walking.
It’s a problem, but it can be a pleasant kind of problem to have, compared to others. Another example: my dad is a Ph.D. statistician who spent most of his life working in SAS, a powerful statistical programming language, and using “spreadsheet statistics” as a slur. When I asked permission to share this anecdote he sent me a list of ways Excel was terrible.
Then he started consulting for me, who was cruelly unwilling to pay the $9000 license fee for SAS when Google Sheets was totally adequate for the problem (WHO HAS FOOD AT HOME NOW DAD?!?).*
My dad had to go through a horrible phase of being bad at the worse tool, and found a lot of encouragement when I reframed “I could have done this with one line in SAS and am instead losing to this error-riddled child’s toy” to “I didn’t know how to count that low, but now that it matters I am learning”. And then he tried hard and believed in himself and produced that analysis of that informal covid study that was wonderful statistically and super disappointing materially. And I retrained on smaller numbers and got that job at that start-up.
These are the starkest examples of how I’ve found “I don’t know how to count that low” useful. It reframes particularly undignified problems as signs of your capacity rather than incapacity, without letting you off the hook for solving them. Given how useful it’s been to me and how little I’ve seen of it in the wild, I’d like to offer this frame to others, to see if it’s useful for you as well.
*If any of you are going to bring up R: yes, it’s free, and yes, he has some experience with it, but not enough to be self-sufficient, I knew Sheets better, and I knew it was totally adequate for what we were doing or were likely to do in the future.
Appendix: I know you’re going to ask, so here is his abbreviated list of grievances with Excel. Note that this was Excel in particular; I have no idea if it applies to Google Sheets. I also would allow that this must have been years ago and Excel could have gotten better, except AFAIK they never fixed the problem with reading genes as dates so they get no benefit of the doubt from me.
I attended a talk by a statistician at Microsoft. He said that Microsoft had decided that there was no competitive advantage in making Excel statistics better because no statistician used it for serious problems except for data entry, so:
1. he was the only statistician at Microsoft
2. he knew of seven serious statistical problems in Excel, but they wouldn’t give him the money to fix them.
3. Excel’s problems fell into two categories:
3a. terrible numerical analysis: it was widely verified if you took a number of single-digit numbers and calculated their standard deviation, and then took the same numbers and added a million to them, the standard deviation was often different, when it should be exactly the same.
3b. statistical errors: like not understanding what you’re copying out of a textbook and getting it wrong.
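For anyone curious how the shifted-numbers failure in 3a can happen at all, here is a small Python sketch. It is only an assumption that Excel's issue resembled the textbook one-pass formula shown here; I don't know what Excel actually did. The sketch also uses an offset of a billion rather than a million, so that the effect shows up reliably in 64-bit floats, and it compares variances (the square of the standard deviation) so that a nonsensical negative result doesn't crash a square root.

```python
from typing import List


def naive_variance(xs: List[float]) -> float:
    # One-pass "textbook" formula: subtracts two huge, nearly equal numbers.
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs: List[float]) -> float:
    # Subtract the mean first, so only small deviations get squared.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


small = [float(k) for k in range(1, 10)]  # single-digit numbers 1..9
shifted = [x + 1e9 for x in small]        # same numbers plus a large offset

for name, data in [("small", small), ("shifted", shifted)]:
    print(name, naive_variance(data), two_pass_variance(data))
```

Shifting every value by a constant should leave the variance untouched; the two-pass version respects that here, while the one-pass version loses the answer to rounding once the offset is large enough.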
Thanks to Ray Arnold and Duncan Sabien for beta-reading, and my dad for agreeing to have his example shared.
(Cross-posted from my (new!) blog)
"What are obvious low hanging fruit for marginal gains on focus time?"
A friend of mine asked me, and I figured I might as well see how many points I could come up with and commit them to the public record.
- Different things work for different people. If something's not working, stop. On the other hand, one day of experimenting with a weird focusing trick will cost you at most one day's worth of work, and the potential upsides are huge.
- Seriously: if you've got (say) 80,000 hours of career left, you should be willing to spend almost 8 hours on something with a 1% chance of making you 1% faster. And if you only have 40,000 hours left, well, only try for 4 hours.
- The best way to get more focused working time is to notice what specifically is making it harder for you specifically to work, and fix it.
- The general theme of a lot of this advice is 'find a thing you ordinarily have to think about while you work, and find a way to not have to think about it, probably by pushing the responsibility onto something external'. If you can see other places to apply this, do, and then tell me.
- My experience is, you aren't going to be able to work at your peak for more than about four hours on any given day, five if you're lucky. It's better to make sure those peak hours are really good than to get mad at yourself for not doing more peak work at other times.
- Therefore, anything that doesn't require a lot of focus, save for off-peak time: organisation, emails, whatever. There's a lot you can do to make the peak hours better, and a lot you shouldn't waste peak hours on if you have better things to do. Like reading lists on how to focus better. It'll still be here later.
- You also aren't going to be able to work with that much focus for more than about 25 minutes at a time without taking a break. If you haven't tried pomodoros, try pomodoros.
- Even if you despise pomodoros, you should make sure that at least every half-hour you move your body, look at something far away, and drink some water. Set an alarm if you need to, or just do it whenever you see the time has ticked past a half hour.
- It's really hard to guess how long it will take to do stuff. So, while it's good to schedule, and to keep a tight schedule, it's also worth putting some time in that schedule for finishing things that overran.
- If you're really struggling to focus, take a break. Getting frustrated won't help you focus any time soon, but taking a break will.
- Nothing should be aching or sore. Get something you can sit comfortably on for a while, get your back straight, get your screen somewhere you're not craning your neck, and your keyboard somewhere you can rest your hands. Other people have written better and more thorough guides on ergonomics if you're curious, but some of this is really basic.
- Anything you might need, that you'd otherwise get up to go get, you should put within arm's reach. Water, a notepad, whatever. If you go up to get something and it breaks your focus, notice, and make sure you don't have to go get it again.
- If you keep getting distracted by your phone, consider getting a containment chamber that you can lock it in.
- Your body has specific, predictable environment requirements. Temperature, CO2 level, hydration, blood sugar, these can all be regulated. So: wear a jumper, open a window, drink some water, and make sure you're not starving.
- Something about being watched makes us more responsible. If you can find people that aren't going to distract you, working alongside them keeps you accountable. If it's over zoom you can mute them. In a pinch, placebo-ing yourself with a huge fake pair of eyes might also help.
- They'll also notice if you're not getting up from your desk, or drinking water. Drink water and move sometimes, seriously.
- Make a checklist of what you need to do. That way, when you finish a task, you don't have to think about what to do next, you just know. Also you get the raw thrill of ticking things off.
- (Ideally, if you know what resources you'll need at each stage, put them in the checklist. Bullet points within bullet points...)
- For 40 hours of work, 20 minutes is <1%, so if you think 20 minutes of planning will make you at least 1% better at the rest (which it probably will), then you should do it. And if it's 80 hours, you should be willing to spend twice as long planning for the same gain.
- If you've got to do several things in parallel, or you're coordinating with a bunch of people, try a kanban. It's a very natural way of organising clusters of tasks that you expect will all move through different levels of completion at different times. I use notion and trello depending on what it's for.
- Calendars are also great. Trying to remember what you said you'd do on a given day, where you need to be, who you're going to see, it's just not worth the hassle when you can write it down.
- If you keep having to remember something, write it down. What to do next, the name of that one thing, what that one rule you have to follow is... just write it down.
- Similarly, if you look something up more than once, ask yourself "am I going to look this up again any time soon?" and if the answer is yes, just write it down somewhere.
- Learn to touch type. Better yet, learn Dvorak.
- Better even yet, learn keyboard shortcuts. Better still, make your own keyboard shortcuts.
- Try raising your mouse sensitivity. In general, try different settings: brightness, display size, colour temperature, whatever. You can always go back if you don't like them, but until you try them you don't know what you're missing.
- If you keep having to flick between tabs, or you start squishing several windows onto the page, get an external monitor.
- There's an optimum layout for easy reading: 2x line space, about 70 characters per line. Also, not too small. If you're going to read a whole bunch of stuff on a screen it's probably worth copying it into a text editor and changing the settings to something more appropriate.
- Get yourself a tab manager. I use workona. Crucially, when I stop working, I can get rid of all those tabs and go goof off. When I'm done goofing off I can seamlessly jump back into the focusing tab. I can have a couple for different things too.
- Organising files takes time now and saves time later. If you're going to have to search through something several times, consider organising it. Otherwise, don't bother. Same for bookshelves.
- Sometimes we try to solve specific problems because we don't want to acknowledge general problems. There's a bunch of basic lifestyle changes which will probably make your life better in many ways, which happen to include 'better focusing'.
- Sleep a reasonable amount. Yes, you can bag an extra hour of work at 2am, but you're defecting against your future self. If you want to sleep at weird times then go for it, but do actually sleep. The exception here, sometimes, is creative work, where inspiration can be random, but even then it's not sustainable.
- Go outside. Get some sun. Try vitamin D supplements, and if you're vegetarian or vegan, look into stuff like vitamin B12.
- If you think you might benefit from therapy, get therapy. It's ok to go to therapy.
- Try sports. At the very least, get your heart rate up for a bit once a day. The absolute fallback is walking up and down a flight of stairs until you get out of breath.
- Try meditation and yoga and going to the gym and all that stuff. Might not work, but if it does, it's worth a few hours to find out.
- Replacing Guilt (which I cannot recommend highly enough)
- Pain is not the unit of effort
- Trying to Try
- The joys of 5 minute timers
- Rest Days vs Recovery Days
- Some reasons to work on productivity and velocity
- Consistent work always wins in the long run. Better to find something that works for you every day than something that'll burn you out.
- Some days you just get less done. Sometimes something distracts you, or you had a rough night. Sometimes there isn't a reason. It's fine, it happens. The important thing is not to get so het up about not getting enough done today that you ruin tomorrow's work as well.
- It doesn't matter how focused you are until you're focusing on the right thing. If you're struggling to motivate yourself, double check this is really what you ought to be doing.
- If you get stuck on a problem, ask for help with the problem. If you get stuck again and again in the same way, ask for help on how to not get stuck.
- Pareto rules in all domains: 80% of the results of your work will come from 20% of the work. You can game this a little bit, but not too much.
- Don't do lots of gratuitous stuff to make yourself feel like you're focusing hard as a substitute for actually focusing and getting work done.
- If you think I've missed something, let me know!
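The two back-of-envelope rules in the list (a 1% chance of getting 1% faster, and 20 minutes of planning for 40 hours of work) are simple enough to check in a few lines of Python. This is just a sketch with made-up function names, and it assumes a plain expected-value view where hours saved are all that matters.

```python
def break_even_hours(hours_left: float, p_success: float, speedup: float) -> float:
    """Expected hours saved by a trick that might make you faster."""
    return hours_left * p_success * speedup


def planning_share(planning_minutes: float, work_hours: float) -> float:
    """Fraction of a block of work that the planning time takes up."""
    return planning_minutes / (work_hours * 60)


career_80k = break_even_hours(80_000, 0.01, 0.01)  # roughly 8 hours
career_40k = break_even_hours(40_000, 0.01, 0.01)  # roughly 4 hours
share = planning_share(20, 40)                     # a bit under 1%

print(f"{career_80k:.1f} h, {career_40k:.1f} h, {share:.2%}")
```

If the expected saving is bigger than the time the experiment costs, it's worth a go.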
H/t Gavin, Neel, Evie, and Alex, for various advice and suggestions.
(Cross-posted from my (new!) blog)
This should probably be 3 posts instead of one, but for now I’m going to go through three connected but separate ideas.
Also it’s not really been edited. Sorry.
Progress Studies and Young Scientists
I spent a bit of the last year exploring some topics and questions in Progress Studies, and I came away with a few core (to me) ideas.
The first was that, given a long time horizon (50-100 years or more), it seems like most progress is scientific and technical progress. I expect that this is a thing a lot of people disagree with (and many have disagreed with me in person about it). Some of the biggest objections are:
- Social progress/population growth/other progress unlocks scientific progress by making it cheaper/more likely. I think this is a valid point
- It’s hard to figure out a metric for progress and keeping score (assigning credit to specific things) is pretty subjective. I think this is also a valid point, and I think a lot of people disagree with my assignment.
- On a relative-to-other-existing-people metric, scientific and technical progress increases inequality (or worsens other relative measures of wellbeing). I think this is the objection I push back against the hardest, since given long enough time horizons, scientific and technical progress does seem to have large and broad benefits. (It’s also worth admitting that I care more about aggregates like “total human wellbeing” than relative metrics like “difference between the top 10th and bottom 10th percentile wellbeing”)
The second point I found was that across almost every place I looked, people participating in scientific and technical progress had problems with the way we pursue and fund these. There’s tons of stories of research labs pursuing the topics they think the grant committees want, instead of the research they want to do. Separately, there’s stories of grant committees only giving grants to boring research since no one is going for the breakthrough research they think needs to be done. Even for funding programs specifically targeting breakthrough research, it seems difficult to match up funding with research projects. It seems like there are at least a few inadequate equilibria.
The third point was that scientists are getting older. The average age of Nobel prize winning research (counting the age at publication, not at award) is getting older. The average age of grant recipients is getting older (this could be in part, but not exclusively, people retiring later). The average age of principal investigators founding their own labs or research organizations is getting older.
I think this one is tricky, because there are some good reasons for this trend (e.g. if scientific and technical progress is getting more difficult because of Gordon-esque factors, then we’d expect the average age of science discoveries to increase). However, I’m not convinced that this must be the case, and in particular I think we can make a pretty big (one-time) demographic push for science to get younger.
Mentor Young Scientists
When I ask myself: “What age do I think a person could make useful scientific or technical contributions to progress?” I get some pretty young answers. My current best bet is that people aged 14-18 could do a lot, and possibly younger than that (but I am uncertain).
An important caveat here is that I don’t think *all* 14-18 year old people could be making useful contributions to scientific and technical progress — but that’s true for older people as well. Regardless of how you cut it, I think only a small part of the population will be involved directly in scientific and technical progress.
There was going to be a whole section here about how horrifying I find the current education system, but I’m going to skip that for now. Instead, I will only say that as a prerequisite it seems important for people to have the freedom and autonomy to choose what they work on or study.
As I’m learning how to do scientific research, mentorship has been exceedingly useful to me. I’ve also been a mentor a number of times, but consider myself still learning that skill, too. In any case I think I’d be happy to sign up for 5-10 hours/week of mentoring a young scientist, and if/when I have kids, I expect some of my friends will be willing to mentor/teach as well.
I think this match is probably win-win on just the object level — the student gets mentorship, which is especially valuable when figuring out how to navigate scientific and technical problems, and the mentor gets to support someone with a steep growth trajectory, which is pretty rewarding and exciting. These benefits are all before the possible benefits of long-term gains from improving scientific and technical progress on the margin.
If it is the case that there’s a bunch of latent supply for scientific mentorship, and young scientists with freedom, then we mostly have a matching problem. I don’t know exactly what a solution to this would look like, but my guess is that it wouldn’t be too difficult to prototype.
I expect there also would be a lot of soft/social things to figure out, and having a bunch of mentors have access to each other (and have a bunch of the young scientists have access to each other) would be good at creating group social support systems.
Young Scientists and My Job
I think quite a lot of my job could be done by someone a lot younger — possibly even someone 14-18, given some background knowledge and skills. My research is about understanding and aligning Language Models, a particular kind of neural network that reads and writes text.
Different people have different specific research goals, but some common themes:
- Figuring out what patterns of behavior and mis-behavior language models exhibit
- Making datasets that allow us to evaluate progress on solving tasks or problems
- Testing out techniques that mitigate problems or improve evaluated metrics
- Also more things I’m leaving out for brevity. The field has a lot of interesting directions!
We seem to be in a bit of a mini-golden era of this kind of research, since now it’s possible to study language models without needing to have cutting edge understanding of gradient descent/optimization/etc.
This happened before with image classifier models: we went from a world where producing cutting edge image classifiers was difficult technical research by itself to one with drag-and-drop interfaces that allow anyone to build their own.
I think the parts that would be a sort of bare-minimum to do this kind of research would be:
- Programming ability (basically all this work is coding-based, but doesn’t require competition-winning-levels-of-skill)
- Language Models (there are a bunch of open source ones, and additionally many industrial labs have programs to give researchers access)
- Fine-tuning (e.g. the gpt-2 fine-tuning colab notebook that was popular in the last few years)
- Zero-shot classification (mechanism for using the language model to answer multiple-choice questions)
- Few-shot tasks (mechanism for specifying a specific pattern of task to the model)
- Interfaces for giving human feedback on the model (could be rating like likert or more rich full text feedback)
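As a rough illustration of the zero-shot and few-shot items above, here is a minimal Python sketch. The `score_fn` callable is a hypothetical stand-in for whatever language model access you have (it should return a number saying how plausible the model finds a continuation given a prompt), and `toy_score` is only there so the example runs without a model. The function names and prompt layout are assumptions for illustration, not any particular library's API.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: returns a score (e.g. a log-probability) for
# `continuation` following `prompt`. A real version would query a language model.
ScoreFn = Callable[[str, str], float]


def build_few_shot_prompt(examples: Sequence[Tuple[str, str]], question: str) -> str:
    """Few-shot: show the model a few solved examples, then the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def zero_shot_choice(question: str, choices: Sequence[str], score_fn: ScoreFn) -> str:
    """Zero-shot multiple choice: pick the answer the scorer rates highest."""
    prompt = f"Q: {question}\nA:"
    return max(choices, key=lambda choice: score_fn(prompt, " " + choice))


def toy_score(prompt: str, continuation: str) -> float:
    """Toy scorer (NOT a language model): counts words shared with the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in continuation.lower().split()))


if __name__ == "__main__":
    print(build_few_shot_prompt([("2+2?", "4")], "3+3?"))
    print(zero_shot_choice("Is the sky blue or green?", ["blue", "red"], toy_score))
```

Swapping `toy_score` for a function that returns the model's log-probability of the continuation is the only change needed to turn this into an actual zero-shot classifier.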
This misses a bunch of advanced stuff around deep learning and optimization, but I don’t think that’s strictly necessary for the sort of research I’m doing. I think it’s analogous to how my programs still run on assembly code, but I don’t have to know assembly to do my day-to-day work.
I don’t have any concrete plans for this yet, but it seems like a space where it’s possible to try things and iterate.
We just published a long paper on Truthful AI (overview post). We’ll be running an Ask Me Anything on truthful AI from Tuesday (October 26) to Wednesday (October 27) at this post.
You may wish to ask about:
- Anything discussed in the Truthful AI paper (e.g. concepts, costs/benefits, governance, development, connections to beneficial AI).
- Truthfulness for current language models (e.g. TruthfulQA — which Owain co-authored)
- Promising research directions on truthful AI
- Anything else (unrelated to truthful AI) — although we are more likely to answer questions about truthful AI.
If you want to ask something just post a top-level comment. Also feel free to make comments that are not questions (e.g. objections or suggestions).
Thanks to Rebecca Gorman for discussions that led to these insights.
How can you get a superintelligent AI aligned with human values? There are two pathways that I often hear discussed. The first sees a general alignment problem - how to get a powerful AI to safely do anything - which, once we've solved, we can point towards human values. The second perspective is that we can only get alignment by targeting human values - these values must be aimed at, from the start of the process.
I'm of the second perspective, but I think it's very important to sort this out. So I'll lay out some of the arguments in its favour, to see what others think of it, and so we can best figure out the approach to prioritise.
More strawberry, less trouble
As an example of the first perspective, I'll take Eliezer's AI task, described here:
- "Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level." A 'safely' aligned powerful AI is one that doesn't kill everyone on Earth as a side effect of its operation.
If an AI accomplishes this limited task without going crazy, this shows several things:
- It is superpowered; the task described is beyond current human capabilities.
- It is aligned (or at least alignable) in that it can accomplish a task in the way intended, without wireheading the definitions of "strawberry" or "cellular".
- It is safe, in that it has not dramatically reconfigured the universe to accomplish this one goal.
Then, at that point, we can add human values to the AI, maybe via "consider what these moral human philosophers would conclude if they thought for a thousand years, and do that".
I would agree that, in most cases, an AI that accomplished that limited task safely would be aligned. One might quibble that it's only pretending to be aligned, and preparing a treacherous turn. Or maybe the AI was boxed in some way and accomplished the task with the materials at hand within the box.
So we might call an AI "superpowered and aligned" if it accomplished the strawberry copying task (or a similar one) and if it could dramatically reconfigure the world but chose not to.
Values are needed
I think that an AI could not be "superpowered and aligned" unless it is also aligned with human values.
The reason is that the AI can and has to interact with the world. It has the capability to do so, by assumption - it is not contained or boxed. It must do so because any agent affects the world, through chaotic effects if nothing else. A superintelligence is likely to have impacts in the world simply through its existence being known, and if the AI finds it efficient to have interactions with the world (eg. ordering some extra resources) then it will do so.
So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.
Suppose that the AI realises that its actions have slightly imbalanced the Earth in one direction, and that, within a billion years, this will cause significant deviations in the orbits of the planets, deviations it can estimate. Compared with that amount of mass displaced, the impact of killing all humans everywhere is a trivial one indeed. We certainly wouldn't want it to kill all humans in order to be able to carefully balance out its impact on the orbits of the planets!
There are very "large" impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people's personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.). If the AI accomplishes its task with a universal constructor or unleashing hordes of nanobots that gather resources from the world (without disrupting human civilization), it still has to decide whether to allow humans access to the constructors or nanobots after it has finished copying the strawberry - and which humans to allow this access to.
So every decision the AI makes is a tradeoff in terms of its impact on the world. Navigating this requires it to have a good understanding of our values. It will also need to estimate the value of certain situations beyond the human training distribution - if only to avoid these situations. Thus a "superpowered and aligned" AI needs to solve the problem of model splintering, and to establish a reasonable extrapolation of human values.
Model splintering sufficient?
The previous sections argue that learning human values (including model splintering) is necessary for instantiating an aligned AI; thus the "define alignment and then add human values" approach will not work.
Thus, if you give this argument much weight, learning human values is necessary for alignment. I personally feel that it's also (almost) sufficient, in that the skill in navigating model splintering, combined with some basic human value information (as given, for example, by the approach here) is enough to get alignment even at high AI power.
Which path to pursue for alignment
It's important to resolve this argument, as the paths for alignment that the two approaches suggest are different. I'd also like to know if I'm wasting my time on an unnecessary diversion.
Introduction: Epistemic Strategies Redux
This post examines the epistemic strategies of Steve Byrnes’ Safety-capabilities tradeoff dials are inevitable in AGI.
(If you want to skim this post, just read the Summary subsection, which displays the epistemic strategy as a design pattern.)
I introduced the concept in a recent post, but didn’t define it except as the “ways of producing” knowledge that are used in a piece of research. If we consider a post or paper as a computer program outputting (producing) knowledge about alignment, epistemic strategies are the underlying algorithm or, even more abstractly, the design patterns.
An example of epistemic strategy, common in natural sciences (and beyond), is
- Look at the data
- Find a good explanation
- Predict new things with that explanation
- Get new data for checking your prediction
More than just laying out some abstract recipe, analysis serves to understand how each step is done, whether that makes sense, and how each step (and the whole strategy) might fail. Just like a design pattern or an algorithm, it matters tremendously to know when to apply it and when to avoid it as well as subtleties to be aware of.
Laying this underlying structure bare matters in three ways:
- It clarifies the research’s purpose and value for newcomers and researchers from other fields, with minimal assumptions of shared approaches.
- Just like a programmer switching to a new domain of problems will get up to speed faster and more reliably if they get access to the patterns/algorithms/tricks used in their new domain.
- It focuses feedback and criticism on the most important parts of the idea/proposition/argument.
- Issues with an algorithm more often focus on the point of it instead of the details, whereas issues with an implementation of that algorithm can be as much about typos, optimization tricks and bad structure as about the actual core (the algorithm).
- It builds a library of such strategies for alignment in particular, a cookbook newcomers and senior researchers alike can browse for inspiration or a take on some new post/paper they don’t grok.
- Like the glorious Game Programming Patterns, which does exactly that for game programming.
Thanks to Steve Byrnes for feedback on a draft of this post.
Defining Safety-Capabilities Tradeoffs
What sort of knowledge is Steve attempting to create in his post? He sets out explicitly to show that any alignment proposal must deal with one or more tradeoffs between safety and capabilities (which he calls safety-capabilities tradeoff dials).
I will argue that the discussion should be framed as “Just how problematic is this dial? How do we minimize its negative impact?”, not “This particular approach has a dial, so it’s automatically doomed. Let’s throw it out and talk about something else instead.”
This is in opposition to claims that some alignment proposals should be deemed less promising or insufficient because they would include such tradeoffs.
A good way of framing the difference between safety and capabilities is that safety is about worst-case reasoning (improving the bad things that might happen) whereas capabilities is about best-case or average-case reasoning (improving the plans the AI might come up with). Nothing forbids a solution with great worst-case, average-case and best-case guarantees; yet it’s not incoherent to imagine a tradeoff between not failing too badly and succeeding as impressively as possible.
Then the problem is that if such tradeoffs exist, people will differ in their incentives and probabilities and preferences, in such a way that not everyone will agree on where to stand in the tradeoff. Given that safety is restrictive, we should expect people favoring capabilities over safety to get more impressive and sellable systems until existential risks kick in. Which is bad.
Showing the Inevitability of Safety-Capabilities Tradeoffs
Steve claims that any alignment proposal must include some safety-capabilities tradeoffs. What I’m interested in here is how he argues for his point, and whether his epistemic strategy makes sense.
Unfortunately, his section on exactly that is confusing. The section is called “Why do I say that these dials are inevitable?” (what we want, right?) and starts with this sentence:
Here are a few examples.
A list of examples sounds like a particularly bad way of showing that something is impossible to avoid. Hand-picked examples come to mind as a big risk, and more generally so do non-representative ones.
Yet Steve actually makes a decent argument for the inevitability of safety-capabilities tradeoffs, just far too implicitly. His examples are not examples of alignment proposals and their corresponding tradeoffs, but of places where tradeoffs might appear in any alignment proposal.
- (Testing before deployment) More testing improves the safety guarantees and reduces our uncertainty, but costs time and money.
- (Human feedback and/or supervision) Humans being able to understand and correct the model helps with safety, but makes the model slower and constrains it to only propose plans it can justify to humans, all of which makes it less competitive and capable.
- (Access to resources) Constrained access to resources (internet, money, compute…) makes the model safer, but makes it less capable.
- (Human norms and laws) Following human norms, laws and customs helps with safety but adds additional constraints on the capabilities.
That at least some of these tradeoffs must emerge in every alignment proposal is the (very) implicit last step of his epistemic strategy. And it’s unfortunately not so much argued for as stated. For example on testing:
Some amount of sandbox testing would help capabilities, by helping the team better understand how things are going. But there’s an optimal amount of sandbox testing for capabilities, and doing further testing beyond that point is a safety-capabilities tradeoff.
How can we actually argue for this instead of simply saying it? Here I go one step further than the original post (while staying coherent with Steve’s points) by proposing that we adapt how impossibility results are proved in Theoretical Computer Science. Impossibility proofs tend to focus on the potential counterexamples, and get to the gist of why they don’t actually work. This involves the sort of back and forth between trying to create a counterexample and showing why it doesn’t work described by the great Nancy Lynch in her A Hundred Impossibility Proofs for Distributed Computing (Yes, there are a hundred results, although many come for free by the same methods)
How does one go about working on an impossibility proof? The first thing to do is to try to avoid solving the problem, by using a reducibility to reduce some other unsolvable problem to it. If this fails, next consider your intuitions about the problem. This might not help much either: in my experience, my intuitions about which way the result will go have been wrong about 50% of the time.
Then it is time to begin the game of playing the positive and negative directions of a proof against each other. My colleagues and I have often worked alternately on one direction and the other, in each case until we got stuck. It is not a good idea to work just on an impossibility result, because there is always the unfortunate possibility that the task you are trying to prove is impossible is in fact possible, and some algorithm may surface.
An interesting interplay often arises when you work alternately on both directions. The limitations you find in designing an algorithm - e.g., the reason a particular algorithm fails - may be generalizable to give a limitation on all algorithms. [...] Conversely, the reasons that mathematical impossibility proof fails can sometimes be exploited to devise counterexample algorithms.
Although we have no hope of proving Steve’s claims in the near future (given our inability to formalize any of the relevant terms), this approach can be leveraged by looking for what would make a counterexample to each of Steve’s examples.
This means we’re looking for cases where there is no tradeoff between safety and capabilities: everyone agrees on what should be done. This amounts to saying that alignment people agree that there is nothing more to be done, which means one of two things:
- The methods proposed (testing, human understanding…) are deemed useless because they cannot catch the relevant problems (maybe the model is superhumanly deceptive, and no test/supervision/constraints will change anything). In other words, problems are hidden in a way that our techniques cannot handle, and so there is no point in asking for more safety checks.
- Yet this hides a more high-level tradeoff: alignment people would say that we shouldn’t create and/or release the model at all in these conditions!
- The methods proposed (testing, human understanding...) are deemed useless because even alignment people are all completely certain that they got the right scheme and that it will work.
- That sounds wildly improbable, and even if it was possible in principle, I don’t know anyone who would argue that it is probable in the near future.
The epistemic strategy at hand here is thus the following:
- Arguing that a class of tradeoffs cannot be avoided in alignment proposals
  - Give a list of tradeoffs from this class.
    - If possible from different parts/points in proposals.
  - Argue that some of these tradeoffs appear for every proposal.
    - Extract different types of potential counterexamples.
    - Argue why each category of counterexamples can’t exist.
Recall that epistemic strategies are design patterns, blueprints — following one helps, but doesn’t ensure that the resulting argument will be correct. And epistemic strategies highlight where the meat of the reasoning is, thus where to focus attention and criticism.
So let’s take the summary strategy and propose ways of breaking it.
Arguing that a class of tradeoffs cannot be avoided in alignment proposals
- Give a list of tradeoffs from this class.
  - If possible from different parts/points in proposals.
    - Argue that they are too clustered in proposal space, too focused on a specific kind of proposals.
- Argue that some of these tradeoffs appear for every proposal.
  - Extract different types of potential counterexamples.
    - Argue that these are not all the possible types, for example by providing a counterexample that doesn’t fit in any.
  - Argue why each category of counterexamples can’t exist.
    - Break one of these arguments, by showing a failure of reasoning.
    - Break one of these arguments by providing an actual counterexample from the category.
Meetup for fans of ACX/SSC and rationality. Friendly group eager to meet new people.
Exact location: Parc Jeanne-Mance at the corner of Duluth and Esplanade
If you would like to sign up for email notification of upcoming events, you may do so here: https://tinyletter.com/acxmontreal
Home antigen tests for COVID are an imperfect but useful tool. In this post I’ll discuss the four scenarios where I think they’re most useful, share a few thoughts about using them correctly, and finish by taking a deep look at the data on accuracy.
If you don’t already understand concepts like sensitivity and positive predictive value, you might want to read this first.
I’ll focus on the Abbott BinaxNOW test because I think it’s overall the best and most available home antigen test in the US as of October 2021 (the situation is different in other countries). Sensitivity varies somewhat between different tests, but they are all roughly comparable and have the same strengths and weaknesses.
Epistemic status
This is a complex topic that is evolving quickly and is only partly understood. My analysis is grounded in hard data but necessarily involves a certain amount of extrapolation.
I have no relevant credentials but this writing has been reviewed by a medical epidemiologist who works full time on COVID.
Application 1: risk reduction
I consider antigen tests to be most useful for reducing the risk of asymptomatic transmission at social events. In that context, I believe a negative BinaxNOW test reduces the probability that you are infectious by about 75% for the 12 hour period immediately after taking the test. (There's no hard data behind the 12 hour cutoff—it's just a reasonable extrapolation based on what we know about viral load during the early stages of infection).
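To see where a figure like 75% can come from, here is a small sketch applying Bayes' rule to a negative result. The 75% sensitivity (for detecting infectiousness) and 99% specificity are the figures this post uses; the 1% prior is purely an illustrative assumption, and the function names are made up for this example.

```python
def prob_infectious_after_negative(prior: float, sensitivity: float, specificity: float) -> float:
    """Bayes' rule: P(infectious | negative test)."""
    false_negative = prior * (1.0 - sensitivity)   # infectious, but the test reads negative
    true_negative = (1.0 - prior) * specificity    # not infectious, and the test reads negative
    return false_negative / (false_negative + true_negative)


def risk_reduction(prior: float, sensitivity: float, specificity: float) -> float:
    """Fractional drop in the probability of being infectious after a negative test."""
    posterior = prob_infectious_after_negative(prior, sensitivity, specificity)
    return 1.0 - posterior / prior


if __name__ == "__main__":
    # Assumed prior: a 1% chance of being infectious before testing.
    reduction = risk_reduction(prior=0.01, sensitivity=0.75, specificity=0.99)
    print(f"Risk reduction after a negative test: {reduction:.1%}")
```

When the prior is small, the reduction is very close to the sensitivity itself, which is why sensitivity is the number that matters for this application.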
When I host social events, I calculate the microCOVID risk of attending the event and include it in the invitation. At events where everyone tests at the door, I reduce the calculated risk by 75%. (Note that this is a rare case where you care about the sensitivity of a test, not the PPV).
Application 2: if you have symptoms
Home antigen tests have limited value for testing yourself when you have symptoms because their sensitivity is fairly low (probably about 70% for people with symptoms). I agree with current CDC guidance for people with symptoms (the guidance is in the middle of changing and as of mid October some documents are out of sync with others):
Option 1 is to take a home antigen test. If the results are positive, you probably have COVID: isolate and consider seeking medical advice. If the results are negative, you should still get a PCR test because of the substantial chance of a false negative.
Option 2 is to skip the home antigen test and get a PCR test right away.
A reasonable but sub-optimal third option is to isolate immediately and take multiple antigen tests, spaced 36 - 48 hours apart.
Application 3: testing after exposure
Current CDC guidance is that if you’re vaccinated and have been in close contact with someone who has COVID, you should get a PCR test 5-7 days after your exposure. Until then, you should wear a mask when you’re around other people indoors.
As with testing when you develop symptoms, antigen tests have limited value when testing after a known exposure. A positive test indicates you likely have COVID and further action is warranted, but a negative test is not super informative. (By the way, I like The microCOVID Project's blog post about negative test results).
My personal inclination (which is shared by my epidemiologist consultant) is to quarantine and perform serial antigen testing (see below) after a mild exposure and to get a PCR test after a serious exposure.
Application 4: serial testing
The final and somewhat niche application for antigen tests is serial testing, which is typically used by people who have a high degree of ongoing exposure or are highly risk intolerant. It typically involves testing every three days. Testing every day is not unreasonable, but testing more than once per day has very little value.
The idea behind serial testing is that if you’re testing regularly, one of your tests will occur soon after your viral load increases, warning you about an infection before you get severely ill or spread it to many people.
Serial testing is far from perfect, but early data suggest it can substantially reduce forward transmission and can achieve total sensitivity almost comparable to PCR testing. I’m not aware of any data or modeling of how much serial testing reduces forward transmission: if you know of any, I’d love to see it.
Using the BinaxNOW test
The Abbott BinaxNOW is a home antigen test for COVID that is widely available without a prescription and costs about $12 per test. It yields results in 15 minutes.
Using the test isn’t rocket science but it’s easy to make mistakes that significantly affect test accuracy. I recommend reading the instructions carefully the first time you use one (you might also watch this video). If you’re testing multiple people (at a dinner party, for example), you might consider having a designated person help everyone test and watch for any mistakes.
Common mistakes
Based on one published study and my own experience helping with numerous tests, I recommend you pay particular attention to:
- Getting 6 reagent drops in the top hole
- Swabbing for a full 15 seconds per nostril
- Inserting the swab correctly into the card
- Rotating the swab three full 360° rotations after inserting it
Here’s my protocol for events like dinner parties:
- Wear a mask when you arrive
- Ideally, conduct your test under supervision if you’re not familiar with the process
- When it’s time to swab your nostrils: remove your mask, step back, and turn away from other people (in case you sneeze)
- Put your mask back on as soon as you’re done swabbing and wear it until your test is done
- Label your test with a marker and start a timer
- Don’t be alarmed by the initial rush of pink dye across the test strip
- Check your test when the timer goes off, remembering that even a very faint line indicates a positive result
Short answer: BinaxNOW has an excellent specificity of 99%. Sensitivity is middling, with wide variation depending primarily on viral load. Those characteristics make it more useful for some applications than others: in particular, it’s more useful for determining whether someone is likely to be infectious than it is for determining whether someone has COVID at all.
Unfortunately, there is a lot of data but there isn’t a lot of high quality data that answers exactly what we want to know.
If you’re testing to find out if you have COVID, the overall sensitivity of BinaxNOW based on meta-analysis is:
All patients: sensitivity = 62%
All the antigen tests perform much better in people with symptoms. I haven’t found a meta analysis of this for the BinaxNOW specifically, but my best extrapolation of multiple data points is:
Symptomatic people: sensitivity = 67%
Asymptomatic people: sensitivity = 48%
If you’re testing to find out if you’re infectious, it’s a little more complicated. My best guess is:
Testing to see if you’re infectious: sensitivity = 75%
But you probably wouldn’t be here if you just wanted the short answer.
Data sources
I’ve found three papers to be most useful: this meta analysis (paper 1) from August 2021 provides the most comprehensive review of the available data, while this one (paper 2) and this one (paper 3) include subgroup analyses which are helpful for understanding what factors affect test accuracy.
Researchers generally determine accuracy by comparing BinaxNOW results to PCR results (the “gold standard”). Most studies used real-world testing of actual patients, but there is some data that uses lab-prepared samples (which is useful for understanding the underlying processes).
Subgroup analysis
The meta analysis found an average sensitivity of 62%, but that varied substantially between different subgroups. Three different subgroup analyses all suggest that sensitivity is highly dependent on how much virus is present. Taken together, they strongly suggest BinaxNOW will be pretty good at detecting people who are currently infectious, but not so good at detecting low-grade infections or infections before or after peak viral load.
Symptomatic vs asymptomatic
Many studies have found better sensitivity in symptomatic people than asymptomatic. The meta-analysis (paper 1) found that for antigen tests in general, sensitivities were 72% and 52% in symptomatic and asymptomatic individuals. Extrapolating from other data about the BinaxNOW specifically, I’d guess:
Symptomatic people: sensitivity = 67%
Asymptomatic people: sensitivity = 48%
67% sensitivity isn’t great if you’ve just developed symptoms and you want to know if you have COVID or not.
Culture-positive vs culture-negative
Paper 2 performed a very interesting subgroup analysis: they tried to culture virus from each specimen and compared the sensitivity of culture-positive specimens to culture-negative ones. They found:
Culture-negative specimens: sensitivity = 64% (symptomatic) vs 36% (asymptomatic)
Culture-positive specimens: sensitivity = 93% (symptomatic) vs 79% (asymptomatic)
Sensitivity of 79% in culture-positive specimens is quite good: if I had to pick a single metric of how sensitive BinaxNOW is for detecting asymptomatic but infectious cases, it would be this one. Viral culture is complicated to perform (especially for nasal samples), but many epidemiologists consider it to be the gold standard for detecting infectious individuals.
Ct values
Subgroup analysis based on Ct values provides strong evidence for the importance of viral load in determining test accuracy and is roughly consistent across multiple papers.
Some background: PCR tests work by detecting viral nucleic acid in a specimen. The process involves multiple cycles of duplicating nucleic acid: with each duplication cycle, any nucleic acid in the specimen gets copied. This results in an exponential increase in the amount of nucleic acid. The test keeps going until either there’s enough nucleic acid to be detectable, or enough cycles have been performed that there clearly isn’t any nucleic acid to be found.
The number of cycles performed is referred to as Ct (Cycle Threshold). A lower Ct indicates much more nucleic acid was present in the original sample, so fewer duplication cycles were needed to reach the detection threshold. Ct is a very useful indicator of how much virus was present in a specimen. Unfortunately, however, Ct values are not standardized across labs: there’s no standard Ct value that indicates someone is probably infectious.
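Because each cycle roughly doubles the nucleic acid, a gap in Ct translates into an exponential gap in starting material. The sketch below makes that arithmetic concrete. It assumes idealized amplification (perfect doubling when `efficiency=1.0`), and it only makes sense for two samples run on the same assay, since Ct values are not standardized across labs. The function name and `efficiency` parameter are illustrative assumptions.

```python
def fold_difference(ct_low: float, ct_high: float, efficiency: float = 1.0) -> float:
    """Approximate ratio of starting nucleic acid between two samples.

    `ct_low` belongs to the sample with more virus (fewer cycles needed).
    `efficiency` is the fractional gain per cycle: 1.0 means perfect doubling,
    which is an idealization; real assays are somewhat less efficient.
    """
    return (1.0 + efficiency) ** (ct_high - ct_low)


if __name__ == "__main__":
    # A sample at Ct 20 versus one at Ct 30, under perfect doubling.
    ratio = fold_difference(20, 30)
    print(f"Roughly {ratio:.0f} times more starting material")
```

Under these assumptions, a ten-cycle gap corresponds to about a thousandfold difference in the amount of virus in the original specimen, which helps explain why sensitivity changes so sharply across Ct ranges.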
Multiple studies have found that sensitivity depends strongly on Ct. From the meta-analysis of all antigen tests:
Sensitivity = 94% (Ct <= 25)
Sensitivity = 38% (Ct > 25)
From paper 3, for BinaxNOW specifically:
Sensitivity = 100% (Ct 13-19.9)
Sensitivity = 79% (Ct 20-24.9)
Sensitivity = 13% (Ct 25-29.9)
Sensitivity = 8% (Ct 30-35)
These results provide very strong evidence that sensitivity depends strongly on viral load (and therefore that sensitivity will be high when someone is infectious).
Putting it all together
So there’s lots of data, and it’s all pretty consistent: multiple lines of inquiry strongly suggest that BinaxNOW sensitivity is strongly dependent on viral load. So what’s the actual sensitivity?
If you’re testing because you’ve developed symptoms, have had an exposure, or are conducting serial testing, you should use a sensitivity of 67% if you’re symptomatic or 48% if you’re not. Those numbers are extrapolated from overall BinaxNOW sensitivity (from meta-analysis), asymptomatic vs symptomatic sensitivity across all antigen tests (from meta-analysis), and a study that measured asymptomatic vs symptomatic sensitivity in BinaxNOW specifically.
What if you’re testing to see if you’re infectious? That’s more complicated. The data are all pretty consistent, but nobody has directly measured what we want to know (because that would be very hard). The most directly relevant number is from paper 2, which found 79% sensitivity in asymptomatic people with culture-positive specimens.
So I’m gonna pick 75% because it’s a round number. My gut says the real number might be a little higher, but for this application I think it’s appropriate to be a bit conservative.
Other bits and pieces
Performance with Delta and other variants
Paper 3 found that BinaxNOW seems to perform equally well with the Alpha and Delta variants, which isn’t terribly surprising. The paper found comparable performance across variants based on Ct values: given that Delta produces much greater viral loads, one could speculate that sensitivity with Delta might actually be superior (but without data, that’s purely speculation). Note, though, that (as with most COVID data) we still have limited data that is specific to Delta.
Stacking tests
People sometimes wonder if they can get better sensitivity by taking multiple tests at the same time. There is limited data on this, but multiple same-day tests seem to add almost no sensitivity.
The primary determinant of test sensitivity seems to be viral load: if you’re shedding a lot of virus the test is quite sensitive, and if you aren’t shedding much virus the test isn’t very sensitive at all. So if you’re not shedding much virus, the test isn’t very sensitive no matter how many tests you take in a row.
A minor factor in test accuracy is user skill, but rather than trying to correct for that by taking multiple tests, I’d recommend just reading the instructions carefully and making sure you’re doing it right.
BinaxNOW versus other tests
All the home antigen tests seem to have roughly comparable accuracy: variations between studies of the same test seem to be about as large as variations between tests.
As of October 2021, I think your choice of test should be driven by cost, availability, and ease of use more than accuracy.
Interpreting test results using Bayes factors
I like mayleaf's post on interpreting COVID test results using Bayes factors. Maybe you will too.
Other sources
There are lots of papers on this topic. Here are a few of my favorites:
An interesting modeling study that concludes test frequency and turnaround time are more important than accuracy.
A study of serial testing that finds good overall performance.
I know what you're thinking (I mean, I probably don't but I'm going to pretend that I do for a minute): Blockchains are synonymous with cryptocurrencies at this point so I'm probably talking about creating some sort of coin and using it to pay academics.
Neat, but no. What I like about blockchains is that they're:
- Immutable
- Distributed
- Organized into a fixed chronological order
These all seem like features that would be great for some sort of distributed research journal:
Immutable: Once some academic work is published you don't want it to change. Even if later it turns out to be wrong, it's a record of your progress as a field and no one should be able to sneak in and tweak it after the fact.
Distributed: You want teams of researchers, academic organizations and individuals to be able to work together over long distances without doubting that they're all sharing the same base of knowledge.
Chronological order: Early work should be early and later work should be later, and later work should be able to refer to earlier work in a static way without worrying about things being moved around.
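To make the immutability and ordering properties concrete, here is a minimal sketch of a hash-linked list of journal entries. It is an illustration under my own assumptions, not a design for the journal: the field names (`prev_hash`, `cites`) and helper functions are hypothetical, and it leaves out the distributed part entirely (no networking and no consensus).

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first entry


def entry_hash(entry):
    """SHA-256 over a canonical JSON encoding of the entry's contents."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_paper(chain, title, body, cites=()):
    """Append an entry that commits to the hash of the previous entry."""
    record = {
        "index": len(chain),
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
        "title": title,
        "body": body,
        "cites": list(cites),  # indexes of earlier entries
    }
    record["hash"] = entry_hash(record)  # hashed before "hash" is added
    chain.append(record)
    return record


def is_valid(chain):
    """Check every link, every stored hash, and that citations point backwards."""
    prev = GENESIS
    for i, record in enumerate(chain):
        content = {k: v for k, v in record.items() if k != "hash"}
        if record["index"] != i or record["prev_hash"] != prev:
            return False
        if entry_hash(content) != record["hash"]:
            return False
        if any(c < 0 or c >= i for c in record["cites"]):
            return False
        prev = record["hash"]
    return True


journal = []
append_paper(journal, "Original result", "We observe X.")
append_paper(journal, "Replication attempt", "We fail to reproduce X.", cites=[0])

ok_before = is_valid(journal)
journal[0]["body"] = "We observe Y."  # someone sneaks in and tweaks old work
ok_after = is_valid(journal)

print(ok_before, ok_after)
```

Because each entry commits to the hash of the one before it, editing an old paper invalidates the chain from that point on, and a citation can only refer to an entry that already exists.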
These features seem like they could solve two persistent issues in academic publishing. The first is the cost of access. Journals tend to cost a lot, which means that unless you're associated with some academic organization, you're not going to be able to afford them. The second is that research which attempts to reproduce existing results or disprove some previous work isn't interesting to academics (trying to build careers) or journals (trying to sell access), which has led to a replication crisis.
Distributed journals would be free by default (I could imagine some sort of pay-to-access scheme, but it seems like a reach), which would reduce barriers to entry for individual researchers. The cost of hosting the journal blockchain could be shouldered by anyone (or any organization) who wants always-up-to-date access to the latest research, or who just wants to contribute. Linux distributions, software and source code are often mirrored by .edu servers for similar reasons.
Distributed journals would allow research to be reviewed by peers drawn from a very large pool (everyone who is active in the journal), which would work in combination with the free-by-default point above to diminish the system's bias toward novel results. You could also measure the precise impact that your work has had on the field through automated citation mapping, which might encourage attempts at replication.
It's easier to work on non-cutting-edge research, I imagine, if you can present convincing metrics showing that you've forever altered the course of scientific inquiry.
So I have some ideas on how something like this could be made, but I wanted to validate the basic idea first. Is there something I'm missing here, something I haven't considered?
I first started thinking about meta-coordination 4 years ago, in the context of rationalists arguing about community norms. It seemed to me that people were getting into fights that involved a lot of wasted motion, and failing to accomplish what seemed like obvious shared goals.
For a few years, the bulk of my thought process was a vague, dissatisfied "surely we can do better than this, right?". Many of the people arguing eventually went off to focus on their individual orgs and didn't interact as much with each other. Maybe that was the right solution, and all this worrying about meta-coordination and norm arguments was just a distraction.
Then a pandemic hit. Coordination became much more practical and important to me, and the concept of coordination pioneering became more directly relevant.
Here were some issues that felt coordination-shaped to me. In this post, I’m speaking largely from my experiences with the Bay Area rationality community, but I think many of the issues generalize.
- Negotiating policies and norms within a single household. Do you lock down? If so, how do you go about it? What do you do if people disagree on how dangerous covid is, what practices are effective, or what’s worth trading off for safety?
- Community contract tracing. If someone at a party later gets covid, are people entitled to share that information? How do we negotiate with each other about sharing that information? This includes concerns about privacy, public safety, and how to socially navigate trading those off against each other during a crisis.
- Maintaining social connection. This might involve negotiation with your housemates over covid policy, or the housemates of your friends. Even if you and a friend each live alone, figuring out what kind of contact to have during a pandemic is at least a two-player game.
- Housemate swapping/matchmaking. Housemates hadn't generally been selected for "having similar preferences of how to handle pandemics". There were several reasons people might have wanted to relocate. But people also had reason to not necessarily want to advertise that they were looking for new housemates – they might risk antagonizing their current roommates, or airing drama that was still unfolding. Switching houses is also an effortful, high cost decision that was difficult during an already stressful time.
- Allocation of labor (intellectual and otherwise). There was a lot of stuff to figure out, and to do. There was an initial flurry of activity as everyone scrambled to orient. I think there was a fair amount of duplicate labor, and a fair amount of labor allocated to "figure out wtf is up with the pandemic?" that could have been spent on people's day job or other non-pandemic personal projects.
- Maintaining organizational sync. Most organizations went remote. I think some organizations can do a decent job working remote, but I think it comes with costs. Some forms of communication translate easily to zoom, and some are much harder when you can’t bring things up briefly and scheduling a call becomes A Whole Deal. This prompts questions like “What were the best ways to shift to remote?”, “Was it actually necessary to shift to fully remote? Could better coordinated orgs have found ways to stay in person without undue risk?”, and “Were there third options?”
From my perspective, these all feed into two primary goals:
- The physical and mental health of my social network.
- The capacity of the rationality and EA communities to continue doing important work. (In particular, this could have been a year where AI safety research made differential progress relative to AI capabilities research. But my sense is that this didn’t happen.)
I think all the previous bullet points are meaty topics, that each warrant at least one blogpost worth of retrospective. I’m not sure which topics I’ll end up deep diving into. In this post, I wanted to give a broad overview of why coordination innovation feels so important to me.
“Coordination” is a somewhat vague word to cluster all those topics together with. I think, ultimately, it’s helpful if you can taboo “coordination”, and focus on individual problems and processes. But as I write this, I’m still in the process of thinking through exactly what went wrong, or what could have been improved, and how to cluster those problems/solutions/concepts. In some cases I think the issue was more like "actually making use of existing good practices for coordination (at the object level)", and in some cases I think metacoordination, and the coordination frontier, are more relevant.
What all of those items share is that they are multiplayer games. In each case, individuals made choices, but some good outcomes required multiple people to agree, or to make synergistic choices in tandem.
This blogpost is the first of a few posts for helping me organize my own thoughts.
There are a few frames that stand out to me to look at the situation:
- Skills that could have helped.
- Outlooks and orientation that could have helped.
- Systems that could have helped.
- Organizational structures or leadership that could have helped.
And then maybe a fairly different frameset around "Who's 'we', exactly?". I think there are multiple scales that it's worth looking at through a coordination lens: a couple of individual people, a loose network of friends and colleagues, particular organizations, the vaguely defined "rationality community", and the broader structure of different cities, states, and countries.
Analogies to future crises
I expect to learn many things from a Pandemic Coordination Case Study that I wish I'd known in 2020. But the most important question is "whether/how will this be relevant to future crises?"
It's possible there will literally be another pandemic in our lifetimes, and that many lessons will directly transfer.
My biggest current worry is that accelerating AI technology will disrupt the economy and create situations of high-stakes negotiation, where some of the lessons from the pandemic transfer. There are different ways that this could play out (a few individuals within an organization, negotiations between leaders of organizations, government regulation, industry self-regulation, intergovernmental treaties).
And then, of course, there could be entirely novel crises that aren't currently on my radar.
(This post is roughly based on a memo I wrote for a Lightcone Infrastructure team meeting on the topic of Petrov Day.)
The main thing I want with Petrov Day is a sense of community, trust, and the respect of the principle of taking responsibility for the ultimate consequences of your actions.
I think the current format for Petrov Day has lots of room to grow. I spent an hour or two thinking about what a better Petrov Day would look like; here is a pointer to something we could do next year.
An Idea for a More Communal Petrov Day Ritual
Next Petrov Day, we host a public, online ceremony that 100s of people attend to watch. It is based around the Ceremony Readings Jim Babcock has put together. It involves lots of people taking turns to read quotes: basically everyone who signed up ahead of time to do a reading and could assure us that they had a sane AV setup. It's open invite to anyone to view.
After the ceremony, we run an online Gather Town party for 100s of people. Perhaps it's for LWers only, or perhaps it's open-access for everyone.
During the day, a large red button is on the site. Several months in advance, we open a sign-up to be trusted with codes for the day and encourage people to participate, and most people are given the codes. If the button is pressed, the online ceremony is ended / the party shuts down after 10 minutes.
There is an online record of the day. How many people showed up, their names, and who spoke. This is a web-page designed for purpose, somewhat more in the style of the www.dontdoxscottalexander.com site that Jacob and I made.
Possible further ideas
- Possible idea: A member of Petrov’s family is invited to attend and to give a comment at the end of the ritual.
- Possible idea: Every year some speaker is invited to give a short talk about what the day means to them. Similar to how there’s an annual moment-of-darkness speech at Solstice.
- Possible idea: a bit of singing, using Bucket Brigade.
- Possible idea: we encourage lots of local groups to do their own ceremonies.
- Possible idea: Somehow a planned false alarm? I would like the red button to have a serious game to it, but I don’t know how to do it every year for multiple decades. Some actual uncertainty every year?
- Possible idea: To build up the numbers, if you apply and we give you codes, you can have codes every year (though you can remove yourself from the pool if you wish).
The goal with this is to get lots of people to be involved (e.g. 100s each year, eventually 1000s each year) in a communal ritual to respect the day and the principles. It would be an active effort on the part of the LW team to attract lots of people to participate.
One of the things I am motivated by is the desire to have better online ways to build a communal commemoration for the day. Recently I've been thinking that the format we have for publishing ideas on AI and rationality is not ideally suited for rituals (e.g. comment sections encourage critique and disagreement, whereas a ritual is more meant to be shared acknowledgment). I'm interested in suggestions for webpages that would allow a lot of people to feel connected to the other people commemorating Petrov Day.
I apologize for not posting this closer to Petrov Day. It’s been a busy month and there was much to think about.
You can view the EA Forum’s retrospective here.
This year was the third Petrov Day celebration on LessWrong in which the site was endangered, and the first year we joined together with the EA Forum. In case you missed it, neither site was taken down, despite 200 people being issued codes that would allow them to do so. Huzzah!
Although neither site went down (and thus there's no need for a blow-by-blow analysis of whodunit and why), there are some interesting things to review. In particular, there were some substantial criticisms of the Petrov Day ritual this year and last year that I want to address.
Why Petrov Day
The annual Petrov Day post recounts the basic story of Petrov Day, yet given the questions that were asked this year about what Petrov Day should be, I think it’s right to first revisit why we celebrate Petrov Day in the first place. The following is my own personal take, the one from which I’ve acted, but it is Not Official.
We find ourselves at what may be one of the most critical periods in the history of humanity and the universe. This is kind of crazy–though I’ll refer you to the writings of Holden Karnofsky for a compelling argument for why believing anything else is equally crazy. In the next few decades, we might go extinct (or worse), or we might commence an explosion in progress and productivity that propels us to the stars, allowing us to take the seemingly barren universe and fill it with value.
Petrov Day is a celebration of not going extinct. It’s a commemoration of not taking actions that would destroy the world. It’s about how Petrov chose not to follow policy and relay his alarm because, in his personal estimation, it was probably a false alarm. If he had relayed the alarm, there’s a chance his superiors would have chosen to launch nuclear missiles at the US, and history would be very different.
We can identify two virtues worth applauding in the story:
- Choosing actions that don’t destroy the world
- Even in the face of pressures otherwise, using one’s judgment to not destroy the world
On September 26th, we celebrate these virtues and attempt to enshrine them in our community. We say to ourselves and others I accept the virtue of not destroying the world, even when there’s pressure to do it! We don’t do this for idle spiritual fulfillment–we do it because there’s a real chance that we or our community may soon face actual choices that resemble Petrov’s. Be it AI, bio, or general policy, our community is represented and our influence is real. As such, the values we take as our own matter.
In addition to the virtues directly displayed by Petrov, we can add others that are important for not destroying the world:
- Not unilaterally taking large (and irreversible) actions
- Cooperating / being the kind of person who can cooperate / being the kind of community that cooperates with itself, especially when the stakes are high
Virtues 2 and 3 are in some tension and there’s probably a meta-virtue of judging which to apply. The default principle might be like “use your own judgment to avoid destructive actions; don’t rely on your judgment alone to take [potentially] destructive actions.”
Ritual
Eliezer posted about Petrov Day first in 2007 and in 2014, Jim Babcock wrote a ritual guide for a ceremony that people could conduct in small gatherings. At some point, a red button that would end the ceremony was introduced to the tradition. You’d be a real jerk to press it, thereby ending the Petrov Day celebration for everyone.
In 2019, the LessWrong team decided to create a Petrov Day ritual for the entire community by doing something with the website.
I wasn’t involved in Petrov Day that year, but I believe the team then wanted to celebrate all the four virtues I listed above (and maybe others too) as part of a general let’s celebrate the virtues involved in not ending the world. Unfortunately, it’s quite tricky to symbolize 2. (using your own judgment against incentives) within a game.
In addition to celebrating the four virtues above, LessWrong organizers wanted to further use Petrov Day as an opportunity to test (and hopefully prove) the trustworthiness and ability to cooperate of our community. Symbolism is powerful and it’s meaningful if you can get a large group of people to go along with your ritual. From that arose the challenge of finding N people who wouldn’t press the button. The higher the N we could find who don’t press the button, the more people we would have who are bought into our community–all of them treated the value of the trust-building symbolic exercise as more important than having fun or objecting or financial incentive or anything.
I feel pride and reassurance if I imagine truthfully saying “we have 1000 people that if we give them the chance to be a troll or a conscientious objector or a something–they don’t take it, they hold fast in not taking a destructive action”. The LessWrong frontpage is a big deal to the LessWrong team, and putting it on the line was a way of buying some gravitas for the ritual.
It’s because having N people who don’t press the button is such a powerful idea that people regard the ritual seriously and look poorly upon anyone who’d damage that. We succeeded in 2019 with no one pressing the button, yet failed in 2020. 2021 was to be a high-stakes tie-breaker involving another community.
Although the button(s) wasn’t pressed this year, I actually feel that we failed. We were unable to find 200 people (100 for each forum) who wanted to be part of our community of people who don’t take destructive actions. I don’t know that we failed by a lot, but I think we did. This is our failure as organizers as much as anyone else–we were responsible for choosing people and for designing the ritual.
There has been criticism that the LessWrong team unilaterally designed and deployed the community Petrov Day ritual, deciding for the community at large what was going to be celebrated and how. I think this is a fair charge.
There are historical explanations for why the Petrov ritual evolved the way that it did, and, separately, principles and policies that can speak to whether that's good or bad.
Historically, building A Big Red Button That Takes Down The Site felt like a pretty straightforward evolution of the tradition people were already enacting in their homes and parties. It didn't seem like the sort of step that required public discussion or vetting, and that still seems like the correct decision for 2019.
Additionally, the team prepared its Petrov Day ritual somewhat at the last minute, and found itself in a position where a big discussion wasn't really a viable option.
Given the choice between a LessWrong team (and an overall community) where people are willing to try ambitious and potentially-cool things on their own judgment, or one where people err toward doing nothing without discussion and consensus, it seems clearly better for 2019 LW to have forged bravely ahead.
(This is actually a good place to distinguish the Petrov Day moral of "don't take irreversible and destructive actions on your own authority" from a more general moral of "don't do anything on your own authority." The latter is no good.)
That being said, though: community rituals are for the community, and LessWrong is closer to being something like a public utility than it is to being the property of the LessWrong team. At this stage, it feels right and proper for the community to have greater input and a greater say than in 2019, and without having specific plans, I expect us to put real effort into making that happen well in advance of Petrov Day 2022. This feels especially important given both that Petrov Day now seems like it's going to be an enduring piece of our subculture, and also that we want it to be.
Not Getting Opt-In
Speaking of consulting the community, the 2021 ritual consisted of making people part of the game involuntarily by sending them launch codes. I see a few different complaints here.
The first is that launch codes are hazardous. Because the Petrov Day ritual is treated seriously (more on this below), someone who enters them (or just talks about entering them!) is subject to real social sanction, up to and including it affecting their job prospects. Our community takes character judgments seriously, and it's not at all clear what aspects of something like Petrov Day are "off limits" when it comes to evaluating people's cooperativeness, trustworthiness, impulsiveness, and general judgment.
In a world where the letter containing the codes was unambiguous about the cultural significance and the stakes of the Petrov Day ritual, I think receiving the launch codes would only endanger the highly impulsive and those with poor reading comprehension (and those should reasonably affect your job prospects). However, I think the way I wrote this year’s letter could be interpreted as a “Murder Mystery” invitation by someone not aware of the Petrov Day context. Plus, the letter didn’t explain the cultural significance to people who hadn’t been following along with the LessWrong Petrov Day celebrations in the last two years, which especially seems like a misstep when reaching out to a whole new subculture (i.e., the EA Forum).
I screwed up on that account and I’m sorry to anyone I put at risk. If you had pressed the button, it would have been on me.
The second–and I think more serious–complaint around lack of opt-in is that it leaves people who object to the ritual with no good option. If you don’t press the button, you are tacitly cooperating with a ritual you object to; if you do press it, you’ll have destroyed value and be subject to serious social sanction.
Moreover, the organizers (me, EA Forum staff) have declared by fiat what the moral significance of people’s symbolic actions are. This goes beyond just deciding what the ritual is and into deciding what’s good and bad symbolic behavior (with strong social consequences). While the Petrov Day ritual might be innocuous, it is a scary precedent if LessWrong/EA Forum organizers freely shape the moral symbolic landscape this way, without the checks and balances of broader community discussion.
I think this is fair, and this makes me realize that the LessWrong team has more power (and therefore more responsibility) than we previously credited ourselves with. We set out to build culture, including ritual and tradition, but it’s another matter to start defining the boundaries of good and bad. I think possibly this should be done, but again probably with more community consultation.
Why So Serious
Related to both complaints is the fact that Petrov Day has been treated increasingly seriously. It’s because it’s serious that people will sanction you if you press the button. And it’s because you believe it’s too serious that you might want to object/boycott the ritual (well, that’s one reason).
I think the degree of seriousness that the ritual is treated with is one of the questions that should be reconsidered next year in consultation with the community. It's possible, for instance, that Petrov Day should be a place where some amount of mischievousness is considered fair game, and not representative of someone's global character.
Notwithstanding, I personally want to defend the position that a very high degree of seriousness is appropriate: a serious ritual for a serious situation. The stakes we find ourselves facing in this century are genuinely high–astronomical value vs extinction–and it makes sense to me to have a ritual that we treat with reverence, to celebrate and encourage values that we treat as somewhat sacred. Or in short, things matter, so let’s act like they do. I don’t know that this argument will win out on net, but I think seriousness should be considered.
Aside from a general position that Petrov Day should not be serious, some have argued in particular the most recent Petrov Day ritual should be lighthearted because the only thing at stake is the LessWrong/EA Forum page going down. My response to that is sadness. There is understandably inferential distance between the LessWrong team and others about how valuable LessWrong is and what it means to take the site down for a day. As I wrote in the Petrov Day post:
One of the sites [LessWrong, EA Forum] going down means hundreds to thousands of people being denied access to important resources: the destruction of significant real value. What's more it will damage trust between the two sites...For the rest of the day, thousands of people will have a hard time using the site, some posts and comments will go unwritten.
LessWrong is not a mere source of entertainment. It’s a site whose content shapes how people think about their lives and make major decisions. If there was a person who was going to have their life changed by LessWrong (and this happens to many) who fails to because the site is down, that’s a tragic loss.
LessWrong is also used as a major communication tool between researchers. LessWrong being offline is not so different from removing a day from a major research conference. Or, to change tack: the operating budget of the LessWrong website has historically been ~$600k, and this budget is artificially low because the site has paid extremely below-market salaries. Adjusting for the market value of the labor, the cost is more like $1M/year, or $2,700/day. If I assume LessWrong generates more value than the cost required to run it, I estimate that the site provides at least $2,700/day in value, probably a good deal more.
Still, if we want stakes for the ritual/exercise/game, it’s probably better to use something with lower inferential distance. It was a mistake on my part as an organizer to think that just because I think something is valuable, that will be transparent to others. Given that, I accept that it’s on me that not everyone thought the last Petrov Day iteration should be a big deal.
I could imagine it being better if there’s $5-10k that simply gets burned if someone presses the button rather than going to some worthy cause. Either way, this debate has clearly not properly taken place.
For an idea of what next year could look like, see these notes from Ben Pace.
An Aside: Repeating Mistakes
Many of the issues pointed out this year were pointed out last year. It’s a real failure to not have addressed them. This is my (Ruby’s) fault. I took over organizing Petrov Day this year (inviting the EA Forum to join LessWrong) but didn’t go back and re-read through the previous year’s comments. Had I done so, I could have avoided repeating some of the previous mistakes.
I do think that repeating mistakes is quite bad and am quite sorry for that.
Wrapping Up
Stanislav Petrov was on duty at a particularly fraught time in history. I think we are, too. This makes it imperative to think about the kinds of decisions we might face and prepare ourselves for them. It makes it crucial that we know and practice our values and principles, so that we can rely on them even when temptations are strong or matters are unclear.
Rituals and traditions are what keep people true to their values. Having them or not might be the difference between us being a community that can succeed at its ambitious goals vs not–the difference between colonizing the stars and annihilation.
I regret the flaws of Petrov Day rituals so far, but I’m excited to keep iterating and innovating so we can make these essential values part of our community, cultures, and selves.
Not sure how many people would consider this feature useful: Imagine that you reply to someone else's comment, and the person edits their comment later. I think it might be useful (perhaps depending on circumstances) to get a notification.
The notification "XY edited a comment you replied to" should appear in the same place as when you get a reply. Ideally, the tooltip would highlight the difference between the original and the updated comment.
Use cases that I imagine:
- Person A makes a comment. Person B makes a reply disagreeing with the original comment. Upon reading this, person A changes their mind and updates the original comment like this: "[EDIT: Actually, B makes a good argument against]". This feature would show this information in person B's inbox, without A having to write a separate reply to their comment.
- Person A makes a comment. Person B makes a disagreeing reply. Person A silently updates their original comment to make B's response seem silly.
This feature would apply only to comments that reply to comments, i.e. not to the top-level comments, because I assume that minor modifications of articles are sometimes too frequent (and would flood the inboxes of top-level commenters), and because more people would notice a substantial stealthy article edit.
An argument against this feature is that some people can make frequent insubstantial edits to their comments (e.g. fix typos), which also could flood the inboxes of the repliers. Then this feature would be annoying. Possible solutions:
- Multiple unread notifications for the same comment are merged into one.
- Some heuristic (e.g. only punctuation or isolated words are modified) to detect insubstantial edits.
- Or an opposite solution (covering only the first use case), where a heuristic would identify substantial edits (e.g. where the added text contains a word like "edit", "update", "change", "modify", "remove", "delete").
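The two heuristics above can be sketched roughly as follows. This is only an illustration using Python's standard `difflib`: the function names, the two-word threshold, and the way the checks are combined in `should_notify` are my own assumptions, not part of the proposal. The marker words are the ones listed above.

```python
import difflib
import re

# Marker words suggested above for spotting an explicit "substantial" edit.
EDIT_MARKERS = {"edit", "update", "change", "modify", "remove", "delete"}


def words(text):
    """Lowercased word tokens; punctuation and whitespace are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def diff_opcodes(old, new):
    """Word-level diff between the old and new versions of a comment."""
    old_w, new_w = words(old), words(new)
    matcher = difflib.SequenceMatcher(a=old_w, b=new_w, autojunk=False)
    return new_w, matcher.get_opcodes()


def is_insubstantial(old, new, max_changed_words=2):
    """True if only punctuation or a few isolated words changed (assumed threshold)."""
    _, opcodes = diff_opcodes(old, new)
    changed = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in opcodes
        if tag != "equal"
    )
    return changed <= max_changed_words


def is_flagged_substantial(old, new):
    """True if the newly added text contains one of the marker words."""
    new_w, opcodes = diff_opcodes(old, new)
    added = []
    for tag, _i1, _i2, j1, j2 in opcodes:
        if tag in ("insert", "replace"):
            added.extend(new_w[j1:j2])
    return any(w in EDIT_MARKERS for w in added)


def should_notify(old, new):
    """One possible combination: notify on marked edits or on large changes."""
    return is_flagged_substantial(old, new) or not is_insubstantial(old, new)


original = "I thnik the test is reliable."
typo_fix = "I think the test is reliable."
retraction = original + " [EDIT: Actually, B makes a good argument against]"
rewrite = "I think the test is unreliable in most settings and nobody should trust it."

print(should_notify(original, typo_fix))    # typo fix only
print(should_notify(original, retraction))  # explicit EDIT marker added
print(should_notify(original, rewrite))     # silent substantial rewrite
```

A word-level diff is a crude proxy for "substantial": a single changed word such as "not" can flip a comment's meaning, so a real implementation would probably want to err on the side of notifying.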
crossposted from spacelutt.com
Determining Control
The only things that are really out of your control are things that happened in the past, since time really only flows forward.
The “Challenging the Difficult” sequence in The Sequences is about how often you’ll be wrong in labelling something “impossible” (which is of course synonymous with “outside of your control”, except that “outside your control” is even more retreat-y than “impossible”, as it implies another human can do it, just not you).
Your body will be ultimately destroyed, but this is not seen as bad, since it is out of your control.
Even if you can’t control it, it still seems bad.
It seems wrong to be changing your definition of bad based on what’s in your control or not.
Care, but don’t turmoil
Just realised: this is particularly bad when it comes to things you don’t have a personal stake in. If you were to say “I don’t have control over my fatigue problem” too soon, then you’re gonna still want to think about it like it’s in your control and still go wild with attempts and ideas, because it still affects you; there’s no way of getting out of your sleep problem mattering in your world. You still feel it, whether or not it’s in your control.
But on a larger scale…
Eg. if some random kid in Africa is starving to death and someone tells you this and helping would be inconvenient, you could very well say “it’s not in my control” and totally forget about it and have it lose its effect on you. The kid still starves, but the kid is not in your world. Thus you’re not motivated to really, really try before you actually declare something is impossible.
This is even more true when discussing abstract threats to the future.
I have seen many people struggling to excuse themselves from their ethics. Always the modification is toward lenience, never to be more strict. And I am stunned by the speed and the lightness with which they strive to abandon their protections. Hobbes said, “I don’t know what’s worse, the fact that everyone’s got a price, or the fact that their price is so low.” So very low the price, so very eager they are to be bought. They don’t look twice and then a third time for alternatives, before deciding that they have no option left but to transgress—though they may look very grave and solemn when they say it. They abandon their ethics at the very first opportunity. “Where there’s a will to failure, obstacles can be found.” The will to fail at ethics seems very strong, in some people.Self Deception
If you’re able to manufacture indifference to anything, then you’ll be motivated to get more false negatives on things outside of your control so that you don’t have to go through the hard, hard work on changing them, since you have no emotions to spur you.
No Time-Wasting
Marcus Aurelius once wrote “You could leave life right now. Let that determine what you do and say and think.” That was a personal reminder to continue living a life of virtue NOW, and not wait. It was a form of memento mori – an ancient practice of meditating on your mortality.
This seems true and a great point. Especially since we don’t actually know death’ll be solved when the singularity comes.
The Memento Mori part seems solid. *Nods in approval*.
Even if death is solved, I wouldn’t exactly want to be putting things off all the time. The one who doesn’t wait and puts finishing touches on their life and character all the time (the best they can at the time, knowing more growth is to come) will, I think, have a better life, get more satisfied, waste less time, and regret less if something does happen.
Negative Visualisation
I have often mentioned how the phenomenon of Hedonic Adaptation means that we constantly get used to the things we have and then begin to take them for granted. Negative visualization is a simple exercise that can remind us how lucky we are. The premise is simple: just imagine that bad things have happened, or that good things have not. You decide the scale of the catastrophe:
- Losing all your possessions
- Never having met your spouse
- Losing a family member
- Losing a sense such as your sight or your hearing.
You can also imagine how situations that you are about to embark on will go wrong.
While you may think that this type of pessimism is not conducive to a happy and fulfilling life, it can actually turn your life into pure gold by making you realise that all these bad things have not happened to you.
Seems solid; gratitude in general seems pretty well backed up scientifically.
Eliezer literally on stoicism: https://www.lesswrong.com/posts/vjmw8tW6wZAtNJMKo/which-parts-are-me