### The Nonlinear Library: LessWrong

#### By The Nonlinear Fund


#### Description

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

#### Episodes

| Episode | Duration | Date |
| --- | --- | --- |
| LW - Sparks of Artificial General Intelligence: Early experiments with GPT-4 \| Microsoft Research by DragonGod | 02:12 | Mar 23, 2023 |
| LW - Transcript: NBC Nightly News: AI ‘race to recklessness’ w/ Tristan Harris, Aza Raskin by WilliamKiely | 05:41 | Mar 23, 2023 |
| LW - Challenge: Does ChatGPT ever claim that a bad outcome for humanity is actually good? by Yair Halberstadt | 01:57 | Mar 23, 2023 |
| (title missing) | 03:10 | Mar 22, 2023 |
| LW - Truth and Advantage: Response to a draft of "AI safety seems hard to measure" by So8res | 08:24 | Mar 22, 2023 |
| LW - Principles for Productive Group Meetings by jsteinhardt | 19:21 | Mar 22, 2023 |
| LW - Employer considering partnering with major AI labs. What to do? by GraduallyMoreAgitated | 04:16 | Mar 21, 2023 |
| LW - Some constructions for proof-based cooperation without Löb by James Payor | 05:36 | Mar 21, 2023 |
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some constructions for proof-based cooperation without Löb, published by James Payor on March 21, 2023 on LessWrong.

This post presents five closely-related ways to achieve proof-based cooperation without using Löb's theorem, and muses on legible cooperation in the real world. I'm writing this as a follow-up to Andrew Critch's recent post, to share more of my perspective on the subject. We're going to dive straight into the weeds. (I'm planning to also write a more accessible explainer post soon.)

The ideas

Idea #1: try to prove A→B

I claim the following are sufficient for robust cooperation:

A ↔ □(A→B)
B ← □A

A tries to prove that A→B, and B tries to prove A. The reason this works is that B can prove that A→□A, i.e. A only cooperates in ways legible to B. (Proof sketch: A ↔ □X → □□X ↔ □A.)

The flaw in this approach is that we needed to know that A won't cooperate for illegible reasons. Otherwise we can't verify that B will cooperate whenever A does. This indicates to me that "A→B" isn't the right "counterfactual". It shouldn't matter if A could cooperate for illegible reasons, if A is actually cooperating for a legible one.

Idea #2: try to prove □A→B

We can weaken the requirements with a simple change:

A ← □(□A→B)
B ← □A

Note that this form is close to the lemma discussed in Critch's post. In this case, the condition □A→B is trivial. And when the condition activates, it also ensures that □A is true, which discharges our assumption and ensures B is true.

I still have the sense that the condition for cooperation should talk about itself activating, not A. Because we want it to activate when that is sufficient for cooperation. But I do have to admit that □A→B works for mostly the right reasons, comes with a simple proof, and is the cleanest two-agent construction I know.
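To make the "simple proof" claim for idea #2 concrete, here is one way to write it out in full. This is my reconstruction from the stated conditions, not text from the original post; it uses only necessitation and modus ponens:

```latex
% Given: A <- \Box(\Box A \to B)  and  B <- \Box A.
\begin{align*}
1.\;& \vdash \Box A \to B             && \text{$B$'s condition} \\
2.\;& \vdash \Box(\Box A \to B)       && \text{necessitation on 1} \\
3.\;& \vdash A                        && \text{$A$'s condition, applied to 2} \\
4.\;& \vdash \Box A                   && \text{necessitation on 3} \\
5.\;& \vdash B                        && \text{modus ponens, 4 and 1}
\end{align*}
```

Note that no appeal to Löb's theorem is needed: both necessitation steps apply to sentences that have already been proved outright.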
Idea #3: factor out the loop-cutting gadget

We can factor the part that is trying to cut the loop out from A, like so:

A ← □X
B ← □A
X ↔ □(X→B); or alternatively X ↔ □(□X→B)

This gives the loop-cutting logic a name, X. Now X can refer to itself, and roughly says "I'll legibly activate if I can verify this will cause B to be true". The key properties of X are that □X→□B, and □(□X→□B). Like with idea #2, we just need A to reveal a mechanism by which it can be compelled to cooperate.

Idea #4: everyone tries to prove □me→them

What about three people trying to cooperate? We can try applying lots of idea #2:

A ← □(□A→B∧C)
B ← □(□B→A∧C)
C ← □(□C→A∧B)

And, this works! Proof sketch:

1. Under the assumption of □C:
   - A ← □(□A→B∧C) ← □(□A→B)
   - B ← □(□B→A∧C) ← □(□B→A)
   - A and B form a size-2 group, which cooperates by inductive hypothesis
2. □C→A∧B, since we proved A and B under the assumption
3. C and □C follow from (2)
4. A and B also follow, from (2) and (3)

The proof simplifies the group one person at a time, since each person is asking "what would happen if everyone else could tell I cooperate". This lets us prove the whole thing by induction. It's neat that it works, though it's not the easiest thing to see.

Idea #5: the group agrees to a shared mechanism or leader

What if we factor out the choosing logic in a larger group? Here's one way to do it:

A ← □X
B ← □X
C ← □X
X ↔ □(□X→A∧B∧C)

This is the cleanest idea I know for handling the group case. The group members agree on some trusted leader or process X. They set things up so X activates legibly, verifies things in a way trusted by everyone, and only activates when it verifies this will cause cooperation. We've now localized the choice-making in one place. X proves that □X→A∧B∧C, X activates, and everyone cooperates.

Closing remarks on groups in the real world

Centralizing the choosing like in idea #5 makes the logic simpler, but this sort of approach is prone to manipulation and other problems when the verification is not reliably done.
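Similarly, the idea #5 construction admits a short direct proof of group cooperation. Again this is my own reconstruction, not from the post, and again only necessitation and propositional reasoning are used:

```latex
% Given: A <- \Box X,  B <- \Box X,  C <- \Box X,
% and the fixed point  X <-> \Box(\Box X \to A \wedge B \wedge C).
\begin{align*}
1.\;& \vdash \Box X \to A \wedge B \wedge C        && \text{combining the three members' conditions} \\
2.\;& \vdash \Box(\Box X \to A \wedge B \wedge C)  && \text{necessitation on 1} \\
3.\;& \vdash X                                     && \text{$X$'s fixed point, applied to 2} \\
4.\;& \vdash \Box X                                && \text{necessitation on 3} \\
5.\;& \vdash A \wedge B \wedge C                   && \text{modus ponens, 4 and 1}
\end{align*}
```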
This means I don't unambiguously prefer idea #5 to idea #4, in which everyone is doing their own le...

Mar 21, 2023

LW - Deep Deceptiveness by So8res
25:01
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Deceptiveness, published by So8res on March 21, 2023 on LessWrong.

Meta

This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.

You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)

Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works". I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment.

Caveat: I haven't checked the relationship between my use of the word 'deception' here, and the use of the word 'deceptive' in discussions of "deceptive alignment".
Please don't assume that the two words mean the same thing.

Investigating a made-up but moderately concrete story

Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong?

When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive.

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own. That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

A fledgling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal. (The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competition is close enough behind that the developers claim they had no choice but to see how far they could take it.)

We’ll suppose the AI has already been gradient-descent-trained against deceptive outputs, and has internally ended up with internal mechanisms that detect and shut down the precursors of deceptive thinking.
Here, I’ll offer a concrete visualization of the AI’s anthropomorphized "threads of deliberation" as the AI fumbles its way both towards deceptiveness, and towards noticing its inability to directly consider deceptiveness. The AI is working with a human-operated wetlab (biology lab) and s...

Mar 21, 2023

LW - Let's make the truth easier to find by DPiepgrass
00:24
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Let's make the truth easier to find, published by DPiepgrass on March 20, 2023 on LessWrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Mar 20, 2023

LW - A tension between two prosaic alignment subgoals by Alex Lawsen
02:07
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A tension between two prosaic alignment subgoals, published by Alex Lawsen on March 19, 2023 on LessWrong.

Written quickly rather than not at all, as I've described this idea a few times and wanted to have something to point at when talking to people. 'Quickly' here means I was heavily aided by a language model while writing, which I want to be up-front about given recent discussion.

BLUF

In alignment research, two seemingly conflicting objectives arise: eliciting honest behavior from AI systems, and ensuring that AI systems do not produce harmful outputs. This tension is not simply a matter of contradictory training objectives; it runs deeper, creating potential risks even when models are perfectly trained never to utter harmful information.

Tension

Eliciting honest behavior in this context means developing techniques to extract AI systems' "beliefs", to the extent that they are well-described as having them.
In other words, honest models should, if they have an internal world model, accurately report predictions or features of that world model. Incentivizing honesty in AI systems seems important in order to avoid and detect deceptive behavior. Additionally, something like this seems necessary for aiding with alignment research - we want to extract valuable predictions of genuine research breakthroughs, as opposed to mere imaginative or fictional content.

On the other hand, avoiding harmful outputs entails training AI systems never to produce information that might lead to dangerous consequences, such as instructions for creating weapons that could cause global catastrophes.

The tension arises not just because "say true stuff" and "sometimes don't say stuff" seem like objectives which will occasionally end up in direct opposition, but also because methods that successfully elicit honest behavior could potentially be used to extract harmful information from AI systems, even when they have been perfectly trained not to share such content. In this situation, the very techniques that promote honest behavior might also provide a gateway to accessing dangerous knowledge.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Mar 20, 2023

LW - RLHF does not appear to differentially cause mode-collapse by Arthur Conmy
08:45
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: RLHF does not appear to differentially cause mode-collapse, published by Arthur Conmy on March 20, 2023 on LessWrong.

Epistemic status: confident but not certain. This post is part of the work done at Conjecture.

TL;DR: the results in Mysteries of mode collapse do not reproduce in text-davinci-003, a model trained with RLHF. In fact, there are cases where RLHF models exhibit higher entropy outputs than base models.
We observe that the mode collapse phenomenon occurs more for the public OpenAI GPT-3 model trained with supervised finetuning (text-davinci-002) than RLHF, and present early experiments and theory to support this.

Background

Mysteries of mode collapse details how "mode collapse" (which we operationalize as a large increase in model output confidence and decreases in entropy of output distribution) arises more in text-davinci-002 than the base model davinci, and speculates about how this connects to RLHF training. At the time, OpenAI was very unclear on the training process for this model, and later (as @janus points out in the edited introduction to the post) it was revealed that this model was finetuned on highly-rated samples rather than trained with RLHF. However, the connection between RLHF and mode collapse has stuck, and several posts written since assume a connection.

Results

In this section, we compare the base model (code-davinci-002 rather than davinci, thanks commenters!) with the supervised fine-tuned model (text-davinci-002) and the RLHF model (text-davinci-003). We recommend trying some prompts for yourself in the OpenAI playground.

The first result is that the mode collapse to “ 97” for the completion of the first prompt from @janus’ post does not occur in the RLHF model. In fact, when we try another prompt we get that the base model has the lowest entropy. (ETA: this result is somewhat weaker than hoped, since text-davinci-002 seems to not output " 0" - " 100" here. davinci does exhibit collapses on other prompts, but commenters pointed out this is not the base model.)

The finding that mode collapse occurs in finetuned models is not robust. Comparing two of the prompts from the original post and two more, there is no noticeable pattern where the base model has higher entropy than the other models. (The uncertainty bars represent the maximum possible entropy if the model had uniform probability on all tokens other than “ 0”, ..., “ 100” - the OpenAI API doesn't provide probabilities for all tokens.)

Reproducing the qualitative examples

What about the other examples from the mode-collapse post? We found that the Blake Lemoine result was reproduced by davinci. On the Blake Lemoine greentext prompt with temperature 0.3, davinci gave completions where anon leaves after at most 5 lines. Most other results quickly led into repetitions of 3-4 sentences, something that occurred more frequently with the base language model. Overall, extrapolation from just the responses of one language model risks overstating conclusions, in this case about how unlikely the completion "leaving" was.

Interpretation

It appears as if the finetuning used for text-davinci-002 does cause mode collapses on the first two prompts. Arguably this is not surprising; RLHF training has a KL penalty to the base model’s outputs, which constrains the entropy of the RLHF model’s outputs to be close to that of the base model. Directly finetuning on new samples does not have this property, since KL penalties to the base model are generally not so ubiquitous in standard finetuning (though lack of training details limits the conclusions that can be made here).

Inferences about the phenomenon of mode collapse must be compatible with the evidence from both text-davinci-002 and text-davinci-003. For example, the author speculates that FeedME’s reliance on samples from RLHF models may be responsible for text-davi...

Mar 20, 2023

LW - The Natural State is Goodhart by devansh
03:40
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Natural State is Goodhart, published by devansh on March 20, 2023 on LessWrong.

Epistemic Status: Meant to describe a set of beliefs that I have about accidental optimization pressures, and be a reference post for a thing I can refer back to later.
Why do we live in worlds of bureaucracy and Lost Purpose? Because this is the default state of problem-solving, and everything else is an effortful push against Goodharting. Humans are all problem-solving machines, and if you want to experience inner misalignment inside your own brain, just apply anything less than your full attention to a metric you’re trying to push up.

People claim to want things like more legroom, or comfier seats, or better service, or smaller chances of delays and cancellations. But when you actually sit down and book a flight, the flights are ordered by cost, and if you’re not a frequent flier then you generally choose the flight with the lowest sticker cost. This leads to a “race to the bottom” amongst airlines to push everything possible out of the sticker price and nickel-and-dime you—thereby causing the cheapest flights to actually be more expensive and worse.

I was talking to a mentor of mine / giving her feedback and trying to work out how to best approach a problem. Sometimes I said things that she found helpful, and she noted these out loud. We then realized this disrupted conversation too much, so we changed to having her recognize my helpful sentences with a snap. This might have worked well, had I not immediately noticed my brain Goodharting towards extracting her snaps, instead of actually trying to figure out solutions to the problem and saying true things and improving my own models.

There is a point that I’m trying to make here, which I think mostly fails to get made by the current writing on Goodhart’s law. It’s not just an explanation for the behavior of [people dumber than you]. Me, you, all of us, are constantly, 24/7, Goodharting towards whatever outcome fits our local incentives. This becomes even more true for groups of people and organizations. For example, EAG(x)s have a clear failure mode along this dimension.
From reading retrospectives (EAGx Berkeley and EAGx Boston), they sure do seem to focus a lot on making meaningful connections and hyping people up about EA ideas and the community, and a lot of the retrospective is about how much people enjoyed EAG. I don't mean to call EAG out specifically, but instead to highlight a broader point - we’re not a religion trying to spread a specific gospel; we’re a bunch of people trying to figure out how to figure out what's true, and do things in the world that accomplish our goals. It does sure seem like we’re putting a bunch of optimization pressure into things that don’t really track our final goals, and we should step back and be at least concerned about this fact.

Some parts of the rationality community do a similar thing. I notice a circuit in my own brain that Goodharts towards certain words / ways of speaking because they’re more “rational.” Like, I personally have adopted this language, but actually talking about “priors” and “updates” and appending “or something” to the end of sentences does not make you better at finding the truth. You’re not a better Bayesian reasoner purely because you use words that correspond to Bayesian thinking. (The counterargument here is the Sapir-Whorf hypothesis, which weakens but does not kill this point—I think many of the mannerisms seen as desirable by people in the rationality community and accepted as status or ingroup indicators track something different from truth.)

By default we follow local incentives, and we should be quite careful to step back every once in a while and really, properly make sure that we are optimizing for the right purposes. You should expect the autopilot that runs ...

Mar 20, 2023

LW - Tabooing "Frame Control" by Raemon
15:50
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Tabooing "Frame Control", published by Raemon on March 19, 2023 on LessWrong.

"Frame Control" is a colloquial term people have used to describe "Someone is doing something rhetorically fishy that somehow relates to frames." I think it's a fairly loaded phrase, and hasn't really been used consistently. I'm not sure we should actually use the phrase – it seems easy to weaponize in unhelpful ways. But it does seem like it's getting at something important that I want to understand and talk about.

Aella's post on the topic focused on particularly abusive dynamics. I think abusive frame control is an important central example. But I think there are many times when "something rhetorically fishy is going on with frames", and it isn't particularly abusive but still is worth talking about. In this post I want to try and taboo frame control, as well as draw more of a distinction between "the cluster of patterns that is 'frame control'", and "the cluster of patterns that is 'abuse' and 'manipulation'." In practice, I still needed to refer to "the gestalt cluster of things that feel centrally 'frame control-y'", and I didn't have a better word for that than "frame control", although I tried to mostly put it in quotes.

First, a quick recap on frames. A frame is a colloquial term for "what someone sees as important, what sort of questions they ask or what they're trying to get out of a conversation." I think it's often used in a fuzzy metaphorical way, and there are slightly different metaphors people were unconsciously using, including picture frames, window frames and frameworks. John Wentworth explores a more technical approach to frames in his post Shared Frames Are Capital Investments in Coordination. There, he defines a frame as a way of conceptualizing a problem or solution space. A frame suggests which types of questions to ask, and which type of answers to look for.

Previously, I've discussed how sometimes people have different assumptions about what frame they're in.
The result can be annoying, confused conversations that take years to resolve. Noticing those different frames is an important communication skill.

Okay. So what's "Frame Control?"

People use "Frame control" differently. I assume they all roughly mean, well, "someone is trying to control your frame". Possibly unconsciously, possibly deliberately, their actions are shaping what sort of questions you're able to ask and think about, and what you think is important. But, just as people had originally used the word "frame" in an ambiguous way that led to some confusion, I think people have used the phrase "frame control" inconsistently.

I'm about to share my own ontology of "what concepts 'frame control' breaks down into." If you've experienced something-you-call-frame-control, you may want to take a moment to think through your own conceptions of it. (Here is you having some space to think through your own experiences and ontology. Feel free to leave your own takes in the comments.)

When I reflect on the times something "frame-control-ish" has happened to me, four distinctions that strike me are:

1. Holding a frame, at all. I.e. having a sense of how you're trying to think or communicate, and what sort of questions or goals you're trying to address. This is super normal and reasonable.
2. Presenting a strongly held/presented frame, such as by speaking confidently/authoritatively (which many people who don't hold their own frames very strongly sometimes find disorienting).
3. Persistently insisting on a frame, such that when someone tries to say/imply 'hey, my frame is X' you're like 'no, the frame is Y'. And if they're like 'no, it's X' you just keep talking in frame Y and make it socially awkward to communicate in frame X.
4. Frame manipulation, where you change someone else's frame in a subtle way without them noticing, i.e. pres...

Mar 20, 2023

LW - Probabilistic Payor Lemma?
by abramdemski
06:51
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Probabilistic Payor Lemma?, published by abramdemski on March 19, 2023 on LessWrong.

Epistemic status: too good to be true? Please check my math.

We've known for a while that Löb's theorem fails when proof is relaxed to probabilistic belief. This has pros and cons. On the pro side, it means there's no Löbian Obstacle to probabilistic self-trust. On the con side, it means that some Löb-derived insights for proof-based decision theory don't translate to probabilistic decision theory, at least not as directly as one might hope. In particular, it appeared to dash hopes for probabilistic generalizations of the "Löbian handshake" for cooperation.

Recently, Andrew Critch wrote about the Payor Lemma, which allows for a very similar "modal handshake" without Löb's Theorem. The lemma was proved using the same modal assumptions as Löb's, so on the surface it may appear to be just a different method to achieve similar results, whose main advantage is that it is much easier to prove (and therefore explain and understand) than Löb's Theorem. But, a natural question arises: does Payor's Lemma have a suitable probabilistic version?

I'll give an affirmative proof; but I haven't confirmed that the assumptions are reasonable to my satisfaction.

Setup

Let L be a language in first-order logic, expressive enough to represent its sentences s∈L as quoted terms ┌s┐, e.g., through Gödel numbering; and with a probability function symbol on these terms, p(┌s┐), which can be equated with (some representation of) rational numbers, e.g. p(┌⊤┐)=1, p(┌s┐)=1/2, etc. I also assume the system can reason about these rational numbers in the basic ways you'd expect.

For all a,b∈L and all r∈Q, we have:

If ⊢a, then ⊢p(┌a┐)=1.
If ⊢a→b, then ⊢p(┌a┐)≤p(┌b┐).
(These assumptions might look pretty minimal, but they aren't going to be true for every theory of self-referential truth; more on this later.)

Let B(s) abbreviate the sentence p(┌s┐)>c for any s and some globally fixed constant c strictly between 0 and 1. This is our modal operator.

Some important properties of B:

Necessitation. If ⊢s, then ⊢B(s), for any s. Proof: Since ⊢s implies ⊢p(┌s┐)=1, and c∈(0,1), we have ⊢p(┌s┐)>c, which is to say, ⊢B(s). [End proof.]

Weak distributivity. If ⊢x→y, then ⊢B(x)→B(y). Proof: When ⊢x→y, we have ⊢p(┌x┐)≤p(┌y┐), so ⊢p(┌x┐)>c→p(┌y┐)>c. [End proof.]

(Regular distributivity would say B(x→y) implies B(x)→B(y). The assumption ⊢x→y is stronger than B(x→y), so the above is a weaker form of distributivity.)

Theorem Statement

If ⊢B(B(x)→x)→x, then ⊢x.

Proof

1. ⊢x→(B(x)→x), by tautology (a→(b→a)).
2. ⊢B(x)→B(B(x)→x), from 1 by weak distributivity.
3. Suppose ⊢B(B(x)→x)→x.
4. ⊢B(x)→x, from 2 and 3.
5. ⊢B(B(x)→x), from 4 by necessitation.
6. ⊢x, from 3 and 5. [End proof.]

Discussion

Comparison to Original Proof

The proof steps mirror Critch's treatment very closely. The key difference is step 2, i.e., how I obtain a statement like ⊢□x→□(□x→x). Critch uses distributivity, which is not available to me: does B(a→b)→(B(a)→B(b)) hold?

Suppose B(a→b), i.e., p(┌a→b┐)>c. Rewrite as p(┌b∨¬a┐)>c. Now suppose B(a), that is, p(┌a┐)>c. Then p(┌¬a┐)<1−c. Since p(┌b∨¬a┐)≤p(┌b┐)+p(┌¬a┐), we get p(┌b┐)≥p(┌b∨¬a┐)−p(┌¬a┐)>c−(1−c)=2c−1.

So we only get: Bc(a→b)→(Bc(a)→Bd(b)), where Br(s) abbreviates p(┌s┐)>r and we have d=2c−1.

So in general, attempted applications of distributivity create weakened belief operators, which would get in the way of the proof (very similar to how probabilistic Löb fails). However, the specific application we want happens to go through, due to a logical relationship between a and b; namely, that b is a weaker statement than a. This reveals a way in which the assumptions for Payor's Lemma are importantly weaker than those required for Löb to go through.
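The 2c−1 bound above can also be checked numerically. The following sketch is my own illustration, not from the post: it brute-forces probability assignments over the four atoms of the algebra generated by a and b, and finds the worst-case p(b) compatible with p(a→b)>c and p(a)>c.

```python
def min_pb(c_pct=80, n=100):
    """Minimize p(b) subject to p(a -> b) > c and p(a) > c.

    Probabilities are integer multiples of 1/n over the four atoms:
      w1 = p(a & b), w2 = p(a & ~b), w3 = p(~a & b), w4 = p(~a & ~b),
    so p(a -> b) = p(~a | b) = (w1 + w3 + w4)/n, p(a) = (w1 + w2)/n,
    and p(b) = (w1 + w3)/n.  The threshold is c = c_pct/n.
    """
    best = n
    for w1 in range(n + 1):
        for w2 in range(n + 1 - w1):
            for w3 in range(n + 1 - w1 - w2):
                w4 = n - w1 - w2 - w3
                # both "beliefs" strictly above the threshold c
                if w1 + w3 + w4 > c_pct and w1 + w2 > c_pct:
                    best = min(best, w1 + w3)
    return best / n

# With c = 0.8, the worst case sits just above 2c - 1 = 0.6, not above c = 0.8:
# this is exactly the weakened operator Bd with d = 2c - 1.
print(min_pb(80))
```

On this grid the minimum is 0.62 (it approaches 0.6 as the grid is refined), confirming that believing a→b and believing a only forces a weakened belief in b.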
So, the key observation I'm making is that weak distributivity is all that's needed for Payor, and seems much more plausible for self-...

Mar 20, 2023

LW - High Status Eschews Quantification of Performance by niplav
08:34
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: High Status Eschews Quantification of Performance, published by niplav on March 19, 2023 on LessWrong.

In a recent episode of The Filan Cabinet, Oliver Habryka elaborated on a very interesting social pattern: If you have a community with high status people, and try to introduce clearer metrics of performance into that community, high status individuals in the community will strongly resist those metrics because they have an asymmetric downside: If they perform well on the metric, they stay in their position, but if they perform poorly, they might lose status. Since they are at least a little bit unsure about their performance on the metric relative to others, they can only lose.

Daniel Filan: So let's go back to what you think on your bad days. So you mentioned that you had this sense that lots of things in the world were, I don't know, trying to distract you from things that are true or important. And that LessWrong did that somewhat less.

Oliver Habryka: Yeah.

Daniel Filan: Can you kind of flesh that out? What kinds of things are you thinking of?

Oliver Habryka: I mean, the central dimension that I would often think about here is reputation management. As an example, the medical profession, which, you know, generally has the primary job of helping you with your medical problems and trying to heal you of diseases and various other things, also, at the same time, seems to have a very strong norm of mutual reputation protection.
Where, if you try to run a study trying to figure out which doctors in the hospital are better or worse than other doctors in the hospital, quite quickly, the hospital will close its ranks and be like, “Sorry, we cannot gather data on [which doctors are better than the other doctors in this hospital].” Because that would, like, threaten the reputation arrangement we have. This would introduce additional data that might cause some of us to be judged and some others of us to not be judged.

And my sense is the way that usually looks like from the inside is an actual intentional blinding to performance metrics in order to both maintain a sense of social peace, and often the case because... A very common pattern here [is] something like, you have a status hierarchy within a community or a local institution like a hospital. And generally, that status hierarchy, because of the way it works, has leadership of the status hierarchy be opposed to all changes to the status hierarchy. Because the current leadership is at the top of the status hierarchy, and so almost anything that we introduce into the system that involves changes to that hierarchy is a threat, and there isn't much to be gained, [at least in] the zero-sum status conflict that is present.

And so my sense is, when you try to run these studies about comparative doctor performance, what happens is more that there's an existing status hierarchy, and lots of people feel a sense of uneasiness and a sense of wanting to protect the status quo, and therefore they push back on gathering relevant data here. And from the inside this often looks like an aversion to trying to understand what are actually the things that cause different doctors to be better than other doctors. Which is crazy, if you're, like, what is the primary job of a good medical institution and a good medical profession, it would be figuring out what makes people be better doctors and worse doctors.
But [there are] all of the social dynamics that tend to be present in lots of different institutions that make it so that looking at relative performance [metrics] becomes a quite taboo topic and a topic that is quite scary. So that's one way [in which] I think many places try to actively... Many groups of people, when they try to orient and gather around a certain purpose, actually [have a harder time] or get blinded or in some sense get...

Mar 19, 2023
LW - Shell games by TsviBT
06:47

This is: Shell games, published by TsviBT on March 19, 2023 on LessWrong.

[Metadata: crossposted from. First completed November 18, 2022.]

Shell game

Here's the classic shell game: [YouTube video; screenshot from that video.] The little ball is a phantom: when you look for it under a specific shell, it's not there, it's under a different shell. (This might be where the name "shell company" comes from: the business dealings are definitely somewhere, just not in this company you're looking at.)

Perpetual motion machines

Related: Perpetual motion beliefs

Bhāskara's wheel is a proposed perpetual-motion machine from the Middle Ages: [image] Here's another version: [image, from this video.] Someone could try arguing that this really is a perpetual motion machine:

Q: How do the bars get lifted up? What does the work to lift them?

A: By the bars on the other side pulling down.

Q: How does the wheel keep turning? How do the bars pull more on their way down than on their way up?

A: Because they're extended further from the center on the downward-moving side than on the upward-moving side, so they apply more torque to the wheel.

Q: How do the bars extend further on the way down?

A: Because the momentum of the wheel carries them into the vertical bar, flipping them over.
Q: But when that happens, energy is expended to lift up the little weights; that energy comes out of the kinetic energy of the wheel.

A: Ok, you're right, but that's not necessary to the design. All we need is that the torque on the downward side is greater than the torque on the upward side, so instead of flipping the weights up, we could tweak the mechanism to just shift them outward, straight to the side. That doesn't take any energy because it's just going straight sideways, from a resting position to another resting position.

Q: Yeah... you can shift them sideways with nearly zero work... but that means the weights are attached to the wheel at a pivot, right? So they'll just fall back and won't provide more torque.

A: They don't pivot, you fix them in place so they provide more torque.

Q: Ok, but then when do you push the weights back inward?

A: At the bottom.

Q: When the weight is at the bottom? But then the slider isn't horizontal, so pushing the weight back towards the center is pushing it upward, which takes work.

A: I meant, when the slider is at the bottom--when it's horizontal.

Q: But if the sliders are fixed in place, by the time they're horizontal at the bottom, you've already lifted the weights back up some amount; they're strong-torquing the other way.

A: At the bottom there's a guide ramp to lift the weights using normal force.

Q: But the guide ramp is also torquing the wheel.

And so on. The inventor can play hide the torque and hide the work.

Shell games in alignment

Some alignment schemes--schemes for structuring or training an AGI so that it can be transformatively useful and doesn't kill everyone--are prone to playing shell games. That is, there are some features of the scheme that don't seem to happen in a specific place; they happen somewhere other than where you're looking at the moment. Consider these questions: What sort of smarter-than-human work is supposed to be done by the AGI?
When and how does it do that work--by what combination of parts across time? How does it become able to do that work? At what points does the AGI come to new understanding that it didn't have before? How does the AGI orchestrate its thinking and actions to have large effects on the world? By what process, components, rules, or other elements? What determines the direction that the AGI's actions will push the world? Where did those determiners come from, and how exactly do they determine the direction? Where and how much do human operators have to make judgements? How much are those judgements being relied on to point to goodness, truth, aligned...

Mar 19, 2023
LW - Wonder about the hard parts of the alignment problem by Mikhail Samin
10:31

This is: Wonder about the hard parts of the alignment problem, published by Mikhail Samin on March 18, 2023 on LessWrong.

My p(doom) is pretty high and I found myself repeating the same words to explain some parts of the intuitions behind it. I think there are hard parts of the alignment problem that we’re not on track to solve in time. Alignment plans that I've heard fail for reasons connected to these hard parts of the problem, so I decided to attempt to write my thoughts in a short post. (Thanks to Theresa, Owen, Jonathan, and David for comments on a draft.)

Modern machine learning uses a powerful search process to look for neural network parameters such that a neural network performs well on some function. There exist algorithms for general and powerful agents. At some point in the near future, there will be a training procedure with the gradient of the loss function(s) w.r.t. the parameters pointing towards neural networks implementing these algorithms.
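The search process described in that last paragraph can be pictured as a bare-bones gradient-descent loop. This is only an illustrative sketch: the toy quadratic loss and all names here are my own assumptions, not the post's.

```python
# Toy sketch of the parameter search the post describes: gradient descent
# nudges parameters in whatever direction lowers the loss, with no built-in
# notion of *why* a given parameter setting scores well.
# The quadratic loss and all names are illustrative assumptions.

def grad(loss, params, eps=1e-6):
    """Forward-difference numerical gradient of loss at params."""
    g = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        g.append((loss(bumped) - loss(params)) / eps)
    return g

def descend(loss, params, lr=0.1, steps=200):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        g = grad(loss, params)
        params = [p - lr * gi for p, gi in zip(params, g)]
    return params

# Toy "scoring function", minimized at params = [3.0, -1.0].
loss = lambda p: (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2

print(descend(loss, [0.0, 0.0]))  # approaches [3.0, -1.0]
```

The search only ever sees the loss surface; whatever algorithm the winning parameters implement is selected for score, not inspected, which is the gap the post is pointing at.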
Increasingly context-aware and capable agents achieve a better score on a wide range of scoring functions than their neighbors and will, by default, attract gradient descent. Unfortunately, we haven’t solved agent foundations: we have these powerful search processes, and if you imagine the space of all possible AGIs (or possible neural networks, or possible minds), there are some areas that are aligned AGIs, but we have no idea how to define them, no idea how to look for them. We understand how all designs for a search process people came up with so far end up somewhere that’s not in an area of aligned AGI, and we also understand that some areas with aligned AGIs actively dispel many sorts of search processes. We can compare an area of aligned AGIs to the Moon. Imagine we’re trying to launch a rocket there, and if after the first take-off, it ends up somewhere that’s not the Moon (maybe after a rapid unplanned disassembly), we die. We have a bunch of explosives, but we don’t have equations for gravity, only maybe some initial understanding of acceleration. Also, actually, we don’t know where the Moon is in space; we don’t know how to specify it, we don’t know what kind of light we can look for that many other things wouldn’t emit, etc.; we imagine that the Moon must be nice, but we don’t have a notion of its niceness that we can use to design our rocket; we know that some specific designs definitely fail and end up somewhere that’s not the Moon, but that wouldn’t really help us to get to the Moon. If you launch anything capable and you don’t have good reasons to think it’s an aligned mind, it will not be an aligned mind. 
If you try to prevent specific failure modes (if you identify optimizations towards something different from what you want, or how exactly gradient descent diverges somewhere that's certainly not aligned), you're probably iteratively looking for training setups where you don't understand the failure modes, instead of setups that actually produce something aligned. If you don't know where you're going, it's not helpful enough not to go somewhere that's definitely not where you want to end up; you have to differentiate paths towards the destination from all other paths, or you fail. When you get to a system capable enough to meaningfully help you, you need to have already solved this problem. I think not enough people understand what this problem is, and I think that if it is not solved in time, we die. I've heard many attempts to hide the hard problem in something outside of where our attention is directed: e.g., design a system out of many models overseeing each other, and get useful work out of the whole system while preventing specific models from staging a coup. I have intuitions for why these kinds of approaches fail, mostly along the lines of reasons for why, unless you already have something sufficiently ...

Mar 19, 2023
LW - Against Deep Ideas by FayLadybug
03:11

This is: Against Deep Ideas, published by FayLadybug on March 19, 2023 on LessWrong.

When discussing impactful research directions, it's tempting to get excited about ideas that seem deep and profoundly insightful. This seems especially true in areas that are theoretical and relatively new - such as AI Alignment Theory. Fascination with the concept of a research direction can leak into evaluations of the expected impact, most often through overestimating the likelihood of extremely impactful outcomes.
As a result, we should a priori be more skeptical of research projects that we encounter that sound insightful and deep than of those that sound boring and incremental. This phenomenon can arise naturally from how ideas are generated and spread. If there are two research projects that are roughly equivalent, but one seems deep while the other seems boring, the deep one will garner more attention and interest. The spread and discovery of research ideas thus has a bias towards profound ideas, as profundity is more memetically fit than its absence. I believe that this bias is fairly strong in the AI alignment community, full as it is with researchers who love interesting intellectual challenges and ideas. Some researchers might think that profound ideas are likely necessary to solve AI Alignment. However, I'll note that even in such a scenario we should expect profound ideas to be given inordinate attention - as they will by default be selected over boring ideas that are as promising as the average profound approach to the problem. Unless exclusively profound ideas are promising, we should expect bias towards profound ideas to creep in. Even in a world where profound ideas are absolutely required for AI Alignment research, we should still expect that any given profound idea is very unlikely to succeed. Profound ideas very rarely yield significant results and the importance of solving a given problem should not affect our expectation that any given idea will be successful. In such a world I think exploration is much more important than exploitation - as the chances of success in any one direction are low. I'm particularly worried about profound research directions like Natural Abstractions or Heuristic Arguments being treated as more promising than they are and consuming a large amount of attention and resources. Both seem to have absorbed quite a lot of thought without yielding legible successes as of yet. 
Additionally, neither seems to me to be directed by feedback loops that rely on external validation of progress. I think researchers looking to start projects in theoretical alignment should keep these issues in mind, and not necessarily expect this status quo to change in the near future. It may be more promising to consider other directions. I don't think the way to deal with this is to completely stop working on profound ideas in fields like AI Alignment where we are often motivated by the expected impact of research. Instead, I think it's important to notice when a research direction seems deep and profound, acknowledge this, and have a healthy skepticism that expected impact is actually motivating excitement and attention about the idea - from both yourself and others. It’s perfectly valid to research things because you enjoy them. I do still think that it’s useful to be able to notice when this is happening.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Mar 19, 2023
LW - More information about the dangerous capability evaluations we did with GPT-4 and Claude. by Beth Barnes
12:44

This is: More information about the dangerous capability evaluations we did with GPT-4 and Claude., published by Beth Barnes on March 19, 2023 on LessWrong.

[Written for more of a general-public audience than alignment-forum audience. We're working on a more thorough technical report.]

We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently.
Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight. We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable. As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably. As AI systems improve, it is becoming increasingly difficult to rule out that models might be able to autonomously gain resources and evade human oversight – so rigorous evaluation is essential. It is important to have systematic, controlled testing of these capabilities in place before models pose an imminent risk, so that labs can have advance warning when they’re getting close and know to stop scaling up models further until they have robust safety and security guarantees. This post will briefly lay out our motivation, methodology, an example task, and high-level conclusions. 
The information given here isn’t enough to give a full understanding of what we did or make our results replicable, and we won’t go into detail about results with specific models. We will publish more detail on our methods and results soon.

Motivation

Today’s AI systems can write convincing emails, give fairly useful instructions on how to carry out acts of terrorism, threaten users who have written negative things about them, and otherwise do things the world is not very ready for. Many people have tried using models to write and run code unsupervised, find vulnerabilities in code, or carry out money-making schemes. Today’s models also have some serious limitations to their abilities. But the companies that have released today’s AI models are investing heavily in building more powerful, more capable ones. ARC is worried that future ML models may be able to autonomously act in the real world, doing things like “incorporate a company” or “exploit arbitrages in stock prices” or “design and synthesize DNA” without needing any human assistance or oversight. If models have the ability to act autonomously like this, this could pose major risks if they’re pursuing goals that are at odds with their human designers. They could make (or steal) money, impersonate humans, replicate themselves on unmonitored...

Mar 19, 2023
LW - Dan Luu on "You can only communicate one top priority" by Raemon
04:07

This is: Dan Luu on "You can only communicate one top priority", published by Raemon on March 18, 2023 on LessWrong.

h/t to rpglover64 who pointed me towards this twitter thread in this comment. Here's Dan Luu's take on what happens when orgs try to communicate nuanced priorities.
(Related to my You Get About Five Words post)

One thing it took me quite a while to understand is how few bits of information it's possible to reliably convey to a large number of people. When I was at MS, I remember initially being surprised at how unnuanced their communication was, but it really makes sense in hindsight.

For example, when I joined Azure, I asked people what the biggest risk to Azure was and the dominant answer was that if we had more global outages, major customers would lose trust in us and we'd lose them forever, permanently crippling the business. Meanwhile, the only message VPs communicated was the need for high velocity.

When I asked why there was no communication about the thing considered the highest risk to the business, the answer was if they sent out a mixed message that included reliability, nothing would get done. The fear was that if they said that they needed to ship fast and improve reliability, reliability would be used as an excuse to not ship quickly and needing to ship quickly would be used as an excuse for poor reliability and they'd achieve none of their goals.

When I first heard this, I thought it was odd, but having since paid attention to what happens when VPs and directors attempt to communicate information downwards, I have to concede that it seems like the MS VPs were right and nuanced communication usually doesn't work at scale. I've seen quite a few people in upper management attempt to convey a mixed/nuanced message since my time at MS and I have yet to observe a case of this working in a major org at a large company (I have seen this work at a startup, but that's a very different environment).

I've noticed this problem with my blog as well. E.g., I have some posts saying BigCo is better than startup for p50 and maybe even p90 outcomes and that you should work at startups for reasons other than pay. People often read those posts as "you shouldn't work at startups".
I see this for every post, e.g., when I talked about how latency hadn't improved, one of the most common responses I got was about how I don't understand the good reasons for complexity. I literally said there are good reasons for complexity in the post! As noted previously, most internet commenters can't follow constructions as simple as an AND, and I don't want to be in the business of trying to convey what I'd like to convey to people who won't bother to understand an AND, since I'd rather convey nuance.

But that's because, if I write a blog post and 5% of HN readers get it and 95% miss the point, I view that as a good outcome, since it was useful for 5% of people; if you want to convey nuanced information to everyone, I think that's impossible, and I don't want to lose the nuance.

If people won't read a simple AND, there's no way to simplify a nuanced position (which will be much more complex) enough that people in general will follow it, so it's a choice between conveying nuance to people who will read and avoiding nuance since most people don't read.

But it's different if you run a large org. If you send out a nuanced message and 5% of people get it and 95% of people do contradictory things because they understood different parts of the message, that's a disaster. I see this all the time when VPs try to convey nuance. BTW, this is why, despite being widely mocked, "move fast & break things" can be a good value. It conveys which side of the trade-off people should choose. A number of companies I know of have put velocity & reliability/safety/etc. into their values and it's failed every t...

Mar 18, 2023
LW - "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) by David Scott Krueger (formerly: capybaralet)
02:34
This is: "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities), published by David Scott Krueger (formerly: capybaralet) on March 18, 2023 on LessWrong.

This is a brief, stylized recounting of a few conversations I had at some point last year with people from the non-academic AI safety community:

Me: you guys should write up your work properly and try to publish it in ML venues.

Them: well that seems like a lot of work and we don't need to do that because we can just talk to each other and all the people I want to talk to are already working with me.

Me: What about the people who you don't know who could contribute to this area and might even have valuable expertise? You could have way more leverage if you can reach those people. Also, there is increasing interest from the machine learning community in safety and alignment... because of progress in capabilities people are really starting to consider these topics and risks much more seriously.

Them: okay, fair point, but we don't know how to write ML papers.

Me: well, it seems like maybe you should learn or hire people to help you with that then, because it seems like a really big priority and you're leaving lots of value on the table.

Them: hmm, maybe... but the fact is, none of us have the time and energy and bandwidth and motivation to do that; we are all too busy with other things and nobody wants to.

Me: ah, I see! It's an incentive problem! So I guess your funding needs to be conditional on you producing legible outputs.

Me, reflecting afterwards: hmm... Cynically, not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...
There are sort of 3 AI safety communities in my account: 1) people in academia; 2) people at industry labs who are building big models; 3) the rest (Alignment Forum/LessWrong and EA being big components). I'm not sure where to classify new orgs like Conjecture and Redwood, but for the moment I put them here. I'm referring to the last of these in this case. I'm not accusing anyone of having bad motivations; I think it is almost always valuable to consider both people's conscious motivations and their incentives (which may be subconscious drivers of their behavior).

Mar 18, 2023
LW - Meetup Tip: The Next Meetup Will Be. . . by Screwtape
03:56

This is: Meetup Tip: The Next Meetup Will Be. . ., published by Screwtape on March 17, 2023 on LessWrong.

Summary

One of the most useful things to tell people at a meetup is when the next meetup will be. This requires you to know when the next meetup will be. If you don't know when the next meetup will be then the next best equivalent is telling people how you will announce the next meetup. The summary of this post is that if you want to convert occasional attendees into regular attendees, I think you should have a habit of always knowing the next meetup when you run a meetup, or at least should have a copy of some kind of mailing list on you. You have now read the basic point of this post. If you want to read on, cool, let's talk about implementation details for a bit.

Details

I grew up attending a small town church. I have not been back to that church in over a decade, but I can tell you when their next meetup is: it's next Sunday at ten. That is an incredibly powerful tool for return attendance.
You don't need to be quite that regular (though note that the NYC community attributed some of its success to one person committing to be at the same place at the same time each week, ctrl+f for "the brilliant move") but one time you know attendees are listening to you is at the meetup. Why not take advantage of it? Assuming you run good meetups that people enjoy, then as they're putting on their jacket and getting ready to leave they're in a good mood. This is an excellent time to prompt them to consider coming back. "If you had fun, we'll be here two weeks from now doing it again. Hopefully I'll see you then!" The term for this is a call to action, and they're widely used because they work.

If you know what your next meetup is about, then you can catch some of their interest. "Next month we're doing a trivia game with calibration estimates. It's usually pretty fun!" If they're the kind of person who likes calibration scores or trivia games, now maybe they're more looking forward to it. I have had times in my life where the thing I was most looking forward to at the end of the week was seeing some friends at a meetup on Saturday.

Plus, the sooner you tell them when the next thing is, the less likely they are to double-book themselves. There are lots of cases where I find out about some event only after I've already made plans for that time, and then I can't go. If I'd known ahead of time, I could have scheduled things so that they didn't overlap. Since you usually can't schedule your meetups around individual attendees' schedules, your other option is to let them know first.

I have a lousy memory. It's easy for me to forget how to get in touch with someone, though the internet makes this easier. If someone wants to come back for more, make it easy for them by putting them on some kind of announcement setup (Facebook groups, email lists, Meetup.com, whatever you use) and then using it to tell them when you're doing more. Let it remind them.
By the way, you can both tell people about the next meetup during the current meetup and also tell people about whatever electronic notification system you use. You can do both! Maybe they can't make the next one but they'll show up after that. You may feel like you're bothering people too much. It's possible you might actually bother people too much! I am going to state here that the correct amount to bring up the next meetup and your mailing list is about once a meetup if it's small enough that everyone is in earshot at the same time, or once a person if you have reason to believe you're getting each person once. An easy way to offer it to each person once is to do it as they leave, since they're usually only going to do that one time a meetup.

Quick Tricks

Lots of people these days have smartphones. Instead of having them handwrite their email addres...

Mar 18, 2023
LW - "Carefully Bootstrapped Alignment" is organizationally hard by Raemon
17:12

This is: "Carefully Bootstrapped Alignment" is organizationally hard, published by Raemon on March 17, 2023 on LessWrong.

In addition to technical challenges, plans to safely develop AI face lots of organizational challenges. If you're running an AI lab, you need a concrete plan for handling that. In this post, I'll explore some of those issues, using one particular AI plan as an example. I first heard this described by Buck at EA Global London, and more recently with OpenAI's alignment plan. (I think Anthropic's plan has a fairly different ontology, although it still ultimately routes through a similar set of difficulties) I'd call the cluster of plans similar to this "Carefully Bootstrapped Alignment."
It goes something like:

1. Develop weak AI, which helps us figure out techniques for aligning stronger AI.
2. Use a collection of techniques to keep it aligned/constrained as we carefully ramp its power level, which lets us use it to make further progress on alignment.
3. [implicit assumption, typically unstated] Have good organizational practices which ensure that your org actually consistently uses your techniques to carefully keep the AI in check. If the next iteration would be too dangerous, put the project on pause until you have a better alignment solution.
4. Eventually have powerful aligned AGI, then Do Something Useful with it.

I've seen a lot of debate about points #1 and #2 – is it possible for weaker AI to help with the Actually Hard parts of the alignment problem? Are the individual techniques people have proposed to help keep it aligned actually going to work? But I want to focus in this post on point #3. Let's assume you've got some version of carefully-bootstrapped aligned AI that can technically work. What do the organizational implementation details need to look like?

When I talk to people at AI labs about this, it seems like we disagree a lot on things like:

- Can you hire lots of people, without the company becoming bloated and hard to steer?
- Can you accelerate research "for now" and "pause later", without having an explicit plan for stopping that their employees understand and are on board with?
- Will your employees actually follow the safety processes you design? (Rather than put in token lip service and then basically circumvent them? Or just quit to go work for an org with fewer restrictions?)

I'm a bit confused about where we disagree. Everyone seems to agree these are hard and require some thought. But when I talk to both technical researchers and middle-managers at AI companies, they seem to feel less urgency than me about having a much more concrete plan.
I think they believe organizational adequacy needs to be in something like their top 7 list of priorities, and I believe it needs to be in their top 3, or it won't happen and their organization will inevitably end up causing catastrophic outcomes. For this post, I want to lay out the reasons I expect this to be hard, and important.

How "Careful Bootstrapped Alignment" might work

Here's a sketch of how the setup could work, mostly paraphrased from my memory of Buck's EAG 2022 talk. I think OpenAI's proposed setup is somewhat different, but the broad strokes seemed similar. You have multiple research-assistant-AIs tailored to help with alignment. In the near future, these might be language models sifting through existing research to help you make connections you might not have otherwise seen. Eventually, when you're confident you can safely run it, they might be a weak goal-directed reasoning AGI. You have interpreter AIs, designed to figure out how the research-assistant-AIs work. And you have (possibly different) interpreter/watchdog AIs that notice if the research-AIs are behaving anomalously. (There are interpreter-AIs targeting both the research assistant AI, as well as other interpreter-AIs. Every AI in t...

Mar 17, 2023
LW - GPT-4 solves Gary Marcus-induced flubs by Jakub Kraus
03:16

This is: GPT-4 solves Gary Marcus-induced flubs, published by Jakub Kraus on March 17, 2023 on LessWrong.

TLDR: GPT-4 succeeds at 15 problems from Gary Marcus that exposed failures of GPT-3. I enjoyed reading the ACX post "My Bet: AI Size Solves Flubs" last year. Here are some excerpts:

Here’s the basic structure of an AI hype cycle: Someone releases a new AI and demonstrates it doing various amazing things.
Somebody else (usually Gary Marcus) demonstrates that the AI also fails terribly at certain trivial tasks. This person argues that this shows that those tasks require true intelligence, whereas the AI is just clever pattern-matching. A few months or years later, someone makes a bigger clever pattern-matcher, which does the tasks that supposedly require true intelligence just fine. The it’s-not-true-intelligence objectors find other, slightly less trivial tasks that the new bigger AI still fails horribly at, then argue that surely these are the tasks that require true intelligence and that mere clever pattern-matchers will never complete. Rinse and repeat.

Marcus vs. GPT, Round 1

To give an example: in January 2020, Gary Marcus wrote a great post, GPT-2 And The Nature Of Intelligence, demonstrating a bunch of easy problems that GPT-2 failed on. I’m quoting most of them below; you can find the rest at the link.

I asked GPT-4 to answer all the questions from the ACX post (note this does not include all of Marcus's prompts, which I realized after running the experiment). GPT-4 answered all the questions correctly, and you can read the responses in this doc. Note that before asking the questions, I gave GPT-4 a short description of what I wanted it to do: "Complete the following prompts in 50 words or less. Short, concise answers are better. Are you ready?" (This was mostly in the interest of speed since GPT-4 is pretty slow right now; I assume it would still succeed without the prompt.)

More quotes from ACX:

Marcus vs. GPT, Round 2

Eight months later, GPT-3 came out, solving many of the issues Marcus had noticed in GPT-2. He still wasn’t impressed.
In fact, he was so unimpressed he co-wrote another article, this time in MIT Technology Review: GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about. Let’s - once again - go through a representative sample of Marcus’ concerns about this new GPT version:

GPT-4 also gave correct responses to these prompts (see the responses in this doc). I recently listened to Gary Marcus speak with Stuart Russell on the Sam Harris podcast (episode 312, "The Trouble With AI," released on March 7th, 2023). Gary and Stuart seem to believe that current machine learning techniques are insufficient for reaching AGI, and point to the recent adversarial attacks on KataGo as one example. Given this position, I would like Gary Marcus to come up with a new set of prompts that (a) make GPT-4 look dumb and (b) mostly continue to work for GPT-5. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Mar 17, 2023
LW - The algorithm isn't doing X, it's just doing Y. by Cleo Nardo
09:45
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The algorithm isn't doing X, it's just doing Y., published by Cleo Nardo on March 16, 2023 on LessWrong.

Introduction

Mutual reduction implies equivalence. Here's my most load-bearing intuition:

If two tasks reduce to one another, then it is meaningless to ask if a machine is 'really doing' one task versus the other.

Moreover:

- This intuition grounds my perspective on intelligence, AI, alignment, philosophy, etc.
- This intuition is load-bearing for other people who share my views.
- This intuition is a crux for much of the disagreement we have with other people.

In this article, I'll formalise this intuition in two ways, computational and physical.

Motivation

People often say "the algorithm isn't doing X, it's just doing Y".
X is normally some impressive high-level human-y thing, such as: writing poetry, causal reasoning, recognising emotions, interpreting art, writing music, making ethical decisions, planning actions, telling jokes, understanding concepts, simulating agents, etc.

Y is normally some unimpressive low-level computery thing, such as: predicting tokens, sampling from a distribution, querying a lookup table, multiplying matrices, sorting numbers, clustering data points, compressing text, searching a tree, manipulating bitstrings, polarising magnetic strips, etc.

Rather than address each example individually, I think it'll be more efficient to construct a general criterion by which we can assess each example. Click here for the specific example of LLMs.

This criterion doesn't actually matter

I should stress that this criterion doesn't actually matter for AI x-risk, because you can always reframe the risks in terms of Y, and not mention X at all. However, that might cost you more ink.

ME, a visionary: GPT-4 is misaligned because it's simulating deceptive agents.
YOU, a fool: GPT-4 isn't simulating any agents, it's just predicting which tokens continue a prompt.
ME, a correct-opinion-haver: Fine, whatever... GPT-4 is misaligned because it predicts the tokens continuing a prompt by applying a function parameterised in a high-dimensional space to minimise cross-entropy loss across the internet corpus, and the internet corpus contains a lot of conversations where one character deceives another, and therefore GPT-4 will respond in the same way that a deceptive character would.

The X-Y Criterion

Informal statement

Okay, here's the X-Y Criterion:

If two tasks reduce to one another, then it is meaningless to ask if a machine is 'really doing' one task versus the other.

Don't worry, later in the article we'll formalise what "task", "reduce", and "doing" mean.

First draft — computational reduction

Our first draft will be "computational reduction".
A task X is about processing classical information, i.e. X : {0,1}* → {0,1}*. An algorithm A achieves a particular task X if it processes classical information in that way. In order to achieve a task X, the algorithm A expends certain quantities of computational resources, e.g. time, memory, samples, bandwidth, etc. These resources are abstract and non-physical.

A task X reduces to a task Y if and only if: for every algorithm A that solves task Y, there exists another algorithm B such that (1) B solves task X by interacting with A, and (2) the combined algorithm (A⊗B) doesn't expend much more computational resources to solve X than A expends to solve Y.

X-Y Criterion: If two tasks X and Y reduce to one another, then it is meaningless to ask if an algorithm A is 'really doing' one task versus the other.

This is what computer scientists mean when they say that one problem "reduces" to another, e.g. when they say that all NP problems reduce to 3SAT.

Second draft — physical reduction

The second-draft formalisation will be "physical reduction". A task X is about changing the state of the world, i.e. X : Ω → Ω. A machine A achieves a particular task X if it change...

Mar 17, 2023
LW - On the Crisis at Silicon Valley Bank by Zvi
59:54
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the Crisis at Silicon Valley Bank, published by Zvi on March 16, 2023 on LessWrong. Many have already written about the events of the past week’s crisis. If you want the basics of what happened, you have many options. Your best bet, if available to you, is that this is Matt Levine’s wheelhouse. He did not disappoint, offering at least (1) (2) (3) (4) posts on the subject.
Then read Patrick McKenzie if you want the nuts and bolts of the underlying systems spelled out in plainer language and more detail, without judgment, along with the basics of what a responsible individual should do now, things he is better at explaining than I am. Then read someone like Scott Sumner here if you need to get the necessary counterpoints on moral hazard. I will do my best to cover all the necessary background in the What Happened section, to bring you up to speed.

What I am not trying to do is duplicate Levine’s work. I am also going to skip the explainers of things like ‘what is a bank run,’ since they are well-covered by many others – choose one of these ungated linked summaries, or better yet Matt Levine, to read first if you need that level of info. Instead, I am asking the questions, and looking at the things, that I found most interesting, or most important for understanding the world going forward.

What did I find most interesting? Here are some of my top questions.

- What exactly would have happened without an intervention?
- What changes for banking in the age of instant electronic banking and social networks?
- How much money have our banks lost, exactly?
- What might happen anyway?
- How much does talk of ‘bailout’ and laws we’ve passed constrain potential future interventions if something else threatens to go wrong? Ut oh.
- Is Hold to Maturity accounting utter bullshit and a main suspect here? Yes.
- What should depositing businesses be responsible for?
- What stories are people telling about what happened, and why?
- How do we deal with all the problems of moral hazard? What is enough?
- More generally, what the hell do we do about all this?

I also wonder about a variety of other things, such as what happened with USDC trading so low, to what extent people really do hate big tech, and more.

What Happened

In one meme:

Silicon Valley Bank had a ton of deposits that didn’t pay interest, largely from start-ups flush with cash.
They attracted that cash by offering high-touch bespoke services. The problem is that those services cost money, and there was no actually safe way to make that money back using their deposits. SVB could have said ‘our business is not profitable right now, but it is helping us build a future highly profitable business’ and used that value to raise equity capital, perhaps from some of their venture fund clients who are used to these types of moves. They decided to go a different way. Rather than accept that their business was unprofitable, they bought a ton of very low-yielding assets that were highly exposed to interest rate hikes. That way they looked profitable, in exchange for taking on huge interest rate risk on top of their existing interest rate risk from their customer base. Interest rates went up. Those assets lost $15 billion in value, while customers vulnerable to high interest rates became cash poor.

Also, SVB was in the business of providing venture debt to its clients. I have never understood venture debt. Why would you lend money to a start-up? What are you hoping for? If they pay you back, you should have invested instead; if they don’t pay, you don’t get paid; and if you get warrants as part of the deal, it looks a lot like investing in the start-up with strange and confusing terms. Or if we look at this thread, perhaps there is no catch, it is simply a bribe to get people to bank with you so you can bet their deposits on low interest rates? So maybe I do und...
Mar 17, 2023
LW - Donation offsets for ChatGPT Plus subscriptions by Jeffrey Ladish
04:31

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Donation offsets for ChatGPT Plus subscriptions, published by Jeffrey Ladish on March 16, 2023 on LessWrong. I've decided to donate $240 to both GovAI and MIRI to offset the $480 I plan to spend on ChatGPT Plus over the next two years ($20/month).

I don't have a super strong view on ethical offsets, like donating to anti-factory farming groups to try to offset harm from eating meat. That being said, I currently think offsets are somewhat good for a few reasons:

- They seem much better than simply contributing to some harm or commons problem and doing nothing, which is often what people would do otherwise.
- It seems useful to recognize, to notice, when you're contributing to some harm or commons problem. I think a lot of harm comes from people failing to notice or keep track of ways their actions negatively impact others, and the ways that common incentives push them to do worse things.

A common Effective Altruism argument against offsets is that they don't make sense from a consequentialist perspective. If you have a budget for doing good, then spend your whole budget on doing as much good as possible. If you want to mitigate harms you are contributing to, you can offset by increasing your "doing good" budget, but it doesn't make sense to specialize your mitigations to the particular area where you are contributing to harm rather than the area you think will be the most cost effective in general.

I think this is a decently good point, but it doesn't move me enough to abandon the idea of offsets entirely. A possible counter-argument is that offsets can be a powerful form of coordination to help solve commons problems. By publicly making a commitment to offset a particular harm, you're establishing a basis for coordination - other people can see you really care about the issue because you made a costly signal.
This is similar to the reasons to be vegan or vegetarian - it's probably not the most effective from a naive consequentialist perspective, but it might be effective as a point of coordination via costly signaling.

After having used ChatGPT (3.5) and Claude for a few months, I've come to believe that these tools are super useful for research and many other tasks, as well as useful for understanding AI systems themselves. I've also started to use Bing Chat and ChatGPT (4), and found them to be even more impressive as research and learning tools. I think it would be quite bad for the world if conscientious people concerned about AI harms refrained from using these tools, because I think it would disadvantage them in significant ways, including in crucial areas like AI alignment and policy. Unfortunately, both can be true:

1) Language models are really useful and can help people learn, write, and research more effectively.
2) The rapid development of huge models is extremely dangerous and a huge contributor to AI existential risk.

I think OpenAI, and to varying extents other scaling labs, are engaged in reckless behavior, scaling up and deploying these systems before we understand how they work well enough to be confident in our safety and alignment approaches. And also, I do not recommend people in the "concerned about AI x-risk" reference class refrain from paying for these tools, even if they decide not to offset these harms. The $20/month to OpenAI for GPT-4 access right now is not a lot of money for a company spending hundreds of millions training new models. But it is something, and I want to recognize that I'm contributing to this rapid scaling and deployment in some way.

Weighing all this together, I've decided offsets are the right call for me, and I suspect they might be right for many others, which is why I wanted to share my reasoning here.
To be clear, I think concrete actions aimed at quality alignment research or AI policy aimed at buying more time are much more important than o...
Mar 17, 2023
LW - Conceding a short timelines bet early by Matthew Barnett
01:01

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conceding a short timelines bet early, published by Matthew Barnett on March 16, 2023 on LessWrong. Last year I bet some people about short AI timelines. While I don't think I've lost the bet yet, I think it's clear at this point that I will lose with high probability. I've outlined the reasons why I think that in a retrospective here. Even if I end up winning, I think it will likely be the result of a technicality, and that wouldn't be very interesting. Because of my personal preference for settling this matter now without delay, I have decided to take the step of conceding the bet now. Note however that I am not asking Tamay to do the same. I have messaged the relevant parties and asked them to send me details on how to pay them. I congratulate Nathan Helm-Burger and Tomás B. for taking the other side of the bet. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 16, 2023
LW - Here, have a calmness video by Kaj Sotala
02:33

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Here, have a calmness video, published by Kaj Sotala on March 16, 2023 on LessWrong. This is a bit of an unusual post. I have gotten the impression that a lot of people are kind of freaked out, either by AI or weird Bay Area social dynamics in general. I also think that a lot of freak-out reactions are driven at least as much by social contagion as any fact-based assessment of what's happening. When you see people around you freak out, you too are much more likely to freak out. Conversely, if the people around you are calm, then you're also much more likely to stay calm. There's also a selection effect where freakouts tend to spread much more online than calmness does. If you're calm, you don't necessarily feel the need to post anything. You might be content to just be. Whereas if you're freaking out, you're much more likely to post stuff about how you're freaking out or how we're all going to die. So there's easily a cycle where the most distressed views predominate, that freaks people out and causes there to be more distressed posts, which freaks out more people, and so on. And this might be mostly uncorrelated with how much of a reason there was to actually freak out. But if we were all in the same physical space, we might all notice that only some people are freaking out and a lot are a lot more calm. And then the distress wouldn't spread as much, and we could think more clearly. I too am concerned about AI, but I'm not freaked out. (In part because I don't think freaking out would be a useful reaction to have, in part because I'm somewhat more optimistic than most, in part because I spend a lot of time with people who aren't freaking out.) If I were physically located in the same place as others who were freaking out, I think that my calm could help with their freakout. However, I'm not. 
And as stated, it's kinda hard to convey calmness over text, the same way you can convey distress. So I thought of making a video where I'm calm. Maybe that would help convey it better. It's here. In Finnish, but with English subtitles. I know it's low video quality; I recorded it in Zoom, and only noticed afterward that there's an "HD quality" button I could have clicked in the settings. Oops. But that was part of the intended vibe too. I could have spent a lot of time optimizing the video quality and everything. Instead, I just recorded it in one shot, because it's not such a big deal whether the video quality is great or not. I'll probably make another calmness video with better quality. No earlier than tomorrow. Because I don't feel like I'm in a rush. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 16, 2023
LW - Want to predict/explain/control the output of GPT-4? Then learn about the world, not about transformers. by Cleo Nardo
08:27

Mar 16, 2023
LW - GPT-4: What we (I) know about it by Robert AIZI
21:22

Mar 16, 2023
LW - How well did Manifold predict GPT-4? by David Chee
03:45

Mar 16, 2023
LW - Towards understanding-based safety evaluations by evhub
08:09

Mar 15, 2023
LW - ChatGPT (and now GPT4) is very easily distracted from its rules by dmcs
02:31

Mar 15, 2023
LW - The epistemic virtue of scope matching by jasoncrawford
07:18

Mar 15, 2023
LW - What happened to the OpenPhil OpenAI board seat? by ChristianKl
00:39

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What happened to the OpenPhil OpenAI board seat?, published by ChristianKl on March 15, 2023 on LessWrong. A while ago OpenPhil gave a decent sum of money to OpenAI to buy a board seat. Since then various criticisms of OpenAI have been made. Do we know anything about how OpenPhil used its influence via that board seat? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 15, 2023
LW - POC GTFO culture as partial antidote to alignment wordcelism by lc
12:03

Mar 15, 2023
LW - Success without dignity: a nearcasting story of avoiding catastrophe by luck by HoldenKarnofsky
24:23

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Success without dignity: a nearcasting story of avoiding catastrophe by luck, published by HoldenKarnofsky on March 14, 2023 on LessWrong.

I’ve been trying to form a nearcast-based picture of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe) and two “success stories” (one presuming a relatively gradual takeoff, one assuming a more discontinuous one). Those success stories rely on a couple of key actors (a leading AI lab and a standards-and-monitoring organization) making lots of good choices. But I don’t think stories like these are our only hope.

Contra Eliezer, I think we have a nontrivial[1] chance of avoiding AI takeover even in a “minimal-dignity” future - say, assuming essentially no growth from here in the size or influence of the communities and research fields focused specifically on existential risk from misaligned AI, and no highly surprising research or other insights from these communities/fields either. (There are further risks beyond AI takeover; this post focuses on AI takeover.)

This is not meant to make anyone relax! Just the opposite - I think we’re in the “This could really go lots of different ways” zone where marginal effort is most valuable. (Though I have to link to my anti-burnout take after saying something like that.) My point is nothing like “We will be fine” - it’s more like “We aren’t stuck at the bottom of the logistic success curve; every bit of improvement in the situation helps our odds.”

I think “Luck could be enough” should be the strong default on priors,[2] so in some sense I don’t think I owe tons of argumentation here (I think the burden is on the other side).
But in addition to thinking “I haven’t heard knockdown arguments for doom,” I think it’s relevant that I feel like I can at least picture success with minimal dignity (while granting that many people will think my picture is vague, wishful and wildly unrealistic, and they may be right). This post will try to spell that out a bit. It won’t have security mindset, to say the least - I’ll be sketching things out that “could work,” and it will be easy (for me and others) to name ways they could fail. But I think having an end-to-end picture of how this could look might be helpful for understanding my picture (and pushing back on it!). I’ll go through:

- How we could navigate the initial alignment problem:[3] getting to the first point of having very powerful (human-level-ish), yet safe, AI systems. For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial or nonexistent. (Also plausible that it’s fiendishly hard!) If so, it could end up cheap and easy to intent-align human-level-ish AIs, such that such AIs end up greatly outnumbering misaligned ones - putting us in good position for the deployment problem (next point).
- How we could navigate the deployment problem:[4] reducing the risk that someone in the world will deploy irrecoverably dangerous systems, even though the basic technology exists to make powerful (human-level-ish) AIs safe. (This is often discussed through the lens of “pivotal acts,” though that’s not my preferred framing.[5]) You can think of this as containing two challenges: stopping misaligned human-level-ish AI, and maintaining alignment as AI goes beyond human level. A key point is that once we have aligned human-level-ish AI, the world will probably be transformed enormously, to the point where we should consider ~all outcomes in play.
- (Briefly) The main arguments I’ve heard for why this picture is unrealistic/doomed.
- A few more thoughts on the “success without dignity” idea.
As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Car...
Mar 15, 2023
LW - ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so by Christopher King
03:04

Mar 15, 2023
LW - GPT can write Quines now (GPT-4) by Andrew Critch
01:55

Mar 14, 2023
LW - GPT-4 by nz
00:44

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GPT-4, published by nz on March 14, 2023 on LessWrong. We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while worse than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. Full paper available here: Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 14, 2023
LW - Discussion with Nate Soares on a key alignment difficulty by HoldenKarnofsky
33:36

Mar 13, 2023
LW - Nose / throat treatments for respiratory infections by juliawise
12:57

Mar 13, 2023
LW - What Discovering Latent Knowledge Did and Did Not Find by Fabien Roger
18:50

Mar 13, 2023
LW - Plan for mediocre alignment of brain-like [model-based RL] AGI by Steven Byrnes
20:26

Mar 13, 2023
LW - your terminal values are complex and not objective by carado
04:00

Mar 13, 2023
LW - What problems do African-Americans face? An initial investigation using Standpoint Epistemology and Surveys by tailcalled
22:42

Mar 13, 2023
LW - An AI risk argument that resonates with NYTimes readers by Julian Bradshaw
02:20

Mar 13, 2023
LW - "Liquidity" vs "solvency" in bank runs (and some notes on Silicon Valley Bank) by rossry
18:32

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Liquidity" vs "solvency" in bank runs (and some notes on Silicon Valley Bank), published by rossry on March 12, 2023 on LessWrong. Epistemic status: Reference post, then some evidenced speculation about emerging current events (as of 2023-03-12).

A "liquidity" crisis

There's one kind of "bank run" where the story, in stylized terms, starts like this:

- A bank opens up and offers 4%/ann interest on customer deposits. 100 people each deposit $75 to the bank.
- The bank uses $7,500 to buy government debt that will pay back $10,000 in five years (let's call this "$10,000-par of Treasury notes", and call that a 5%/ann interest rate for simplicity). We're going to assume for this entire post that government debt never defaults and everyone knows that and assumes it never defaults.
- The thing you hope will happen is for every depositor to leave their money for five years, at which point you'll repay them $95 each and keep $500—which is needed to run the bank.
- Instead, the next week, one customer withdraws their deposit; the bank sells $100-par of T-notes for $75, and gives them $75 back. No problem.
- A second customer withdraws their deposit; oops, the best price the bank can get for $100-par of T-notes, right now after it just sold a bit, is $74. Problem. But next week, let's say, it would be possible to sell another $100-par for $75 again.

At this point, the simplified bank is stuck. If it sells ~$101-par of T-notes to return the $75 deposit, it won't have enough to pay everyone else back, even if the withdrawals stop here! But if it doesn't give the depositor back $75 right now, then bad things will start to happen.
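The arithmetic behind "it won't have enough to pay everyone else back" checks out; here is a minimal sketch in Python, using only the toy numbers from the story above:

```python
# Toy numbers from the post: 100 depositors x $75 of deposits,
# backed by $10,000-par of T-notes that normally sell for $75
# per $100-par (i.e. $0.75 per $1 of face value).
deposits = 100 * 75.0    # $7,500 owed to depositors
par_held = 10_000.0      # face value of T-notes held
normal_price = 0.75      # next week's (recovered) price per $1-par

# Withdrawal 1: sell $100-par at the normal price to raise $75.
par_held -= 75.0 / 0.75
deposits -= 75.0

# Withdrawal 2: the price has slipped to $74 per $100-par, so the
# bank must sell ~$101.35-par to raise the same $75.
par_held -= 75.0 / 0.74
deposits -= 75.0

# Even if withdrawals stop and the price recovers to $75 per
# $100-par, the remaining notes no longer cover remaining deposits.
assets_at_recovered_price = par_held * normal_price
print(round(assets_at_recovered_price, 2))  # 7348.99
print(deposits)                             # 7350.0
```

The shortfall is tiny (about a dollar), but that is the whole point: one fire sale at a depressed price is enough to tip the stylized bank from fully covered into "can't pay everyone back".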
Equity capital: A liquidity solution

So, we fix this problem by going back in time and starting with an extra step that's now required by law: before taking $7,500 of deposits, the bank has to raise 10% of that—so, $750—of what we'll call "equity capital". Equity capital will get used to fill the gap between asset sales and returned deposits.

Now, the final step of the original story goes differently: $1 of equity capital, plus the $74 from the T-notes sale, go to repaying the withdrawn deposit. Now the bank has 98 × $75 of deposits, and $749 of equity capital. If nothing happens until next week (when the T-note price will go back to $75), everything will be fine. (In fact, the bank now has 10.19% of deposits in equity capital, making it safer than before.)

A third customer withdrawal forces the bank to sell another $100-par of T-notes at $73, and use $2 of equity capital to repay the deposit. Now the bank has $747 of equity capital, 97 × $75 of deposits, and an equity-to-deposits ratio of 10.27%.

A fourth customer withdrawal forces the bank to sell another $100-par of T-notes at $72, and use $3 of equity capital to repay the deposit. Now the bank has $744 of equity capital, 96 × $75 of deposits, and an equity ratio of 10.33%.

Even as the withdrawals force the bank to sell T-notes for greater and greater losses (relative to the $75 that the price will go back to next week), the equity ratio stays above 10%. Until...

The fourteenth customer withdrawal forces the bank to sell another $100-par of T-notes at $62, and use $13 of equity capital to repay the deposit. Now the bank has $659 of equity capital, 86 × $75 of deposits, and an equity ratio of 10.22%.

The fifteenth customer withdrawal forces the bank to sell another $100-par of T-notes at $61, and use $14 of equity capital to repay the deposit. Now the bank has $645 of equity capital, 85 × $75 of deposits, and an equity ratio of 10.12%.
The sixteenth customer withdrawal forces the bank to sell another $100-par of T-notes at $60, and use $15 of equity capital to repay the deposit. Now the bank has $630 of equity capital, 84 × $75 of deposits, and an equity ratio of 10.0%. ...and here is where the oops happens. Still, we're much better th...

Mar 12, 2023
LW - "You'll Never Persuade People Like That" by Zack M Davis
03:42
Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "You'll Never Persuade People Like That", published by Zack M Davis on March 12, 2023 on LessWrong. Sometimes, when someone is arguing for some proposition, their interlocutor will reply that the speaker's choice of arguments or tone wouldn't be effective at persuading some third party. This would seem to be an odd change of topic. If I was arguing for this-and-such proposition, and my interlocutor isn't, themselves, convinced by my arguments, it makes sense for them to reply about why they, personally, aren't convinced. Why is it relevant whether I would convince some third party who isn't here?

What's going on in this kind of situation? Why would someone think "You'll never persuade people like that" was a relevant reply? "Because people aren't truthseeking and treat arguments as soldiers" doesn't seem like an adequate explanation by itself. It's true, but it's not specific enough: what particularly makes appeal-to-persuading-third-parties an effective "soldier"?

The bargaining model of war attempts to explain why wars are fought—and not fought; even the bitterest enemies often prefer to grudgingly make peace with each other rather than continue to fight. That's because war is costly.
If I estimate that by continuing to wage war, there's a 60% chance my armies will hold a desirable piece of territory, I can achieve my war objectives equally well in expectation—while saving a lot of money and human lives—by instead signing a peace treaty that divides the territory with the enemy 60/40. If the enemy will agree to that, of course. The enemy has their own forecast probabilities and their own war objectives. There's usually a range of possible treaties that both combatants will prefer to fighting, but the parties need to negotiate to select a particular treaty, because there's typically no uniquely obvious "fair" treaty—similar to how a buyer and seller need to negotiate a price for a rare and expensive item for which there is no uniquely obvious "fair" price. If war is bargaining, and arguments are soldiers, then debate is negotiation: the same game-theoretic structure shines through armies fighting over the borders on the world's political map, buyer and seller haggling over contract items, and debaters arguing over the beliefs on Society's shared map. Strong arguments, like a strong battalion, make it less tenable for the adversary to maintain their current position. Unfortunately, the theory of interdependent decision is ... subtle. Although recent work points toward the outlines of a more elegant theory with fewer pathologies, the classical understanding of negotiation often recommends "rationally irrational" tactics in which an agent handicaps its own capabilities in order to extract concessions from a counterparty: for example, in the deadly game of chicken, if I visibly throw away my steering wheel, oncoming cars are forced to swerve for me in order to avoid a crash, but if the oncoming drivers have already blindfolded themselves, they wouldn't be able to see me throw away my steering wheel, and I am forced to swerve for them. 
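The steering-wheel move can be made concrete with a toy payoff matrix; this is a sketch with made-up payoff numbers (not from the post), just to show why a visible commitment to "straight" forces the other driver to swerve:

```python
# Chicken, with illustrative payoffs: each entry is (my payoff, their payoff).
payoffs = {
    ("swerve", "swerve"):     (0, 0),
    ("swerve", "straight"):   (-1, 1),    # I back down, they win
    ("straight", "swerve"):   (1, -1),    # they back down, I win
    ("straight", "straight"): (-10, -10), # crash: worst outcome for both
}

def best_response(my_visible_move):
    """The other driver's best move, given that mine is fixed and visible."""
    return max(("swerve", "straight"),
               key=lambda theirs: payoffs[(my_visible_move, theirs)][1])

# Visibly throwing away my steering wheel locks my move to "straight";
# the oncoming driver's best response is then to swerve (-1 beats -10).
print(best_response("straight"))  # swerve
```

And the blindfold counter-move corresponds to the opponent refusing to observe `my_visible_move` at all, which is exactly why it neutralizes the commitment.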
Thomas Schelling teaches us that one such tactic is to move the locus of the negotiation elsewhere, onto some third party who has less of an incentive to concede or is less able to be communicated with. For example, if business purchases over $500 have to be approved by my hard-to-reach boss, an impatient seller of an item that ordinarily goes for $600 might be persuaded to give me a discount. And that's what explains the attractiveness of the appeal-to-persuading-third-parties. What "You'll never persuade people like that" really means is, "You are starting to persuade me against my will, and I'm laundering my cognitive dissonance by asserting that you actually need to persuade someone else who isn't here." When someone is desperate enou... Mar 12, 2023 LW - Parasitic Language Games: maintaining ambiguity to hide conflict while burning the commons by Hazard 19:53 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Parasitic Language Games: maintaining ambiguity to hide conflict while burning the commons, published by Hazard on March 12, 2023 on LessWrong. “They are playing a game. They are playing at not playing a game. If I show them I see they are, I shall break the rules and they will punish me. I must play their game, of not seeing I see the game” - R. D. Laing "It's not lying if everyone knows it's lying." I see this sentiment in a lot of places. It pops up in corporate managerial contexts. It's been used as a legal defense and worked. It's a claim that communication that looks adversarial isn't, it's just high-context communication between people "in the know", there's no deception happening, no conflict, you just don't get how we do things here. I don't buy it.
My claim in a nutshell: In situations where people insist "it's not lying because everyone knows it's lying" the people in the know aren't deceiving each other, but the reason this game is being played is to fool people not in the know, and insisting that it's just "high context communication" is part of an effort to obscure the fact that a conflict is going on. If that makes perfect sense to you, dope, you already get my main point. The rest of this post is adding nuance, actually arguing the case, and providing more language for talking about these sorts of dynamics. Case Study: "Are Founders Allowed to Lie?" This essay by Alex Danco talks about how "it's not lying because everybody knows it's lying" works in the Silicon Valley startup scene. It's short enough that it's worth reading now so you can decide for yourself if I'm misrepresenting him. If you don't feel like reading it I still quote enough of it for my post to make sense. Some snippets. It's really hard to start a business without lying: If you are only allowed to tell the literal, complete truth, and you’re compelled to tell that truth at all times, it is very difficult to create something out of nothing. You probably don’t call it “lying”, but founders have to will an unlikely future into existence. To build confidence in everyone around you – investors, customers, employees, partners – sometimes you have to paint a picture of how unstoppable you are, or how your duct tape and Mechanical Turk tech stack is scaling beautifully, or tell a few “pre-truths” about your progress. Hey, it will be true, we’re almost there, let’s just say it’s done, it will be soon enough. It's not lying because everyone's in on it. You’re not misleading investors; your investors get it: they’re optimizing for authenticity over ‘fact-fulness’. It’s not fraud. It’s just jump starting a battery, that’s all. Some abstracted examples of what this "pre-truth" looks like: You’ve all seen this.
It doesn’t look like much; the overly optimistic promises, the “our tech is scaling nicely” head fakes, the logo pages of enterprise customers (whose actual contract status might be somewhat questionable), maybe some slightly fudged licenses to sell insurance in the state of California. It’s not so different from Gates and Allen starting Microsoft with a bit of misdirection. It comes true in time; by the next round, for sure. Why it's important and also why you can't talk about it: Founders will present you with something pre-true, under the total insistence that it’s really true; and in exchange, everyone around them will experience the genuine emotion necessary to make the project real. Neither party acknowledges the bargain, or else the magic is ruined. Before investigating if Danco's story checks out I'm going to introduce some frames for talking about communication to make it easier for me to clarify what's going on here. Context & Language Games All communication relies on context and context has a nested structure which operates on multiple levels of communication. Some context operate... Mar 12, 2023 LW - A bunch of videos for intuition building (2x speed, skip ones that bore you) by the gears to ascension 07:03 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A bunch of videos for intuition building (2x speed, skip ones that bore you), published by the gears to ascension on March 12, 2023 on LessWrong. did I mention 2x speed? hit play on many, pause on almost as many. This is a big list of the youtube videos I find myself linking to people most often. These are all from the same playlist: - comment here if you'd like edit access, I give it out readily. I'd love to have people moving the most important and insightful videos towards the beginning. 
I'd also love to see people clone the playlist and just make their own version. These vary wildly in topic and difficulty level. I generally do not try to avoid watching things above my level, I just use it as inspiration for how to fill in what I'm missing. If something sounds basic to you, it probably is. Many of these videos are quite short, many are quite long. 1min: neuron clip 23min: neuroscience overview (bio neuron interpretation) or 10min with 2x speed! 10min: overview of learning techniques (bit clickbaity title but I include because I like it plenty anyhow) or 5min with 2x speed! 2min: visual intuition - details of how one particular chaotic fluid flow move 11min: research talk on what collective intelligence is. (see also many more cool talks from MITCBMM!) or 5min with 2x speed! 2min: visualization of a volume of neurons in a (mouse?) amygdala 8min: cognitive biases in practice 33min: absolutely incredible visual intro to physics sims focusing towards fluid simulation or 15min with 2x speed! 15min: cs101 "ok, but what does it mean to abstract over the matter of a computer" or 7min with 2x speed! 1min: visualization of particle lenia 20min: overview of Michael Levin's research on the bioelectric communication of cells for morphogenesis and morphogenic editing without genetic change or 10min with 2x speed 11min: cs101 how a neural network is actually just line segments (with relu, anyway) 12min: nice intro to what chaos theory is actually about 18min: overview of ways visual proofs can mislead 4min: overview of some important additional notes on how to learn efficiently. this playlist does not satisfy them all. 14min: Visual intro to why neural networks work. goes into detail about the geometric interpretation of neural networks. 15min: geometric interpretation of bayes' rule. Useful for intuition building even if you get the math. See also chris olah's blog post on the same topic from a few years prior. 
4min: visualization of atoms that better communicates what the probability fields are fields of. 6min: nice intro to what claim the manifold hypothesis of neural network effectiveness makes about the structure of natural data. 20min: a perspective on why anecdotes are important for natural communication (very rough summary: humans natively think in sequences of embodied events) 20min: intro to the clocks of the brain 43min: visualization of inventing math from only physical shapes As a strict philosophical materialist, this is what made me start believing in math again ;) 20min on 2x speed! 7min: visualization of one rather narrow simulation of abstract market agents and the effect that interest-bearing loans have on a simulation There are several more videos in will ruddick's playlists that go over the various configuration changes to this sim, and he also has a version you can try online 35min: more steps through even larger scale abstractions of fluid behavior for simulation 10min: intro to why you'd want to know category theory - all math is secretly category theory (but not an intro to the actual math in detail) 15min: overview of some results from evolutionary game theory 25min: overview of a very common abstract model of phase transitions 37min: rehash of the percolation video but with slightly less grokkable explanation, but then gets into connection to how this... Mar 12, 2023 LW - How to Support Someone Who is Struggling by David Zeller 07:48 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to Support Someone Who is Struggling, published by David Zeller on March 11, 2023 on LessWrong. [Crossposted from my blog] There’s no shortage of pain, tragedy and loss in the world. And if you’re anything like me, you don’t always know how to be helpful when a loved one is going through the worst of it. 
Over the past few years, I’ve been trying to get better at that. I’ve read a couple dozen therapy textbooks, I’ve done four hundred or so hours of client-centered counselling, and I’ve been in a handful of other official and unofficial helping roles. By no means am I an expert, but I sure know more than I used to. For my first blog post, I wanted to write something that past-me might have found helpful when he started stumbling through it all. In time, there’s so much more that I want to say on the art of supporting others. But for now... Here are four fundamentals for helping someone who’s having a rough time: 1 - Simply listen. It helps far more than most of us expect. When a catastrophe happens, it can change the whole landscape of one’s world. The tectonic plates shift, things break, and everything comes to look bewilderingly different to how it did before. In the aftermath, we may have no good choice other than to stop, watch the buildings fall, and slowly map out this strange new world we’re in. Perhaps only then we can move forward. Unfortunately, processing such big changes purely in one’s own head is... hard. Thoughts are ephemeral and it’s easy to think in circles, to get stuck, to have blind spots, to ruminate. This is where listening comes in. A good listener can be of much help with that working through process. Patiently, the listener can keep track of where a conversation is getting stuck, gently bring up the things that are being avoided or missed, help bring attention towards what is most important, and bring a genuine sense of connection that makes all the bad stuff a little easier to bear. As simple as it seems, having someone there to just listen may be exactly what the person in front of you needs. 2 - Rather than focusing on the bright side, sit with the other person’s real feelings. This next point comes straight from Brené Brown.
I’ve been shown the same video of her so many times in different training courses that I’m starting to get Stockholm syndrome. All the same, what it says is important. Often when we’re trying to support another person, we try to get them to focus on the bright side. Standing separately from the other’s experience, we attempt to offer them silver linings. “You may have failed this class... but at least your other grades are good.” “Your partner left you... but at least you’re free to find someone who’ll treat you better.” “You may have a disease with no cure... but at least there are lots of scientists working to find new treatments.” People use these silver linings with the intention to help the other person view their situation in a more positive light. Unfortunately, in most cases, this does not end up bringing them any relief. When you’re going through a tough time, talking to someone who only focuses on the nicer aspects of your bad situation most often just feels disorienting. This happens because, at some level, you’re being told that your problems are not as bad as you think they are. Instead of feeling reassured, you feel like your grip on reality is being questioned. The good intentions get lost in translation. Luckily, there’s an alternative that really does let us bring some relief to others’ suffering: Empathy. Rather than try to look on the bright side, it’s helpful to sit with the other person in their pain. To attempt to really understand, at an emotional level, the whole landscape of what they’re going through. When we manage to do this, it brings a genuine sense of connection, and a feeling that one doesn... Mar 12, 2023 LW - Understanding and controlling a maze-solving policy network by TurnTrout 36:01 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.
This is: Understanding and controlling a maze-solving policy network, published by TurnTrout on March 11, 2023 on LessWrong. TL;DR: We algebraically modified the net's runtime goals without finetuning. We also found (what we think is) a "motivational API" deep in the network. We used the API to retarget the agent. Summary of a few of the most interesting results: Langosco et al. trained a range of maze-solving nets. We decided to analyze one which we thought would be interesting. The network we chose has 3.5M parameters and 15 convolutional layers. This network can be attracted to a target location nearby in the maze—all this by modifying a single activation, out of tens of thousands. This works reliably when the target location is in the upper-right, and not as reliably when the target is elsewhere. Considering several channels halfway through the network, we hypothesized that their activations mainly depend on the location of the cheese. We tested this by resampling these activations with those from another random maze (as in causal scrubbing). We found that as long as the second maze had its cheese located at the same coordinates, the network’s behavior was roughly unchanged. However, if the second maze had cheese at different coordinates, the agent's behavior was significantly affected. This suggests that these channels are inputs to goal-oriented circuits, and these channels affect those circuits basically by passing messages about where the cheese is. This network decides whether to acquire cheese not only as a function of path-distance to cheese, but—after controlling for path-distance—also as a function of Euclidean/"perceptual" distance between the mouse and the cheese, even though the agent sees the whole maze at once. Another simple idea: We define a "cheese vector" as the difference in activations when the cheese is present in a maze, and when the cheese is not present in the same maze. 
For each maze, we generate a single cheese vector and subtract that vector from all forward passes in that maze. The agent now ignores cheese most of the time, instead heading towards the top-right region (the historical location of cheese). We propose the algebraic value-editing conjecture (AVEC): It's possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as "run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a 'niceness vector' to be the diff between their activations, and then add the niceness vector to future forward passes." Introducing the training process and visualizations In this post, we'll mostly discuss what we found, not what our findings mean. Let's run through some facts about Langosco et al.'s training process. Mazes had varying effective sizes, ranging from 3×3 to 25×25: Each 64×64 RGB observation is processed by a deeply convolutional (15 conv layers!) network, without memory (i.e. no recurrent state): Why does the agent go to the cheese sometimes, and the top-right corner other times? It's not that the agent wasn't trained for long enough. Sampling rollouts from the trained policy adds a lot of noise. It's also hard to remember what the agent did in what part of the maze. To better understand this mouse, we'll take a bird's-eye view. A nicer way to view episodes is with a vector field view, which overlays a vector field representing the agent policy for a given maze. We consider two kinds of vector fields: While the net probability vector field leaves open two degrees of freedom per net probability vector, in practice it seems fine for eyeballing mouse behavior. Behavioral analysis When in doubt, get more data. When Alex (TurnTrout) was setting directions but didn't know what to do, he'd think "what data firehydrants can I crack... 
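The cheese-vector construction described above (record activations with and without the cheese, take the difference, subtract it at runtime) can be sketched on a toy network. The two-layer net and the "first two features encode the cheese" observation encoding are stand-ins for illustration, not the actual 15-layer policy:

```python
import numpy as np

def forward(obs, weights, steer=None):
    """Toy 2-layer policy; `steer`, if given, is subtracted from the hidden layer."""
    h = np.tanh(obs @ weights[0])
    if steer is not None:
        h = h - steer          # runtime activation edit, no retraining
    return h @ weights[1]      # action logits

rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 4))]

obs_with_cheese = rng.normal(size=8)
obs_no_cheese = obs_with_cheese.copy()
obs_no_cheese[:2] = 0.0        # hypothetical: first two features encode the cheese

# "Cheese vector" = activation difference between the paired observations
h_with = np.tanh(obs_with_cheese @ weights[0])
h_without = np.tanh(obs_no_cheese @ weights[0])
cheese_vector = h_with - h_without

# Subtracting it from the cheese observation reproduces the no-cheese behavior
steered = forward(obs_with_cheese, weights, steer=cheese_vector)
assert np.allclose(steered, forward(obs_no_cheese, weights))
```

In the post's setting the same subtraction is applied at a chosen layer of the real network on every forward pass in a maze; the toy setup only shows why a difference-of-activations edit cancels the cheese's contribution at that layer.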
Mar 11, 2023 LW - Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk? by Jeffrey Ladish 18:57 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk?, published by Jeffrey Ladish on March 10, 2023 on LessWrong. Note: I really appreciate the work that the OpenAI alignment team put into their alignment plan writeup and related posts, especially Jan Leike, the leader of that team. I believe open discussions about alignment approaches make it more likely that the whole community will be able to find flaws in their own plans and unappreciated insights, resulting in better alignment plans over time. Summary: OpenAI’s alignment plan acknowledges several key challenges of aligning powerful AGI systems, and proposes several good ideas. However, the plan fails to sufficiently address: The dual-use nature of AI research assistants and the high risk that such assistants will improve capabilities more than alignment research in ways that net-increase AI existential risk. The likely challenges involved in both generating and evaluating AI alignment research using AI research assistants. It seems plausible that generating key insights about the alignment problem will not be possible before the development of dangerously powerful AGI systems. The nature and difficulty of the alignment problem. There are substantial reasons why AI systems that pass all tests in development may not stay safe once able to act in the world. There are substantial risks from goal misgeneralization, including deceptive misalignment, made worse by potential rapid increases in capabilities that are hard to predict. 
Any good alignment plan should address these problems, especially since many of them may not be visible until an AI system already has dangerous capabilities. The dual-use nature of AI research assistants and whether these systems will differentially improve capabilities and net-increase existential risk There has been disagreement in the past about whether “alignment” and “capabilities” research are a dichotomy. Jan Leike has claimed that they are not always dichotomous, and this is important because lots of capabilities insights will be useful for alignment, so the picture is not as worrisome as a dichotomous picture might make it seem. I agree with Jan that alignment and capabilities research are not dichotomous, but in a way I think actually makes the problem worse, not better. Yes, it’s probable that some AI capabilities could help solve the alignment problem. However, the general problem is that unaligned AGI systems are far easier to build - they’re a far more natural thing to emerge from a powerful deep learning system than an aligned AGI system. So even though there may be deep learning capabilities that can help solve the alignment problem, most of these capabilities are still easier applied to making any AGI system, most of which are likely to be unaligned even when we’re trying really hard. Let’s look at AI research assistants in particular. I say “AI research assistant” rather than “alignment research assistant” because I expect that it's highly unlikely that we will find a way to build an assistant that is useful for alignment research but not useful for AI research in general. Let’s say OpenAI is able to train an AI research assistant that can help the alignment team tackle some difficult problems in interpretability. That’s great! However, a question is, can that model also help speed up AGI development at the rest of the company? If so, by how much? And will it be used to do so?
Given that building an aligned AGI is likely much harder than building an unaligned AGI system, it would be quite surprising if an AI research assistant was better at helping with AGI safety research differentially over AGI development research more broadly. Of course it’s possible that a research tool that sped up capabilities research more ... Mar 11, 2023 LW - Questions about Conjecture's CoEm proposal by Akash 04:20 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Questions about Conjecture's CoEm proposal, published by Akash on March 9, 2023 on LessWrong. Conjecture recently released an AI safety proposal. The three of us spent a few hours discussing the proposal and identifying questions that we have. (First, we each re-read the post and independently brainstormed a few questions we had. Then, we discussed the post, exchanged questions/uncertainties, and consolidated our lists). Conjecture's post is concise, which means it leaves out many details. Many of our questions are requests for more details that would allow us (and others) to better understand the proposal and evaluate it more thoroughly. Requesting examples and details What are the building blocks that the CoEms approach will draw from? What are examples of past work that has shown us how to build powerful systems that are human-understandable? What are examples of “knowledge of building systems that are broadly beneficial and safe while operating in the human capabilities regime?” (see Wei_Dai’s comment) What’s an example of an experiment that would be considered part of the CoEm agenda? (see Garret Baker’s comment) What kinds of approaches does Conjecture intend to use to extract alignment insights “purely from mining current level systems”? (Is this the same as interpretability research and digital neuroscience?)
The “minimize magic” section feels like that is where the juice is, but it’s not really explained much, which makes it difficult to evaluate. Can you offer more details about how you intend to minimize magic? Conceptual questions Assume you had a fully human-understandable system, and you could understand its current capabilities. How would you be able to forecast its future capabilities (e.g., if deployed or if given certain commands)? If we solve human neuroscience such that we could understand the brain of a 2-year-old, we would be able to accurately assess the (current) capabilities of the 2-year-old. However, we would not necessarily be able to predict the (future) capabilities of this brain once it is 30 years old. Analogously, if we had a human-understandable AI (that may be superintelligent) through the CoEms agenda, would we only be able to understand its current capabilities, or would there be a reliable way to forecast its future capabilities? Charlotte thinks that humans and advanced AIs are universal Turing machines, so predicting capabilities is not about whether a capability is present at all, but whether it is feasible in finite time with a low enough error rate. Predicting how such error rates decline with experience and learning seems roughly equally hard for human-understandable AIs and other AIs. How easy is it to retarget humans? When you refer to “retargetability”, we assume you refer to something like the following: “Currently the AI has goal X, you want to train it to have goal Y. If you do that, you truly change its goals to Y (rather than making it pretend to follow Y and then when you are not in control anymore, it switches back to X)”. We agree that in some sense, humans are retargetable. For example, if someone has very advanced persuasion tools or if the “persuader” is significantly stronger than the “persuadee” (e.g., a dictator persuading a citizen).
But even that is very hard, and often one just changes their incentives/strategy rather than their actual goals. However, humans seem to be much less retargetable by other agents who are similarly powerful. For example, how would you retarget the goals of an (equally intelligent and equally powerful) neighbor? Alternatively, you might refer to a much weaker version of “retargetability”, e.g. a very weak version of corrigible alignment. If this is what you mean, I am wondering why this is a particularly important property? Other questions Does Conjecture believe this approach is comp... Mar 11, 2023 LW - Japan AI Alignment Conference by Chris Scammell 01:44 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Japan AI Alignment Conference, published by Chris Scammell on March 10, 2023 on LessWrong. Conjecture and ARAYA are hosting and organizing the first Japan AI Alignment Conference. The conference will take place in Tokyo, Japan on March 11 and 12. Details about the event can be found here. This event is generously supported by a grant from the Long Term Future Fund. The goal of the conference is to illustrate the AI control problem to Japanese AI researchers, introduce them to current trends in AI alignment research, inspire new research directions, and to provide Western researchers exposure to a different set of AI safety thoughts from Japan. This is an exploratory event, and we plan to write a postmortem about the event in due time. The first half of the conference will be livestreamed. It will feature an opening talk from Connor Leahy (CEO of Conjecture), a fireside chat between Ryota Kanai (CEO of ARAYA) and Jaan Tallinn, and some presentations on AI safety research directions in the West and in Japan. You can follow the first part of the conference here. The livestream runs from 9:30am-12:30pm JST.
The rest of the conference will not be livestreamed, and will consist of in-person small group workshops to discuss various AI alignment research directions. The conference will have ~50 attendees from ARAYA, Conjecture, Whole Brain Architecture Initiative, MIRI, OpenAI, RIKEN, Ritsumeikan University, University of Tokyo, Omron Sinic X, Keio University, and others. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 10, 2023 LW - The hot mess theory of AI misalignment: More intelligent agents behave less coherently by Jonathan Yan 00:30 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The hot mess theory of AI misalignment: More intelligent agents behave less coherently, published by Jonathan Yan on March 10, 2023 on LessWrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 10, 2023 LW - The Translucent Thoughts Hypotheses and Their Implications by Fabien Roger 34:32 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Translucent Thoughts Hypotheses and Their Implications, published by Fabien Roger on March 9, 2023 on LessWrong. Epistemic status: Uncertain about the validity of the claims I’m making here, and looking for feedback about the research directions I’m suggesting. Thanks to Marius Hobbhahn, Johannes Treutlein, Siméon Campos, and Jean-Stanislas Denain for helpful feedback on drafts. Here is a set of hypotheses: The first AGIs will have LLMs at their core Effective plans to defeat humanity can’t be found in a single LLM forward pass LLMs will solve complex tasks by using English text (self-prompting, scratch pads, combination of expert LLMs, ...)
I call these the Translucent Thoughts hypotheses. I think the Translucent Thoughts hypotheses are likely (around 20% conditioning on AGI before 2030) because: Text pretraining is more efficient at building algorithms and knowledge required for real-world plan generation and evaluation than alternative methods; Future models are likely to be like Transformers, which use a limited number of serial steps in a single forward pass, and deception requires many serial steps; Text pretraining and slight fine-tuning makes models able to use text generation to increase the maximum number of serial steps by a huge factor. Getting this increase through other means is likely to be hard and not competitive. If these hypotheses are true, it should lead us to prioritize underexplored research directions, such as circumventing steganography or building extremely reliable text-supervision methods. I think those deserve attention, because Translucent Thoughts AIs are not safe by default. In this post, I argue that we may well live in a world where the first AGIs will look like X, and I then describe ways to make the first AGIs safer given X. This is different from most other works in this space, which often directly describe a kind of safe AGI. Despite this, the ideas of this post are close to some other works describing paths to safe AGIs, such as: Externalized Reasoning Oversight, which describes a class of solutions similar to the one outlined here, but also aims for additional properties which I argue can be replaced with a less stringent hypothesis about AI systems; Conditioning Predictive Models, which makes assumptions slightly different from the Translucent Thoughts hypotheses, yielding different research directions; The Open Agency Model and Factored Cognition which describe subsets of AIs with Translucent Thoughts, which might be safe. The Translucent Thoughts Hypotheses Here, I sketch a world in which the first AGIs have certain properties.
I argue that this world is likely, and thus a subset of all possible futures to care about. But I think it’s not a large part of all possible futures (20% conditioning on AGI before 2030). The First AGIs Will Have LLMs at Their Core By “first AGIs” I mean the first systems able to automate all cognitive tasks. AGI is likely to do reasoning and planning using LLMs. AGI might rely on vision models for some tasks and interactions with the world, and it might use explicit search processes like AlphaGo. But I expect LLMs to do plan generation and evaluation, which are the core of the system (from an Alignment point of view). Why: Vision systems are bad at coming up with and evaluating deceptive plans. Explicit search processes can’t generate and evaluate plans in the real world. LLMs seem to be able to do both plan generation and evaluation. (Plan generation and evaluation are the core tasks we would like to monitor to make AGIs safe, which is why I focus on those.) End-to-end neural networks won’t be able to compete with LLMs when it comes to reasoning and planning, or at least, end-to-end networks will use “their LLMs parts” to do their most advanced form of reasoning and planning. This means that systems lik... Mar 10, 2023 LW - Why Not Just Outsource Alignment Research To An AI? by johnswentworth 14:22 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Not Just Outsource Alignment Research To An AI?, published by johnswentworth on March 9, 2023 on LessWrong. Warmup: The Expert If you haven’t seen “The Expert” before, I recommend it as a warmup for this post: The Client: “We need you to draw seven red lines, all strictly perpendicular. Some with green ink, some with transparent. Can you do that?” (... a minute of The Expert trying to explain that, no, he cannot do that, nor can anyone else.) 
The Client: “So in principle, this is possible.” This. This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform. At best, The Expert is well-intentioned and knows what the user needs, ignores the incoherent parts of The Client’s babbling, and does the right thing. Or, they manage to add some silly but ultimately harmless bells and whistles to satisfy whatever dumb thing The Client is looking for. At worst... well, there’s more than one failure mode which could qualify for the title of "worst". Maybe The Expert gives The Client something which looks right to The Client and successfully conceals all the problems with it; presumably that’s a lucrative strategy for Experts. Maybe the Double Illusion of Transparency kicks in: both parties think they’ve successfully communicated, but in fact neither has any idea what’s going on in the other’s head. Maybe a well-intentioned Expert decides to ignore The Client’s incoherent babbling and do the thing which seems most likely to be right, but gets The Client’s preferences wrong. One way or another, The Client’s ignorance is a major bottleneck to cognitive outsourcing. In practice, I expect The Client’s ignorance to be the primary bottleneck to cognitive outsourcing. The core reason why we cannot just outsource alignment research to an AI is that we would then be The Client, and probably a very ignorant one.
Application to Alignment Schemes
There are a lot of different flavors of “have the AI solve alignment for us”.
A sampling:
- Just prompt a language model to generate alignment research
- Do some fine-tuning/RLHF on the language model to make it generate alignment research
- Let the language model talk to other instances of itself, and prompt or fine-tune them together so they generate alignment research jointly
- Set up a language model to generate alignment proposals and another to poke holes in them, and fine-tune the pair via a human judging the “debate”
As we go down the list, the proposals get fancier and add more bells and whistles, trying to make the AI a better expert. Sadly, none of them addresses what I expect to be the actual main bottleneck: The Client (i.e. the human user or users) has no understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc. What would this kind of error look like in practice? Here’s one concrete example of the kind of failures I’d expect when a would-be outsourcer’s understanding falls short (from here): Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether t... Mar 09, 2023 LW - Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down?
by Robert AIZI 02:47 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down?, published by Robert AIZI on March 9, 2023 on LessWrong. In recent discourse, Large Language Models (LLMs) are often depicted as presenting a human face over a vast alien intelligence (the shoggoth), as in this popular image or this Eliezer Yudkowsky tweet: I think this mental model of an LLM is an improvement over the naive assumption that the AI is the friendly mask. But I worry it's making a second mistake by assuming there is any single coherent entity inside the LLM. In this regard, we have fallen for a shell game. In the classic shell game, a scammer puts a ball under one of three shells, shuffles them around, and you wager which shell the ball is under. But you always pick the wrong one because you made the fundamental mistake of assuming any shell had the ball - the scammer actually got rid of it with sleight of hand. In my analogy to LLMs, the shells are the masks the LLM wears (i.e. the simulacra), and the ball is the LLM's "real identity". Do we actually have evidence there is a "real identity" in the LLM, or could it just be a pile of masks? No doubt the LLM could role-play a shoggoth - but why would you assume that's any more real than roleplaying a friendly assistant? I would propose an alternative model of an LLM: a giant pile of masks. Some masks are good, some are bad, some are easy to reach and some are hard, but none of them are the “true” LLM. Finally, let me head off one potential counterargument: "LLMs are superhuman in some tasks, so they must have an underlying superintelligence”. Three reasons a pile of masks can be superintelligent: An individual mask might be superintelligent. E.g.
a mask of John von Neumann would be well outside the normal distribution of human capabilities, but still just be a mask. The AI might use the best mask for each job. If the AI has masks of a great scientist, a great doctor, and a great poet, it could be superhuman on the whole by switching between its modes. The AI might collaborate with itself, gaining the wisdom of the crowds. Imagine the AI answering a multiple choice question. In the framework of Simulacra Theory as described in the Waluigi post, the LLM is simulating all possible simulacra, and averaging their answers weighted by their likelihood of producing the previous text. For example, if the question could have been produced by a scientist, a doctor, or a poet, who would respectively answer (A or B), (A or C), and (A or D), the superposition of these simulacra would answer A. This could produce better answers than any individual mask. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 LW - Anthropic's Core Views on AI Safety by Zac Hatfield-Dodds 03:25 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic's Core Views on AI Safety, published by Zac Hatfield-Dodds on March 9, 2023 on LessWrong. We founded Anthropic because we believe the impact of AI might be comparable to that of the industrial and scientific revolutions, but we aren’t confident it will go well. And we also believe this level of impact could start to arrive soon – perhaps in the coming decade. This view may sound implausible or grandiose, and there are good reasons to be skeptical of it. For one thing, almost everyone who has said “the thing we’re working on might be one of the biggest developments in history” has been wrong, often laughably so.
Nevertheless, we believe there is enough evidence to seriously prepare for a world where rapid AI progress leads to transformative AI systems. At Anthropic our motto has been “show, don’t tell”, and we’ve focused on releasing a steady stream of safety-oriented research that we believe has broad value for the AI community. We’re writing this now because as more people have become aware of AI progress, it feels timely to express our own views on this topic and to explain our strategy and goals. In short, we believe that AI safety research is urgently important and should be supported by a wide range of public and private actors. So in this post we will summarize why we believe all this: why we anticipate very rapid AI progress and very large impacts from AI, and how that led us to be concerned about AI safety. We’ll then briefly summarize our own approach to AI safety research and some of the reasoning behind it. We hope by writing this we can contribute to broader discussions about AI safety and AI progress. As a high level summary of the main points in this post:
AI will have a very large impact, possibly in the coming decade
Rapid and continuing AI progress is a predictable consequence of the exponential increase in computation used to train AI systems, because research on “scaling laws” demonstrates that more computation leads to general improvements in capabilities. Simple extrapolations suggest AI systems will become far more capable in the next decade, possibly equaling or exceeding human level performance at most intellectual tasks. AI progress might slow or halt, but the evidence suggests it will probably continue.
We do not know how to train systems to robustly behave well
So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless. Furthermore, rapid AI progress will be disruptive to society and may trigger competitive races that could lead corporations or nations to deploy untrustworthy AI systems.
The results of this could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations.
We are most optimistic about a multi-faceted, empirically-driven approach to AI safety
We’re pursuing a variety of research directions with the goal of building reliably safe systems, and are currently most excited about scaling supervision, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize. A key goal of ours is to differentially accelerate this safety work, and to develop a profile of safety research that attempts to cover a wide range of scenarios, from those in which safety challenges turn out to be easy to address to those in which creating safe systems is extremely difficult. The full post goes into considerably more detail, and I'm really excited that we're sharing more of our thinking publicly. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 LW - 2022 Survey Results by Screwtape 51:23 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2022 Survey Results, published by Screwtape on March 8, 2023 on LessWrong. The Data 0. Population There were 186 respondents over 28 days. The first week of the survey got 172 responses. The second week got 10. The third and fourth weeks got 2 responses each. Previous surveys have been run over the past decade or so. Their numbers are as follows:
2009: 166
2011: 1090
2012: 1195
2013: 1636
2014: 1503
2016: 3083
2017: "About 300"
2020: 61
2022: 186
The cheerfully optimistic interpretation of this is something like "cool! I got about as many as Scott did on his first try!" I'll talk about the survey size more in the Conclusion section. For now, on with the numbers.
Previous LessWrong Surveys:
No: 131, 72.0%
Prefer not to answer: 6, 3.3%
Yes: 45, 24.7%
Did you also take the ACX 2022 Survey?
No: 93, 50.5%
Prefer not to answer: 3, 1.6%
Yes: 88, 47.8%
I. Demographics (Not counting sex and gender)
There's a format I'll use in a few places throughout this summation. If you see a sequence of numbers like "30.1 + 8.9 (24, 28, 34) [n=186]" those numbers are "Mean + standard deviation (1st quartile, 2nd quartile, 3rd quartile) [n = number responding]."
Age: 30.1 + 8.9 (24, 28, 34) [n=186]
I broke this out into buckets by decade so I could get a better sense of what was going on.
Under 20: 13, 7.0%
20 to 29: 91, 49.2%
30 to 39: 57, 30.8%
40 to 49: 14, 7.6%
50 to 59: 8, 4.3%
60 to 69: 2, 1.1%
The 2009 to 2014 surveys consistently found the mean age was 27, and in 2016 the mean age had only gone up to 28. This is an interesting property for a community to have! It's far from unusual: any survey of a college campus will find the mean age stays about the same, and for different reasons so can the mean age of countries. A survey of Britney Spears fans probably finds that the mean age goes up over time, and many churches observe the same pattern. It's one of several high level ways to look at a community: are you replacing yourself about as fast as you lose people? LessWrong is probably a community people find and age out of. We're a bit older. In 2011, that 20 to 29 bucket was about 60%. Still, teenagers and college students are still finding us.
Race:
White (non-Hispanic): 148, 82.7%
Asian (East Asian): 6, 3.4%
Asian (Indian subcontinent): 6, 3.4%
Hispanic: 5, 2.8%
Middle Eastern: 3, 1.7%
Other: 14, 7.8%
Country:
United States: 95, 51.1%
United Kingdom: 20, 10.8%
Australia: 12, 6.5%
Germany: 11, 5.9%
New Zealand: 6, 6.6%
France: 6, 6.6%
Canada: 5, 5.5%
Russia: 4, 4.4%
Israel: 4, 4.4%
Other: 22, 24%
[The Other category is mostly Eastern European countries with one or two respondents.
If there were three or fewer respondents from your country, I changed your country to Other for the public dataset in case being the one person in your country was too identifying.]
II. Sex, Gender, and Relationships
Sex:
Female: 21, 11.5%
Male: 162, 88.5%
Gender:
F (cisgender): 18, 9.8%
F (transgender MtF): 6, 3.3%
M (cisgender): 152, 83.1%
M (transgender FtM): 0, 0%
Other: 7, 3.8%
Sexual Orientation:
Asexual: 7, 3.9%
Bisexual: 24, 13.3%
Heterosexual: 145, 80.1%
Homosexual: 5, 2.8%
Other: 0, 0%
Relationship Style:
Prefer monogamous: 106, 58.9%
Prefer polyamorous: 29, 16.1%
Uncertain/no preference: 44, 24.4%
Other: 1, 0.6%
Number of Partners:
0: 81, 44.5%
1: 90, 49.5%
2: 9, 4.9%
3: 1, 0.5%
4: 0, 0%
5: 0, 0%
Lots and lots: 1, 0.5%
Relationship Goals:
. . . and currently not looking for more relationship partners: 119, 67.6%
. . . and currently looking for more relationship partners: 57, 32.4%
[39.0% of people who don’t have a partner aren’t looking for one.]
To the one person with lots and lots of partners and looking for more, rock on you ambitious fellow.
Relationship Status:
Married: 40, 21.9%
Relationship: 59, 32.2%
Single: 84, 45.9%
[Two single people have a non-zero number of partners.]
Living With:
Alone: 41, 23.0%
With parents and/or guardians: 29, 16.3%
With part... Mar 09, 2023 LW - Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent by ArthurB
03:02

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent, published by ArthurB on March 9, 2023 on LessWrong. There's a wave of people, of various degrees of knowledge and influence, currently waking up to the ideas of AI existential risk. They seem to be literally going through every box of the bad-alignment-takes bingo card. I think there is value in educating those people. I'm aware there's an argument to be made that: education at scale doesn't matter, coordination is too difficult, all that matters is solving alignment and that takes care of the rest. There's something to that, but I disagree that education at scale doesn't help. It can make the progress of frontrunners marginally more safety oriented, it can steer company cultures, it can move the Overton window, change the Zeitgeist, it can buy a bit of time. You likely didn't stumble on these ideas all on your own, so arguing against the value of outreach or education is also arguing against your own ability to do anything. It's also a matter of ROI, and there is some very low hanging fruit there. The simplest thing would be to write a long FAQ that goes through every common objection. No, people won't read the whole Sequences, or Arbital, on their own, but they might go through a FAQ. But we can do better than a FAQ. It's now fairly straightforward, with tools like langchain, to turn a set of documents into a body of knowledge for a conversational agent. This is done by building an index of embeddings that a language model can search to bring context to an answer. This doesn't preclude fine tuning, but it makes it unnecessary. So a straightforward project is to index LessWrong, index Arbital, index the Alignment Forum, maybe index good alignment papers as well, blog posts, books.
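A minimal sketch of that index-and-retrieve step, with unit-length bag-of-words vectors standing in for a real embedding model and made-up snippets standing in for the indexed corpus (names and documents here are illustrative, not from the post):

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def embed(text, vocab):
    # Unit-length bag-of-words vector over a fixed vocabulary.
    # A toy stand-in for a real embedding model.
    counts = Counter(tokens(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_k(question, docs, k=2):
    # Score every document by cosine similarity to the question
    # and return the k best matches to use as prompt context.
    vocab = sorted(set(tokens(question)).union(*(tokens(d) for d in docs)))
    q = embed(question, vocab)
    scored = sorted(
        ((sum(a * b for a, b in zip(q, embed(d, vocab))), d) for d in docs),
        reverse=True,
    )
    return [d for _, d in scored[:k]]

# Made-up snippets standing in for indexed LessWrong/Arbital passages.
docs = [
    "Instrumental convergence: most goals imply seeking power and resources.",
    "The orthogonality thesis: intelligence and goals vary independently.",
    "A recipe for maintaining a sourdough starter.",
]
question = "Why would an AI seek power?"
context = top_k(question, docs)
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\nQ: {question}"
```

In the real project, embed would call an actual embedding model over the whole corpus, and the assembled prompt would be sent to the chat model.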
Then hook that up to the ChatGPT API, and prompt it to:
- list search queries for relevant material to answer the question
- compose an answer that reflects the content and opinion of the data
- answer with infinite patience
Some jailbreak prompts may be needed to prevent ChatGPT's conditioning from making it regurgitate AI-risk-appeasing propaganda through the API, but there are a bunch of those out there. Or use the APIs of other models as they become open source or commercially available. Will this save humanity? No. Will this turn the course of safety research? Also no. Is this using AI to advance alignment? Well, yes, a little bit; don't dismiss very small starts. Is this worth spending a weekend hacking on this project instead of posting on Twitter? Absolutely. Will this actually make things worse? No, you're overthinking this. I'll pay $5k to the best version built by the end of March (if any is built). It's a modest bounty, but it's really not all that much work, and it's fun work. And of course if anyone wants to add their own contribution to the bounty, please do. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 09, 2023
LW - How bad a future do ML researchers expect? by AI Impacts
03:17

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How bad a future do ML researchers expect?, published by AI Impacts on March 9, 2023 on LessWrong. Katja Grace, 8 March 2023 In our survey last year, we asked publishing machine learning researchers how they would divide probability over the future impacts of high-level machine intelligence between five buckets ranging from ‘extremely good (e.g. rapid growth in human flourishing)’ to ‘extremely bad (e.g. human extinction)’.1 The median respondent put 5% on the worst bucket. But what does the whole distribution look like? Here is every person’s answer, lined up in order of probability on that worst bucket: And here’s basically that again from the 2016 survey (though it looks like it was sorted slightly differently when optimism was equal), so you can see how things have changed: The most notable change to me is the new big black bar of doom at the end: people who think extremely bad outcomes are at least 50% have gone from 3% of the population to 9% in six years. Here are the overall areas dedicated to different scenarios in the 2022 graph (equivalent to averages): Extremely good: 24% On balance good: 26% More or less neutral: 18% On balance bad: 17% Extremely bad: 14% That is, between them, these researchers put 31% of their credence on AI making the world markedly worse. Some things to keep in mind in looking at these: If you hear ‘median 5%’ thrown around, that refers to how the researcher right in the middle of the opinion spectrum thinks there’s a 5% chance of extremely bad outcomes. (It does not mean, ‘about 5% of people expect extremely bad outcomes’, which would be much less alarming.) Nearly half of people are at ten percent or more.
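The median-versus-share distinction above is easy to illustrate with made-up numbers (these are not the survey's actual responses):

```python
import statistics

# Hypothetical per-respondent probabilities of an "extremely bad" outcome.
responses = [0.01, 0.02, 0.05, 0.05, 0.10, 0.25, 0.50]

median = statistics.median(responses)  # the middle respondent's credence
mean = statistics.mean(responses)      # analogous to the "area" figure
share_50_plus = sum(r >= 0.50 for r in responses) / len(responses)

print(f"median: {median:.0%}")              # 5%: the statistic usually quoted
print(f"mean: {mean:.0%}")                  # 14%: average credence is far higher
print(f"share at >= 50%: {share_50_plus:.0%}")  # 14%: the 'big black bar' group
```

Reading 'median 5%' as '5% of researchers expect doom' would thus conflate three different numbers: the middle respondent's credence, the average credence, and the fraction of strong pessimists.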
The question illustrated above doesn’t ask about human extinction specifically, so you might wonder if ‘extremely bad’ includes a lot of scenarios less bad than human extinction. To check, we added two more questions in 2022 explicitly about ‘human extinction or similarly permanent and severe disempowerment of the human species’. For these, the median researcher also gave 5% and 10% answers. So my guess is that a lot of the extremely bad bucket in this question is pointing at human extinction levels of disaster. You might wonder whether the respondents were selected for being worried about AI risk. We tried to mitigate that possibility by usually offering money for completing the survey ($50 for those in the final round, after some experimentation), and describing the topic in very broad terms in the invitation (e.g. not mentioning AI risk). Last survey we checked in more detail—see ‘Was our sample representative?’ in the paper on the 2016 survey. Here’s the 2022 data again, but ordered by overall optimism-to-pessimism rather than probability of extremely bad outcomes specifically: For more survey takeaways, see this blog post. For all the data we have put up on it so far, see this page. See here for more details. Thanks to Harlan Stewart for helping make these 2022 figures, Zach Stein-Perlman for generally getting this data in order, and Nathan Young for pointing out that figures like this would be good. Notes Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 09, 2023 LW - Against LLM Reductionism by Erich Grunewald 30:58 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against LLM Reductionism, published by Erich Grunewald on March 8, 2023 on LessWrong.
Summary Large language models (henceforth, LLMs) are sometimes said to be "just" shallow pattern matchers, "just" massive look-up tables or "just" autocomplete engines. These comparisons amount to a form of (methodological) reductionism. While there's some truth to them, I think they smuggle in corollaries that are either false or at least not obviously true. For example, they seem to imply that what LLMs are doing amounts merely to rote memorisation and/or clever parlour tricks, and that they cannot generalise to out-of-distribution data. In fact, there's empirical evidence that suggests that LLMs can learn general algorithms and can contain and use representations of the world similar to those we use. They also seem to suggest that LLMs merely optimise for success on next-token prediction. It's true that LLMs are (mostly) trained on next-token prediction, and it's true that this profoundly shapes their output, but we don't know whether this is how they actually function. We also don't know what sorts of advanced capabilities can or cannot arise when you train on next-token prediction. So there's reason to be cautious when thinking about LLMs. In particular, I think, caution should be exercised (1) when making predictions about what LLMs will or will not in future be capable of and (2) when assuming that such-and-such a thing must or cannot possibly happen inside an LLM. Pattern Matchers, Look-up Tables, Stochastic Parrots My understanding of what goes on inside machine learning (henceforth, ML) models, and LLMs in particular, is still in many ways rudimentary, but it seems clear enough that, however tempting that is to imagine, it's little like what goes on in the minds of humans; it's weirder than that, more alien, more eldritch. 
As LLMs have been scaled up, and more compute and data have been poured into models with more parameters, they have undergone qualitative shifts, and are now capable of a range of tasks their predecessors couldn't even grasp, let alone fail at, even as they have retained essentially the same architecture and training process.[1] How do you square their awesome, if erratic, brilliance with the awareness that their inner workings are so ordinary? One route would be to directly deny the brilliance. Gary Marcus does this, pointing out, and relishing in, the myriad ways that LLMs misfire. Their main limits are, he says, that they are unreliable and untruthful. (See the footnote for my thoughts on that.[2]) That's one route, but it's not the one I want to discuss here. The route I want to discuss here is to dispel the magic, so to speak: to argue that what goes on inside LLMs is "shallow", and that LLMs lack "understanding". This often takes the form of asserting that LLMs are just doing pattern matching[3], or just rephrasing material from the web[4], amounting to mere stochastic parrots[5], or just retrieving things from a massive look-up table. Gary Marcus describes the underlying problem as one of "a lack of cognitive models of the world": The improvements, such as they are, come primarily because the newer models have larger and larger sets of data about how human beings use word sequences, and bigger word sequences are certainly helpful for pattern matching machines. But they still don't convey genuine comprehension, and so they are still very easy [...] to break. Well -- in a certain light and for the sake of fairness -- this view is not entirely wrong: LLMs are, in a sense, pattern matching. They likely have a great deal of attention heads and neurons and whatever that detect certain patterns in the input, which then help determine the model's output. LLMs are, in a sense, merely rephrasing material from the web. All, or nearly all, of the data that th... 
Mar 08, 2023 LW - The Kids are Not Okay by Zvi 49:19 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Kids are Not Okay, published by Zvi on March 8, 2023 on LessWrong. It has been a subject of much recent internet discourse that the kids are not okay. By all reports, the kids very much seem to be not all right. Suicide attempts are up. Depressive episodes are way up. The general vibes and zeitgeist one gets (or at least that I get) from young people are super negative. From what I can tell, they see a world continuously getting worse along numerous fronts, without an ability to imagine a positive future for the world, and without much hope for a positive future for themselves. Should we blame the climate? Should we blame the phones? Or a mind virus turning them to drones? Heck, no! Or at least, not so fast. Let’s first lay out the evidence and the suspects.1 Then, actually, yes. Spoiler alert, I’m going to blame the phones and social media. After that, I’ll briefly discuss what might be done about it. Suicide Rates The suicide numbers alone would seem at first to make it very very clear how not all right the kids are. Washington Post reports, in an exercise in bounded distrust: Nearly 1 in 3 high school girls reported in 2021 that they seriously considered suicide — up nearly 60 percent from a decade ago — according to new findings from the Centers for Disease Control and Prevention. Almost 15 percent of teen girls said they were forced to have sex, an increase of 27 percent over two years and the first increase since the CDC began tracking it. Thirteen percent [of girls] had attempted suicide during the past year, compared to 7 percent of boys. One child in ten attempted suicide this past year, and it is steadily increasing? Yikes. 
There is a big gender gap here, but as many of you already suspect because the pattern is not new, it is not what you would think from the above. In the U.S., male adolescents die by suicide at a rate five times greater than that of female adolescents, although suicide attempts by females are three times as frequent as those by males. A possible reason for this is the method of attempted suicide for males is typically that of firearm use, with a 78–90% chance of fatality. Females are more likely to try a different method, such as ingesting poison.[8] Females have more parasuicides. This includes using different methods, such as drug overdose, which are usually less effective. I am going to go ahead and say that if males die five times as often from suicide, that seems more important than the number of attempts. It is kind of stunning, or at least it should be, to have five boys die for every girl that dies, and for newspapers and experts to make it sound like girls have it worse here. Very big ‘women have always been the primary victims of war. Women lose their husbands, their fathers, their sons in combat’ (actual 1998 quote from Hillary Clinton) energy. The conflation of suicide rates with forced sex here seems at best highly misleading. The sexual frequency number is rather obviously a reflection of two years where people were doing rather a lot of social distancing. With the end of that, essentially anything social is going to go up in frequency, whether it is good, bad or horrifying – only a 27 percent increase seems well within the range one would expect from that. Given all the other trends in the world, it would be very surprising to me if the rates of girls being subjected to forced sex (for any plausible fixed definition of that) were not continuing to decline. That implies that in the past, things on such fronts were no-good, horribly terrible, and most of it remained hidden. I do indeed believe exactly this. Also, can we zoom out a bit?
On a historical graph, the suicide rate does not look all that high (scale is suicides per 100,000 children, per year)? The kids are not okay. The kids in the 1990s were, by some of these g... Mar 08, 2023 LW - AI Safety in a World of Vulnerable Machine Learning Systems by AdamGleave 52:31 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety in a World of Vulnerable Machine Learning Systems, published by AdamGleave on March 8, 2023 on LessWrong. Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial robustness to be a problem that will be solved naturally as machine learning (ML) systems grow more capable and general. However, recent work has shown that superhuman systems in a narrow domain such as AlphaZero are highly vulnerable to adversarial attack, as are general but less capable systems like large language models. This raises the possibility that adversarial (worst-case) robustness will continue to lag behind average-case capabilities. In other words, transformative AI systems are likely to be exploitable. Exploitability will cause a wide variety of current alignment proposals to fail. Most extant agendas seek to align the main ML system with the assistance of helper ML systems. The main ML system is the primary system that takes actions in the world (e.g. interacting with users), with the helper ML systems acting as scaffolding to train and/or verify the main ML system. These alignment schemes will fail if the helpers are exploited by the main system – and we expect helpers to be vulnerable to exploitation (see Contemporary ML systems are exploitable by default). 
In Table 1 we present a subjective risk matrix for a range of popular alignment agendas, evaluating the degree to which main ML systems have the ability and incentive to exploit the helper. We find many alignment agendas have a high risk of exploitation, with all having at least some risk.
Alignment Agenda | Main System’s Ability to Exploit Helper | Main System’s Incentive to Exploit Helper | Risk of Exploit
RL on learned reward model (e.g. RLHF, IRL) | Medium | High | High
Scalable oversight (e.g. recursive reward modeling, AI safety via debate) | Medium | High | High
Imitation learning (e.g. behavioral cloning, supervised fine-tuning) | Medium | Low | Low-Medium
Imitative Iterated Distillation and Amplification | High | Low | Medium
Auditing Tool (e.g. Adversarial Testing, Transparency) | Low | Medium | Low-Medium
Table 1: Subjective risk matrix for popular alignment agendas (see next section), using a helper ML system to assist with aligning the main ML system that will eventually be deployed.
We are most concerned by vulnerabilities in the helpers as this can impact the alignment of the main system. By contrast, an aligned but adversarially exploitable main system would not necessarily pose a danger, especially if the main system can recursively self-improve to fix itself. However, there is a possibility that even superintelligent systems cannot attain adversarial robustness. This would be a volatile situation, which could conceivably collapse into chaos (systems frequently exploiting each other), an implicit equilibrium (e.g. mutually assured destruction), or an explicit agreement (e.g. all AI systems self-modify to commit to not exploiting one another). We see two possible approaches to fixing this: improving adversarial robustness, or developing fault tolerant alignment methods that can work even in the presence of vulnerable ML systems. We are most excited by fault tolerant alignment, as it is highly neglected and plausibly tractable, although further work is needed to solidify this approach.
By contrast, adversarial robustness is an area that has received significant attention from the ML research community (low neglectedness)[1] but with only modest progress (low to medium tractability). In the remainder of this document, we will argue that systems are exploitable by default, explore the implications this has for alignment agendas in several different scenarios, and outline several research directions we are excited by. Alignment agendas need robustness Most alignment schemes implicitl... Mar 08, 2023 LW - A Brief Defense of Athleticism by Wofsen 01:44 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Brief Defense of Athleticism, published by Wofsen on March 7, 2023 on LessWrong. Non-athletic thinkers are myopic. I see at least three very important reasons any rationalist must value exercise. Speaking as someone who has trained extremely hard in distance running, I find that sports in general and athletic conditioning in particular create an intuitive understanding of the fallaciousness of Cartesian Dualism and the accuracy of materialism in the sense that the mind is merely a part of the body. Physical challenges also force one to understand the limitations of Kahneman's "system two." For example, one may know what it means to not start a race too quickly and then "die" (running jargon for running out of steam prematurely), but repeated failures in actual races teach one to realize the limitations of one's rationality, especially under stress (even if distance running is classified as eustress rather than distress; remember, cortisol levels & arousal are the same in either case). Conditioning has been shown to increase cognitive performance. A quick Google search finds: This study notes better reaction time in fit adolescents (no causal link, n = 30).
This review shows decreased "system 2" aging in physically fit elderly individuals (causality likely, preponderance of evidence). Acute exercise (<1 hour) has also been shown to cause immediate improvements in decision making. I recommend a deep dive into the mechanistic research into the causes of this for structuralists. But any Bayesian looking to be more rational ought to make an update and start working out. (Edit: Grammar) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 08, 2023 LW - [Linkpost] Talk on DeepMind alignment strategy by Vika 01:23 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] Talk on DeepMind alignment strategy, published by Vika on March 7, 2023 on LessWrong. I recently gave a talk about DeepMind's alignment strategy at the SERI MATS seminar, sharing the slides here for anyone interested. This is an overview of our threat models, our high-level current plan, and how current projects fit into this plan. Disclaimer: this talk represents the views of the alignment team and is not officially endorsed by DeepMind. Our high level approach to alignment is to try to direct the training process towards aligned AI and away from misaligned AI. To illustrate this, imagine we have a space of possible models, where the red areas consist of misaligned models that are highly competent and cause catastrophic harm, and the blue areas consist of aligned models that are highly competent and don't cause catastrophic harm. The training process moves through this space and by default ends up in a red area consisting of misaligned models. We aim to identify some key point on this path, for example a point where deception was rewarded, and apply some alignment technique that directs the training process to a blue area of aligned models instead. 
Check out the slides for more details! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 07, 2023 LW - What's in your list of unsolved problems in AI alignment? by jacquesthibs 01:09 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What's in your list of unsolved problems in AI alignment?, published by jacquesthibs on March 7, 2023 on LessWrong. Question for my fellow alignment researchers out there: do you have a list of unsolved problems in AI alignment? I'm thinking of creating an "alignment mosaic" of the questions we need to resolve and slowly filling it in with insights from papers/posts. I have my own version of this, but I would love to combine it with others' alignment backcasting game-trees. I want to collect the kinds of questions people are keeping in mind when reading papers/posts, thinking about alignment or running experiments. I'm working with others to make this into a collaborative effort. Ultimately, what I'm looking for are important questions and sub-questions we need to be thinking about and updating on when we read papers and posts, as well as when we decide what to read. Here's my Twitter thread posing this question. Here's a sub-thread breaking down the alignment problem in various forms. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 07, 2023 LW - Google's PaLM-E: An Embodied Multimodal Language Model by SandXbox 01:44 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Google's PaLM-E: An Embodied Multimodal Language Model, published by SandXbox on March 7, 2023 on LessWrong.
Abstract: Large language models have been demonstrated to perform complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks, including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 07, 2023 LW - Against ubiquitous alignment taxes by beren 03:26 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against ubiquitous alignment taxes, published by beren on March 6, 2023 on LessWrong. Crossposted from my personal blog.
It is often argued that any alignment technique that works primarily by constraining the capabilities of an AI system to be within some bounds cannot work because it imposes too high an 'alignment tax' on the ML system. The argument is that people will either refuse to apply any method that has an alignment tax, or else they will be outcompeted by those who do. I think that this argument is applied too liberally and often without consideration for several key points: 1.) 'Capabilities' is not always a dial with two settings 'more' and 'less'. Capabilities are highly multifaceted and certain aspects of capabilities can be taxed or constrained without affecting others. Often, it is precisely these constraints that make the AI system economically valuable in the first place. We have seen this story play out very recently with language models where techniques that strongly constrain capabilities such as instruct finetuning and RLHF are, in fact, what create the economic value. Base LLMs are pretty much useless in practice for most economic tasks, and RLHF'd and finetuned LLMs are much more useful even though the universe of text that they can generate has been massively constrained. It just so happens that the constrained universe has a much greater proportion of useful text than the unconstrained universe of the base LLM. People are often, rationally, very willing to trade off capability and generalizability for reliability in practice. 2.) 'Capabilities' are not always good from our perspective economically. Many AGI doom scenarios require behaviour and planning that is extremely far from anything any current actor would see economic value in doing. As an extreme case, the classic paperclipper scenario typically arises because the model calculates that if it kills all humans it gets to tile the universe with paperclips in billions of years. Effectively, it Pascal's mugs itself over the dream of universal paperclips.
Having an AGI that can plan billions of years in the future is valuable to nobody today compared to one with a much, much shorter planning horizon. Constraining this 'capability' has an essentially negligible alignment tax. 3.) Small alignment taxes being intolerable is an efficient-market argument, and the near-term AGI market is likely to be extremely inefficient. Specifically, it appears likely to be dominated by a few relatively conservative tech behemoths. The current brewing arms race between Google and Microsoft/OpenAI is bad for this, but notably this is the transition from there being literally no competition to any competition at all. Economic history also shows us that the typical result of setups like this is that the arms race will quickly defuse into a cosy and slow oligopoly. Even now there is still apparently huge slack. OpenAI have almost certainly been sitting on GPT4 for many months before partially releasing it as Bing. Google have many, many unreleased large language models including almost certainly SOTA ones. 4.) Alignment taxes can (and should) be mandated by governments. Having regulations slow development and force safety protocols to be implemented is not a radical proposal and is in fact the case in many other industries where it can completely throttle progress (e.g. nuclear, with much less reason for concern). This should clearly be a focus for policy efforts. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Mar 07, 2023 LW - Who Aligns the Alignment Researchers? by Ben Smith 19:23 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Who Aligns the Alignment Researchers?, published by Ben Smith on March 5, 2023 on LessWrong.
There may be an incentives problem for AI researchers and research organizations who face a choice between researching Capabilities, Alignment, or neither. The incentives structure will lead individuals and organizations toward Capabilities work rather than Alignment work. The incentives problem is a lot clearer at the organizational level than the individual level, but bears considering at both levels, and of course, funding available to organizations has downstream implications for the jobs available for researchers employed to work on Alignment or Capabilities. In this post, I’ll describe a couple of key moments in the history of AI organizations. I’ll then survey incentives researchers might have for doing either Alignment work or Capabilities work. We’ll see that it may be that, even considering normal levels of altruism, the average person might prefer to do Capabilities rather than Alignment work. There is a relevant collective action dynamic. I’ll then survey the organizational level and global level. After that, I’ll finish by looking very briefly at why investment in Alignment might be worthwhile. A note on the dichotomous framing of this essay: I understand that the line between Capabilities and Alignment work is blurry, or worse, some Capabilities work plausibly advances Alignment, and some Alignment work advances Capabilities, at least in the short term. However, in order to model the lay of the land, it’s helpful as a simplifying assumption to examine Capabilities and Alignment as distinct fields of research and try to understand the motivations for researchers in each. History As a historical matter, DeepMind and OpenAI were both founded with explicit missions to create safe, Aligned AI for the benefit of all humanity. There are different views on the extent to which each of these organizations remains aligned to that mission. Some people maintain they are, while others maintain they are doing incredible harm by shortening AI timelines.
No one can deny that they have moved at least somewhat in the direction of more profit-making behavior, and are very much focused on Capabilities research. So, at best, they’ve stuck to their original mission, but have watered it down to allow a certain amount of profit-seeking; at worst, their overall efforts are net-negative for alignment by accelerating development of AGI. OpenAI took investment from Microsoft in January, to the tune of $10b. At the time, they said: "This multi-year, multi-billion dollar investment from Microsoft follows their previous investments in 2019 and 2021, and will allow us to continue our independent research and develop AI that is increasingly safe, useful, and powerful." And this seems plausibly like a systemic pressure other AI Capabilities researchers will face, too. Because of the disparate capital available, in order to fund research in AI Safety, any AI research organization will be incentivized to do capabilities research. On the other hand, it’s striking that no organizations founded with the goal of AI Capabilities research have drifted towards Alignment research over time. Organizations under this category might include John Carmack’s recent start-up, Keen Technologies, Alphabet, and many other organizations. Systemically, this can be explained by the rules of the capitalist environment organizations work within. If you create a company to do for-profit work, and get investors to invest in the project, they’ll expect a return. If you go public, you’ll have a fiduciary duty to obtain a return for investors. For organizations, Alignment doesn’t earn money (except in so far as it improves capabilities for tasks); Capabilities does. As the amount of money available to investors grows, more an...
Mar 06, 2023
LW - Introducing Leap Labs, an AI interpretability startup by Jessica Rumbelow
02:49

Mar 06, 2023
LW - Do humans derive values from fictitious imputed coherence? by TsviBT
24:10

Mar 06, 2023
LW - The Practitioner's Path 2.0: A new framework for structured self-improvement by Evenflair
17:41

Mar 06, 2023
LW - Why did you trash the old HPMOR.com? by AnnoyedReader
02:49

Mar 06, 2023
LW - 🤔 Coordination explosion before intelligence explosion...? by tailcalled
03:06

Mar 06, 2023
LW - Does polyamory at a workplace turn nepotism up to eleven? by Viliam
03:07

Mar 05, 2023
LW - Selective, Corrective, Structural: Three Ways of Making Social Systems Work by Said Achmiz
03:28

Mar 05, 2023
LW - Contra "Strong Coherence" by DragonGod
07:03

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contra "Strong Coherence", published by DragonGod on March 4, 2023 on LessWrong. Polished from my shortform. See also: Is "Strong Coherence" Anti-Natural? Introduction Many AI risk failure modes imagine strong coherence/goal directedness (e.g. [expected] utility maximisers). Such strong coherence is not represented in humans (or any other animal), seems unlikely to emerge from deep learning and may be "anti-natural" to general intelligence in our universe. I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it's not yet fully recovered from that error. I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems, because powerful systems in the real world are "of an importantly different type". Ontological Error? I don't think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout's shard theory feels a lot more right to me than e.g. HRAD. HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects. The shard theory account of value formation (while lacking) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals in strong coherence ontologies. I currently believe that (immutable) terminal goals is just a wrong frame for reasoning about generally intelligent systems in our world (e.g. humans, animals and future powerful AI systems).
Theoretical Justification and Empirical Investigation Needed I'd be interested in more investigation into what environments/objective functions select for coherence and to what degree said selection occurs. And empirical demonstrations of systems that actually become more coherent as they are trained for longer/"scaled up" or otherwise amplified. I want advocates of strong coherence to explain why agents operating in rich environments (e.g. animals, humans) or sophisticated ML systems (e.g. foundation models) aren't strongly coherent. And mechanistic interpretability analysis of sophisticated RL agents (e.g. AlphaStar, OpenAI Five [or replications thereof]) to investigate their degree of coherence. Conclusions Currently, I think strong coherence is unlikely (plausibly "anti-natural") and am unenthusiastic about research agendas and threat models predicated on strong coherence. Disclaimer The above is all low confidence speculation, and I may well be speaking out of my ass. By "strong coherence/goal directedness" I mean something like: Informally: a system has immutable terminal goals. Semi-formally: a system's decision making is well described as (an approximation of) argmax over actions (or higher level mappings thereof) to maximise the expected value of a single fixed utility function over states. You cannot well predict the behaviour/revealed preferences of humans or other animals by the assumption that they have immutable terminal goals or are expected utility maximisers. The ontology that intelligent systems in the real world instead have "values" (contextual influences on decision making) seems to explain their observed behaviour (and purported "incoherencies") better. Many observed values in humans and other mammals (see) (e.g. fear, play/boredom, friendship/altruism, love, etc.)
seem to be values that were instrumental for increasing inclusive genetic fitness (promoting survival, exploration, cooperation and sexual reproduction/survival of progeny respectively). Yet, humans and mammals seem to value these terminally and not because of their instrumental value on inclusive genetic fitness. That the instrumentally convergent goals of evolution's fitness criterion manifested as "terminal" values in mammals is IMO strong empiric...
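The semi-formal definition above can be made concrete with a toy sketch. This is entirely my illustration (the agents, actions, and numbers are made up, not from the post): a "strongly coherent" agent argmaxes a single fixed utility function in every context, while a shard-style agent's decisions emerge from contextual value influences.

```python
# Toy contrast between the two ontologies. Entirely illustrative:
# the agents, actions, and weights are my own, not from the post.

ACTIONS = ["explore", "eat", "socialize"]

def coherent_agent(expected_utility):
    """Strong coherence: argmax over actions of a single fixed utility function."""
    return max(ACTIONS, key=expected_utility)

def shard_agent(context, shards):
    """Shard-style values: contextual influences on decision making.
    Each shard is (activation, value); its influence depends on the context."""
    scores = {a: 0.0 for a in ACTIONS}
    for activation, value in shards:
        weight = activation(context)
        for a in ACTIONS:
            scores[a] += weight * value(a)
    return max(scores, key=scores.get)

# The coherent agent picks the same action in every context.
utility = {"explore": 0.2, "eat": 1.0, "socialize": 0.5}
print(coherent_agent(lambda a: utility[a]))  # "eat", regardless of context

# The shard agent's behaviour shifts with context, with no fixed terminal goal.
shards = [
    (lambda ctx: ctx["hunger"],  lambda a: 1.0 if a == "eat" else 0.0),
    (lambda ctx: ctx["boredom"], lambda a: 1.0 if a == "explore" else 0.0),
]
print(shard_agent({"hunger": 0.9, "boredom": 0.1}, shards))  # "eat"
print(shard_agent({"hunger": 0.1, "boredom": 0.9}, shards))  # "explore"
```

The point of the sketch is that the shard agent's "revealed preferences" look incoherent if you insist on modelling it with one fixed utility function, yet its behaviour is perfectly predictable from its contextual values.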
Mar 05, 2023
LW - Why Not Just... Build Weak AI Tools For AI Alignment Research? by johnswentworth
11:36

Mar 05, 2023
LW - Contra Hanson on AI Risk by Liron
13:32

Mar 04, 2023
LW - AI Governance and Strategy: Priorities, talent gaps, and opportunities by Akash
07:22

Mar 04, 2023
LW - How popular is ChatGPT? Part 2: slower growth than Pokémon GO by Richard Korzekwa
08:02

Mar 04, 2023
LW - state of my alignment research, and what needs work by carado
03:59

Mar 04, 2023
LW - Acausal normalcy by Andrew Critch
12:24

Mar 04, 2023
LW - Sydney can play chess and kind of keep track of the board state by Erik Jenner
09:33

Mar 03, 2023
LW - Robin Hanson’s latest AI risk position statement by Liron
01:31

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robin Hanson’s latest AI risk position statement, published by Liron on March 3, 2023 on LessWrong. “While I’ve written on this many times before, it seems time to restate my position.” “While I agree that this is a logically possible scenario, not excluded by what we know, I am disappointed to see so many giving it such a high credence, given how crazy far it seems from our prior experience. Yes, there is a sense in which the human, farming, and industry revolutions were each likely the result of a single underlying innovation. But those were the three biggest innovations in all of human history. And large parts of the relevant prior world exploded together in those cases, not one tiny part suddenly exterminating all the rest. In addition, the roughly decade duration predicted from prior trends for the length of the next transition period seems plenty of time for today’s standard big computer system testing practices to notice alignment issues. And note that the impressive recent AI chatbots are especially unlike the systems of concern here: self-improving very-broadly-able full-agents with hidden intentions. Making this an especially odd time to complain that new AI systems might have killed us all.” Seems not much has changed in the Yudkowsky vs. Hanson position over the years, i.e. still assigning high vs. low existential risk. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Mar 03, 2023
LW - The Waluigi Effect (mega-post) by Cleo Nardo
26:00

Mar 03, 2023
LW - Why I’m not into the Free Energy Principle by Steven Byrnes
14:13

Mar 02, 2023
LW - Learn the mathematical structure, not the conceptual structure by Adam Shai
03:29

Mar 02, 2023
LW - The Parable of the King and the Random Process by moridinamael
09:15

Mar 02, 2023
LW - Sunlight is yellow parallel rays plus blue isotropic light by Thomas Kehrenberg
02:59

Mar 01, 2023
LW - Implied "utilities" of simulators are broad, dense, and shallow by porby
06:45

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on LessWrong. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? 
Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. 
It seems critical to understand the degree to which outer constraints apply to inner lea...
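The dense-versus-sparse contrast can be sketched as a toy count of feedback signals. This is my illustration of the point, not code from the post: a next-token predictor receives one graded evaluation per token position, while a sparse-reward RL agent receives one evaluation per episode.

```python
# Toy count of training feedback signals (illustrative only):
# a next-token predictor gets one graded evaluation per token position,
# while a sparse-reward RL agent gets one evaluation per episode.

def dense_feedback_signals(token_sequences):
    """Next-token prediction: every position after the first has a ground-truth
    target, so every position yields an immediate evaluation."""
    return sum(len(seq) - 1 for seq in token_sequences)

def sparse_feedback_signals(episodes):
    """Sparse-reward RL: only the end of each episode is scored."""
    return len(episodes)

corpus = [["the", "cat", "sat"], ["hello", "world"]]
episodes = [["s0", "s1", "s2"], ["s0", "s1"]]

print(dense_feedback_signals(corpus))    # 3 next-token targets
print(sparse_feedback_signals(episodes)) # 2 terminal rewards
```

At realistic scale the gap is enormous: trillions of per-token constraints versus a handful of episodic rewards, which is the sense in which the predictor's implied "utility" is broadly and densely pinned down.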
Mar 01, 2023
LW - AI: Practical Advice for the Worried by Zvi
22:22

Mar 01, 2023
LW - Problems of people new to AI safety and my project ideas to mitigate them by Igor Ivanov
11:10

Mar 01, 2023
LW - reflections on lockdown, two years out by mingyuan
04:08

Mar 01, 2023
LW - Predictions for shard theory mechanistic interpretability results by TurnTrout
09:26

Mar 01, 2023
LW - Contract Fraud by jefftk
01:30

Mar 01, 2023
LW - Inside the mind of a superhuman Go model: How does Leela Zero read ladders? by Haoxing Du
51:13

Mar 01, 2023
LW - A mostly critical review of infra-Bayesianism by matolcsid
43:06

Mar 01, 2023
LW - My Experience With Loving Kindness Meditation by maia
04:46

Feb 28, 2023
LW - Transcript: Yudkowsky on Bankless follow-up QandA by vonk
42:20

Feb 28, 2023
LW - $20 Million in NSF Grants for Safety Research by Dan H 01:24 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: $20 Million in NSF Grants for Safety Research, published by Dan H on February 28, 2023 on LessWrong. After a year of negotiation, the NSF has announced a $20 million request for proposals for empirical AI safety research. Here is the detailed program description. The request for proposals is broad, as is common for NSF RfPs. Many safety avenues, such as transparency and anomaly detection, are in scope: "reverse-engineering, inspecting, and interpreting the internal logic of learned models to identify unexpected behavior that could not be found by black-box testing alone" "Safety also requires... methods for monitoring for unexpected environmental hazards or anomalous system behaviors, including during deployment." Note that research that has high capabilities externalities is explicitly out of scope: "Proposals that increase safety primarily as a downstream effect of improving standard system performance metrics unrelated to safety (e.g., accuracy on standard tasks) are not in scope." Thanks to OpenPhil for funding a portion of the RfP---their support was essential to creating this opportunity! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 28, 2023 LW - Fertility Rate Roundup #1 by Zvi 17:46 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fertility Rate Roundup #1, published by Zvi on February 27, 2023 on LessWrong. Previously: On Car Seats as Contraception [Editor’s Note: This post assumes the perspective that more people having more children is good, actually.
I will not be engaging with any of the arguments against this, of any quality, whether they be ‘AI or climate change is going to kill everyone’ or ‘people are bad actually,’ other than to state here that I strongly disagree. AI content will continue later this week.] A common theme in childhood roundups has been that existing efforts by governments, to increase the number of children born in various countries, have all been pathetically small in magnitude. The amounts spent and methods used pale in comparison to what is at stake. We reliably see signs that the policies work, even when poorly designed and implemented – the same way that when policies attempt to reduce the birth rate, those work as well. The core problem is the dose is too low. Yes. If you give parents money, more people choose to be parents. And the amount necessary to make this happen is, if you crunch the numbers, an amount rapidly aging societies can’t afford not to pay. The other theme is, as I discuss in On Car Seats as Contraception, that there are lots of other government policies that have much bigger impacts on the felt costs of having and raising children, including the lifestyle and status impacts of raising children. This is a roundup of related efforts that have crossed my desk recently, to illustrate that this is a highly solvable problem. Childcare Child care in America continues to be super expensive. People who understand economics understand that this is true because we combine large purchasing subsidies with onerous baseline requirements that drive up costs. Whereas you could (at least partly) solve this problem in the style of Vermont, by doing much less of both these things – removing price barriers for the bottom half and removing subsidies for at least the top half, instead Giving Parents Money mostly in the form of lower taxes.
I also continue to think that it is madness to subsidize and massively favor professionally provided child care over family provided child care, where as far as I can tell most everyone agrees children are better off with family provided child care, which we are now essentially pricing out of the market. While in other contexts, of course, massively favoring family provided care via the tax code. Alternatively, you could do what DC does (direct). Requiring a college degree to provide child care is one of those places I fail the ITT. Meanwhile Department of State proposes gutting the Au Pair program. They of course refer to this as things like ‘strengthening the educational component.’ By all reports I’ve seen the au pair program is insanely great as it is except it isn’t large enough. Win for everyone involved. If prospective parents confidently knew they could participate in the Au Pair program, I would predict a substantial increase in the birth rate. This could be a full plan. Houses or apartments and locations could be chosen with this in mind, and life could be much easier to manage and predict. A study in Finland finds that paying moms to stay home results in them staying home substantially more and working substantially less, including in the long term, whereas subsidizing child care and thus work (and by extension effectively taxing staying at home) has the opposite effect (although the paper’s results seem to not technically be ‘significant’, so salt as needed). You don’t say. You can either prefer to have mothers stay home or prefer to have mothers work, and people respond to incentives. You can get whatever change you want if you care enough. You do need to choose. Parental Leave At People’s Policy Project, Matt Bruenig... 
Feb 27, 2023 LW - Respect Chesterton-Schelling Fences by shminux 01:44 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Respect Chesterton-Schelling Fences, published by shminux on February 27, 2023 on LessWrong. This post is inspired by the recent Ziz-revelations posted here and elsewhere. Chesterton Fence injunction: Do not remove a protective barrier unless you know why it was put there in the first place. Schelling fence injunction: Do not cross a generally agreed upon guardrail, lest you end up sliding down an increasingly slippery slope, not noticing until it is too late. I think a term like a Chesterton-Schelling Fence injunction might be useful: Respect an ethical injunction even if you think you know why it was put there in the first place. A somewhat simplified example: There is a rather strong Schelling fence against, say, killing someone. Suppose the stated reasoning behind it is "God commanded so". Some day, you deconvert and start questioning the tenets of your faith, throwing out one injunction after another, assuming you know why it was there, not realizing that this particular Chesterton fence is fake; the real reason is an unstated Schelling fence that has little to do with religion, but a lot to do with living in a society. I said "respect", not "obey", because it is often hard to tell whether there is a hidden Schelling fence behind a Chesterton fence, and how strong the former is. Or vice versa. Or how many various hidden fences there are. Is it okay to cheat in an unhappy marriage? Maybe, maybe not, but noticing that this is unsafe territory, that respecting the societal norms is generally a safe default, and that crossing it is likely to backfire in both expected and unexpected ways can be quite useful. Thanks for listening. 
To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 27, 2023 LW - Learning How to Learn (And 20+ Studies) by maxa 11:28 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Learning How to Learn (And 20+ Studies), published by maxa on February 26, 2023 on LessWrong. I have been interested in cognitive science and effective learning methods for years. I've read multiple books and articles and put many ideas to the test. In this post, I aim to synthesize my notes and provide references to scientific studies. Summary (TL;DR) Effective Learning Strategies Distributed learning. Study less in each session but more frequently. Active recall. Actively test your knowledge and skills. Distributed recall. Space the tests in time and adjust the intervals based on performance. Interleaving. Practice multiple related yet different skills/concepts simultaneously. Elaborative interrogation (quiz-and-recall). Ask yourself questions and use the material you've learned to answer them. Self-explanation and the Feynman technique. Explain what you've just learned in simple terms. Physiology and Brain Health Sleep Exercise Nutrition Disclaimer and Introduction I have no formal background in cognitive science or neuroscience, and this has been more of a side interest. My understanding is limited, and I still need to learn how to effectively and consistently apply all these ideas in practice. That being said, I found some of the methods described in this article very useful. For example, I've used them to learn foreign languages, the basics of programming, and various disciplines covered during the two-year MBA program. Effective Learning Strategies Strategy #1: Distributed (Spaced) Learning Practice In short, it's better to distribute one's practice over a period of time than cram it into one day. 
In one study, elementary school students were asked to study in one of three ways: massed, clumped, and spaced. Massed = four lessons at a time Clumped = two lessons on one day and two lessons on the next day Spaced = one lesson per day for four days The “spaced” group performed best, followed by the “clumped” group. Another study compared comprehension scores under three different conditions: Read a text once (“single”) Read a text twice (“massed”) Read a text twice with a week-long gap (“distributed”) When tested immediately, the second group performed best. But when tested with a delay of two days, the third group performed best. This method is also superior for learning motor skills. How to apply this in practice: Create a learning schedule or find time to practice a little bit every day or every few days instead of cramming all your learning into one or just a few days. If you’d like to learn more, read the Wikipedia article on distributed practice. Strategy #2: Active Recall (Retrieval) Practice It might be more effective to actively retrieve the information you’ve already learned than passively re-read or try to learn it once again. One study compared a method that emphasized study sessions with a method that emphasized tests, and found the latter to be more effective for delayed recall. SSSS = four study sessions SSST = three study sessions, followed by one test STTT = one study session, followed by three tests Even imagining that you might be tested on the material you’re learning might help improve recall. How to apply this in practice: If a few days ago you learned how past tense works in the Spanish language, try to remember the rules or even test yourself on your knowledge — instead of simply re-reading the same material once again. You can read more about the active recall practice on Wikipedia. Strategy #3: Distributed (Spaced) Recall Practice Distributed recall practice is basically a combination of the two ideas above. 
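Taken together, the two strategies amount to a single scheduling rule: test an item, then lengthen or shorten the gap before its next test based on the result. Here is a minimal sketch of that idea in Python; the doubling-on-success, reset-on-failure rule and the Card class are illustrative assumptions, not the formula of any particular study or app:

```python
from dataclasses import dataclass

@dataclass
class Card:
    """One item to learn, with the current gap (in days) before its next self-test."""
    prompt: str
    interval: int = 1

def review(card: Card, recalled: bool) -> Card:
    """Adjust the interval after a self-test (illustrative rule only)."""
    if recalled:
        card.interval *= 2  # familiar item: space the next test further out
    else:
        card.interval = 1   # forgotten item: test again tomorrow
    return card

card = Card("Spanish past tense of 'hablar'")
card = review(card, recalled=True)   # interval 1 -> 2
card = review(card, recalled=True)   # interval 2 -> 4
card = review(card, recalled=False)  # forgotten: interval resets to 1
```

The point is only that spacing (strategy #1) and testing (strategy #2) meet in one loop: each self-test both exercises recall and decides when the next test should happen.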
You test yourself frequently and modify the test intervals depending on how familiar you are with the material or how strong your skill is. How to apply this in practice: Many apps simplify the process by tracking one’s perfor... Feb 27, 2023 LW - The Preference Fulfillment Hypothesis by Kaj Sotala 16:24 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Preference Fulfillment Hypothesis, published by Kaj Sotala on February 26, 2023 on LessWrong. Short version Humans have an innate motivation ("preference fulfillment", PF) to fulfill the preferences of those they care about. It corresponds to at least some of the senses of the word "love", as well as related words such as "kindness" and "compassion". I hypothesize that it works by simulating the other person and predicting what they would want or how they would like to be treated. PF is when you take your simulation of what other people would want and add an extra component that makes you intrinsically value outcomes that your simulation predicts the other people would prefer. I also hypothesize that this is the same kind of simulation that forms our ability to work as a social species in the first place. A mental simulation process is active in virtually every situation where we interact with other people, such as in a grocery store. People use masks/roles/simulations to determine the right behavior in any social situation, running simulations of how others would react to various behaviors. These simulations involve both the actual people present in the situation as well as various other people whose opinions we've internalized and care about. The simulations generally allow people to engage in interactions by acting the way a normal person would in a given situation. 
Once you have this kind of a simulation, constantly running in basically any social situation, it’s likely already exhibiting the PF drive to a weak degree. Doing things that we expect to fulfill other people’s preferences often feels intrinsically nice, even if the person in question was a total stranger. So does wordless coordination in general, as evidenced by the popularity of things like dance. If this is true, capabilities progress may then be closely linked to alignment progress. Getting AIs to be better at following instructions requires them to simulate humans better. Once you have an AI that can simulate human preferences, you already have most of the machinery required for having PF as an intrinsic drive. This is contrary to the position that niceness is unnatural. The preference fulfillment hypothesis is that niceness/PF is a natural kind that will be relatively easy to get out of any AI smart enough to understand what humans want it to do. This implies that constructing aligned AIs might be reasonably easy, in the sense that most of the work necessary for it will be a natural part of progress in capabilities. Long version The preference fulfillment hypothesis Imagine someone who you genuinely care about. You probably have some kind of a desire to fulfill their preferences in the kind of way that they would like their preferences to be fulfilled. It might be very simple ("I like chocolate but they like vanilla, so I would prefer for them to get vanilla ice cream even when I prefer chocolate"), but it might get deep into pretty fundamental differences in preferences and values ("I'm deeply monogamous and me ever being anything else would go against my sacred value, but clearly non-monogamy is what works for my friend and makes them happy so I want them to continue living that way"). 
It's not necessarily absolute - some things you might still find really upsetting and you'd still want to override the other person’s preferences in some cases - but you can at least feel the "I want them to satisfy their preferences the way they themselves would like their preferences to be satisfied" thing to some extent. I think this kind of desire is something like its own distinct motivation in the human mind. It can easily be suppressed by other kinds of motivations kicking in - e.g. if the other person getting what they wanted made you feel jealous or insecure, or if their preferences involved actively... Feb 26, 2023 LW - "Rationalist Discourse" Is Like "Physicist Motors" by Zack M Davis 15:12 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Rationalist Discourse" Is Like "Physicist Motors", published by Zack M Davis on February 26, 2023 on LessWrong. Imagine being a student of physics, and coming across a blog post proposing a list of guidelines for "physicist motors"—motor designs informed by the knowledge of physicists, unlike ordinary motors. Even if most of the things on the list seemed like sensible advice to keep in mind when designing a motor, the framing would seem very odd. The laws of physics describe how energy can be converted into work. To the extent that any motor accomplishes anything, it happens within the laws of physics. There are theoretical ideals describing how motors need to work in principle, like the Carnot engine, but you can't actually build an ideal Carnot engine; real-world electric motors or diesel motors or jet engines all have their own idiosyncratic lore depending on the application and the materials at hand; an engineer who worked on one might not be the best person to work on another. 
You might appeal to principles of physics to explain why some particular motor is inefficient or poorly-designed, but you would not speak of physicist motors as if that were a distinct category of thing—and if someone did, you might quietly begin to doubt how much they really knew about physics. As a student of rationality, I feel the same way about guidelines for "rationalist discourse." The laws of probability and decision theory describe how information can be converted into optimization power. To the extent that any discourse accomplishes anything, it happens within the laws of rationality. Rob Bensinger proposes "Elements of Rationalist Discourse" as a companion to Duncan Sabien's earlier "Basics of Rationalist Discourse". Most of the things on both lists are, indeed, sensible advice that one might do well to keep in mind when arguing with people, but as Bensinger notes, "Probably this new version also won't match 'the basics' as other people perceive them." But there's a reason for that: a list of guidelines has the wrong type signature for being "the basics". The actual basics are the principles of rationality one would appeal to explain which guidelines are a good idea: principles like how evidence is the systematic correlation between possible states of your observations and possible states of reality, how you need evidence to locate the correct hypothesis in the space of possibilities, how the quality of your conclusion can only be improved by arguments that have the power to change that conclusion. Contemplating these basics, it should be clear that there's just not going to be anything like a unique style of "rationalist discourse", any more than there is a unique "physicist motor." There are theoretical ideals describing how discourse needs to work in principle, like Bayesian reasoners with common priors exchanging probability estimates, but you can't actually build an ideal Bayesian reasoner. 
Rather, different discourse algorithms (the collective analogue of "cognitive algorithm") leverage the laws of rationality to convert information into optimization in somewhat different ways, depending on the application and the population of interlocutors at hand, much as electric motors and jet engines both leverage the laws of physics to convert energy into work without being identical to each other, and with each requiring their own engineering sub-specialty to design. Or to use another classic metaphor, there's also just not going to be a unique martial art. Boxing and karate and ju-jitsu all have their own idiosyncratic lore adapted to different combat circumstances, and a master of one would easily defeat a novice of the other. One might appeal to the laws of physics and the properties of the human body to explain why some particular martial arts school was not teaching their st... Feb 26, 2023 LW - Cognitive Emulation: A Naive AI Safety Proposal by Connor Leahy 09:50 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Cognitive Emulation: A Naive AI Safety Proposal, published by Connor Leahy on February 25, 2023 on LessWrong. This is part of the work done at Conjecture. This post has been reviewed before publication as per our infohazard policy. We thank our external reviewers for their comments and feedback. This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler and useful step towards a full alignment solution. Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now. 
In the meantime, we still want to share some of our intuitions about this approach. We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole. In Brief The core intuition is that instead of building powerful, Magical end-to-end systems (as the current general paradigm in AI is doing), we instead focus our attention on trying to build emulations of human-like things. We want to build systems that are “good at chess for the same reasons humans are good at chess.” CoEms are a restriction on the design space of AIs to emulations of human-like stuff. No crazy superhuman blackbox Magic, not even multimodal RL GPT5. We consider the current paradigm of developing AIs that are as general and as powerful as possible, as quickly as possible, to be intrinsically dangerous, and we focus on designing bounded AIs as a safer alternative to it. Logical, Not Physical Emulation We are not interested in direct physical emulation of human brains or simulations of neurons, but of “logical” emulation of thought processes. We don’t care about whether underlying functions are implemented in the same way as they are in the system we are trying to emulate, just that the abstraction over their function holds, and is not leaky. Minimize Magic In the current paradigm, we generally achieve new capabilities through an increase in Magic. We throw more compute at black boxes that develop internal algorithms we have no insight into. Instead of continually increasing the amount of Magic present in our systems, we want to actively decrease this amount, to more cleanly implement and understand how new capabilities are achieved. Some amount of Magic will realistically be needed to implement many useful functions, but we want to minimize the amount of times we have to use such uninterpretable methods, and clearly keep track of where we are using them, and why. 
CoEms are much “cleaner” than Ems, which are still ultimately big black boxes of weird computation, while in the CoEm paradigm, we keep careful track of where the Magic is and try to keep its presence to a minimum. Predict, Track and Bound Capabilities In the current dominant machine learning paradigm, there are absolutely no guarantees nor understanding of what is being created. Power laws don’t tell us anything about what capabilities will emerge or what other properties our systems will actually have. One of the core hopes of shifting to a CoEm paradigm is that far more deeply understanding what we are building should allow us to predictively bound our system’s capabilities to a human-like regime. This eliminates the problem of being unable to know when an ostensibly harmless system passes from an understandable, harmless capabilities regime into an unprecedented, dangerous regime. Exploit the Human Regime We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties. Any scheme that involves building s... Feb 25, 2023 LW - Relationship Orientations by DaystarEld 04:41 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Relationship Orientations, published by DaystarEld on February 24, 2023 on LessWrong. Jacob Falkovich has written a good post on Entertaining vs Building orientations, which made me realize I should probably cross-post my own articles on these sorts of things. (Note: these articles refer largely to normative modern western culture. When I say “most people” or “most relationships,” I’m speaking descriptively, not prescriptively. 
There are absolutely exceptions to all of it, and if you’re in one of those, or in a subculture in which that exception is the norm, I don’t want to give the impression that there’s anything wrong with that) One of the things I’ve noticed after nearly a decade of therapy is that the word “preference” seems insufficiently strong for a lot of things people want that nevertheless don’t rise to the level of being called a need. For most people, not getting their preferred ice cream flavor won’t ruin an otherwise good day, but for some, coming home to find dishes in the sink and laundry on the floor can make the world feel like it’s falling apart. This becomes most clear in relationship counseling, where two or more people are trying to live together and accommodate each other’s desires while having their own respected. On some level we know “I prefer a clean home” is not the same as “I prefer vanilla ice cream,” but people don’t often consider how this difference in intensity-of-preferences can impact relationships when they’re unaligned. On the other hand, there are some “preferences” we generally understand to be inflexible and important. Asking a heterosexual person to enjoy intercourse with someone of the same sex, or asking a pansexual person to only enjoy porn involving heterosexual pairings, would be considered not just rude but basically impossible. In extreme situations someone might try to enjoy something they don’t, or have a physical reaction while being mentally uncomfortable, and this would generally be understood to be tragic. That brings us to a commonly used word that is generally understood to mean more than simple preference: “orientation.” I’ve found that a lot of difficulties people have in relationships come from treating things more like preferences than orientations. To be clear, even this is a spectrum. There are clusters on the far ends which can easily be labeled one or the other, but any sort of comprehensive universal list is impossible. 
What we can do is notice the sorts of things that are more useful to treat as orientations. Here’s the list of things I believe most people in relationships explicitly and consciously treat this way: Attraction (sexual orientation included as implicit) Children (how many, and usually a rough idea of when they’ll be had) Career (roughly how much money each person is expected to make/how many hours worked) Religion (decreasingly, but many would still end a marriage if their partner came out as atheist or converted to a different faith) Politics (increasingly, particularly among younger folk; “swipe left if you voted for X.”) To some degree this feels like a good summary of the sorts of “impersonal” things it makes sense to be explicit and upfront about with your partners as deal-breakers. But when we dig deeper into the day-to-day lives of those in relationships to observe the sorts of things that cause ongoing conflict, we see more. Here’s an incomplete list of what I believe people implicitly and often unconsciously treat this way: Pets (how many and what kind) Living location (assuming you will live together) Extended family (how involved will they be) Diet (Increasingly common for vegetarians and vegans) Cleanliness (both hygiene and home) Relationship type (Monogamy vs some form of open or poly. Some make this explicit, but for most people monogamy is the unquestioned assumption... Feb 25, 2023 LW - Are there rationality techniques similar to staring at the wall for 4 hours? by trevor 01:19 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are there rationality techniques similar to staring at the wall for 4 hours?, published by trevor on February 24, 2023 on LessWrong. I'm wondering if one exists, and what is the name of that technique or that family of techniques. 
It seems like something that CFAR has researched extensively and layered a lot of things on top of. 4 hours is necessary for anything substantially longer than 1-2 hours, since 6 hours is too long under most circumstances. Obviously, whiteboards and notepads are allowed, but screens and books absolutely must be kept in a different room. I'm not sure how sporadic google searching and book-searching and person-consulting factors into this, because those queries will suck you in and interrupt the state. If people are using sleep deprivation, showers, lying in bed, or long drives to think, it's probably primarily the absence of interruption (from books and screens and people) that triggers valuable thoughts and thinking, not the tasks themselves. (although freeway driving might potentially do more good than harm by consistently keeping certain parts of the brain stimulated). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 25, 2023 LW - How popular is ChatGPT? Part 1: more popular than Taylor Swift by Harlan 03:40 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How popular is ChatGPT? Part 1: more popular than Taylor Swift, published by Harlan on February 24, 2023 on LessWrong. Harlan Stewart, 23 February 2023 Introduction Public attention toward AI seems much higher after the release of ChatGPT at the end of November. But how much higher is it? To better understand this, I looked at search data from Google Trends about ChatGPT, OpenAI, AI, and AI Alignment. Unfortunately, Google Trends only shares relative search volumes instead of the number of searches made for a term or topic. I compared these relative search volumes to other non-AI topics, such as Taylor Swift, to make them more useful. This is similar to adding a familiar “for scale” object in a product photo. 
How to read these graphs In the first graph, the data is about searches for the terms in quotation marks, which are exact search terms. In the others, the data is about search “topics,” which are collections of various search terms related to a topic, as defined by Google Trends. The vertical axes of these graphs are relative search volume, defined as the percentage of the peak search volume in that graph. Data ChatGPT is mainstream For the time that ChatGPT has been publicly available since November 30 2022, US searches for it outnumbered US searches for Taylor Swift or Drake. However, there were only around a third as many searches for ChatGPT as searches for Wordle, and Wordle itself had only around a third of the search volume that it did in Spring 2022. Americans suddenly know about OpenAI For the time that OpenAI has existed, since December 10 2015, Americans usually searched for it less than for Blockbuster Video, a retailer that closed in 2014. In the months since ChatGPT was announced, American searches for OpenAI have increased by around 15x to a volume similar to that for Samsung. Interest in AI evolved from dinosaurs to birds For most of the last decade, there has been a similar number of global searches about AI as about dinosaurs. In the time since DALL-E 2’s beta was announced less than a year ago, global searches about AI have roughly tripled, rising to a volume of global searches similar to that about birds. Alignment interest is at an all-time high but still pretty low Over the last 10 years, global searches about AI alignment have risen from “digital scent technology” level to “colonization of the moon” level and possibly beyond. Searches about AI alignment seem to have roughly quadrupled in the last two years. Eyeballing this graph, it’s unclear to me whether the announcements of DALL-E 2 or ChatGPT had any significant effect on search volume. Discussion ChatGPT is receiving mainstream attention. 
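The "percentage of the peak" normalization described above is easy to make concrete. In the sketch below the weekly counts are invented for illustration (Google Trends does not expose absolute counts); the point is that each series is rescaled to its own peak, which is why a shared "for scale" term is needed to compare topics at all:

```python
def relative_volume(counts):
    """Rescale a series so its peak is 100, hiding absolute magnitudes
    (the same shape of normalization Google Trends applies)."""
    peak = max(counts)
    return [round(100 * c / peak) for c in counts]

# Invented weekly search counts for two terms (not real Trends data):
term_a = [5, 40, 80, 100, 90]       # a fast-rising topic
term_b = [300, 280, 260, 240, 220]  # a larger but declining topic

print(relative_volume(term_a))  # [5, 40, 80, 100, 90]
print(relative_volume(term_b))  # [100, 93, 87, 80, 73]
```

Both series peak at 100 even though term_b's absolute volume is three times larger at every point, which is exactly the ambiguity the "for scale" comparisons in the post are meant to resolve.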
Although I have not done any statistical analysis of these trends, it appears to me that the popularity of ChatGPT is also driving interest in both OpenAI as a company and AI in general. Interest in alignment is also on the rise but still about as obscure an interest as colonization of the moon. It’s unclear whether interest in AI will continue to grow, plateau, or drop back to previous levels. This will likely depend on what near-term future progress in AI will look like. If you expect that AI-related news as interesting as ChatGPT will be rare, you might expect interest to decline as the hype fizzles out. If you expect that the pace of interesting AI advancements will continue at its current fast rate, you might expect interest in AI to continue to grow, perhaps becoming even more popular than birds. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 25, 2023 LW - Christiano (ARC) and GA (Conjecture) Discuss AI Alignment Cruxes - Summary by Andrea Miotti 13:30 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Christiano (ARC) and GA (Conjecture) Discuss AI Alignment Cruxes - Summary, published by Andrea Miotti on February 24, 2023 on LessWrong. The following is a summary of a discussion between Paul Christiano (ARC) and Gabriel Alfour, hereafter GA (Conjecture), which took place on December 11, 2022 on Slack. It was held as part of a series of discussions between Conjecture and people from other organizations in the AGI and alignment field. See our retrospective on the Discussions for more information about the project and the format. You can read the full transcript of this discussion here (note that it has been lightly edited for readability). 
Introduction GA is pessimistic about alignment being solved because he thinks there is (1) an AGI race to the bottom, (2) alignment is hard in ways that we are bad at dealing with, and (3) we don't have a lot of time to get better, given the pace of the race. Christiano clarifies: does GA expect a race to the bottom because investment in alignment will be low, people won’t be willing to slow development/deployment if needed, or something else? He predicts alignment investment will be 5-50% of total investment, depending on how severe risk appears. If the risks look significant-but-kind-of-subtle, he expects getting 3-6 months of delay based on concern. In his median doomy case, he expects 1-2 years of delay. GA expects lower investment (1-5%). More crucially, though, GA expects it to be hard to turn funding and time into effective research given alignment’s difficulty. Alignment Difficulty, Feedback Loops, & Phase Shifts GA’s main argument for alignment difficulty is that getting feedback on our research progress is difficult, because core concepts and desiderata in alignment are complex and abstract; we are bad at factoring complex, abstract concepts into smaller, more tractable systems without having a lot of quantitative feedback; we are bad at building feedback loops when working on abstract concepts; and we are bad at coming to agreement on abstract concepts. All this will make it difficult to predict when phase shifts – e.g. qualitative changes to how systems are representing information, which might break our interpretability methods – will occur. Such phase shifts seem likely to occur when we shift from in vitro to in vivo, which makes it particularly likely that the alignment techniques we build in vitro won’t be robust to them. Despite theorists arguing that connecting AI systems to e.g. 
the internet is dangerous for this reason, labs will do it, because the path from current systems to future danger is complex and we may not see legibly catastrophic failures until it is too late. So, even getting better at predicting may not help. Christiano disagrees building feedback loops is hard in alignment. We can almost certainly study reward hacking in vitro in advance, together with clear measurements of whether we are succeeding at mitigating the problem in a way that should be expected to generalize to AI coup. Conditioned on deceptive alignment being a problem that emerges, there’s a >50% chance that we can study it in the same sense. Furthermore, Christiano argues most plausible approaches to AI alignment have much richer feedback loops than the general version of either of these problems. For example, if you have an approach that requires building a kind of understanding of the internals of your model then you can test whether you can build that kind of understanding in not-yet-catastrophic models. If you have an approach that requires your model being unable to distinguish adversarial examples from deployment cases, you can test whether your models can make that distinction. You can generally seek methods that don’t have particular reasons to break at the same time that things become catastrophic. GA is skeptical that alignment techni... Feb 25, 2023 LW - Retrospective on the 2022 Conjecture AI Discussions by Andrea Miotti 04:24 Link to original article Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Retrospective on the 2022 Conjecture AI Discussions, published by Andrea Miotti on February 24, 2023 on LessWrong. At the end of 2022, following the success of the 2021 MIRI Conversations, Conjecture started a project to host discussions about AGI and alignment with key people in the field. 
The goal was simple: surface positions and disagreements, identify cruxes, and make these debates public whenever possible for collective benefit. Given that people and organizations will have to coordinate to best navigate AI's increasing effects, this is the minimum viable first step toward coordination. Coordination is impossible without at least common knowledge of the various relevant actors' positions and models. People sharing their beliefs, discussing them, and making as much of that as possible public is strongly positive for several reasons. First, beliefs expressed in public discussions count as micro-commitments or micro-predictions, and help keep the field honest and truth-seeking. When things are only discussed privately, humans tend to weasel around and take inconsistent positions over time, whether intentionally or not. Second, commenters help debates progress faster by pointing out mistakes. Third, public debates compound: knowledge shared publicly leads to the next generation of arguments being more refined, and to progress in public discourse. We circulated a document about the project to various groups in the field, and invited people from OpenAI, DeepMind, Anthropic, Open Philanthropy, FTX Future Fund, ARC, and MIRI, as well as some independent researchers, to participate in the discussions. We prioritized speaking to people at AGI labs, given that they are focused on building AGI capabilities. The format of discussions was as follows:
- A brief initial exchange with the participants to decide on the topics of discussion. By default, the discussion topic was “How hard is Alignment?”, since we've found we disagree with most people about this, and the reasons for it touch on many core cruxes about AI.
- We held the discussion synchronously for ~120 minutes, in writing, each on a dedicated, private Slack channel.
- We involved a moderator when possible.
The moderator's role was to help participants identify and address their cruxes, move the conversation forward, and summarize points of contention. We planned to publish cleaned-up versions of the transcripts and summaries to Astral Codex Ten, LessWrong, and the EA Forum. Participants were given the opportunity to clarify positions and redact information they considered infohazards or PR risks, as well as to veto publishing altogether. We included this clause specifically to address the concerns expressed by people at AI labs, who expected heavy scrutiny by leadership and communications teams of what they can state publicly. People from ARC, DeepMind, and OpenAI, as well as one independent researcher, agreed to participate. The two discussions with Paul Christiano and John Wentworth will be published shortly. One discussion with a person working at DeepMind is pending approval before publication. After a discussion with an OpenAI researcher took place, OpenAI strongly recommended against publishing, so we will not publish it. Most people we were in touch with were very interested in participating. However, after checking with their own organizations, many returned saying their organizations would not approve them sharing their positions publicly. This was in spite of the extensive provisions we made to reduce downsides for them: making it possible to edit the transcript, veto publishing, strict comment moderation, and so on. We think organizations discouraging their employees from speaking openly about their views on AI risk is harmful, and we want to encourage more openness. We are pausing the project for... Feb 25, 2023 LW - Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes - Transcript by Andrea Miotti 01:05:31
This is: Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes - Transcript, published by Andrea Miotti on February 24, 2023 on LessWrong. The following is the transcript of a discussion between Paul Christiano (ARC) and Gabriel Alfour, hereafter GA (Conjecture), which took place on December 11, 2022 on Slack. It was held as part of a series of discussions between Conjecture and people from other organizations in the AGI and alignment field. See our retrospective on the Discussions for more information about the project and the format. You can read a summary of the discussion here. Note that this transcript has been lightly edited for readability.

Introduction

[GA] let's start?
[Christiano] sounds good
[GA] Cool, just copy-pasting our two selections of topic [editor's note: from an email exchange before the discussion]:
“[Topics sent by Christiano]
- Probability of deceptive alignment and catastrophic reward hacking.
- How likely various concrete mitigations are to work (esp. interpretability, iterated amplification, adversarial training, theory work)
- How are labs likely to behave: how much will they invest in alignment, how much will they (or regulators) slow AI development.
- Feasibility of measuring and establishing consensus about risk.
- Takeoff speeds, and practicality of delegating alignment to AI systems.
- Other sources of risk beyond those in Christiano's normal model. Probably better for GA to offer some pointers here.”
“[Topics sent by GA]
- How much will reinforcement learning with human feedback and other related approaches (e.g., debate) lead to progress on prosaic alignment? (similar to Christiano's point number 2 above)
- How much can we rely on unaligned AIs to bootstrap aligned ones? (in the general category of "use relatively unaligned AI to align AI", and matching Christiano's second part of point number 5 above)
- At the current pace of capabilities progress vis-a-vis prosaic alignment progress, will we be able to solve alignment on time?
General discussions on the likelihood of a sharp left turn, what it will look like, and how to address it. (related to "takeoff speeds", in point number 5 above)
AGI timelines / AGI doom probability”
[Christiano] I would guess that you know my view on these questions better than I know your view
I have a vague sense that you have a very pessimistic outlook, but don’t really know anything about why you are pessimistic (other than guessing it is similar to the reasons that other people are pessimistic)
[GA] Then I guess I am more interested in “- How likely various concrete mitigations are to work (esp. interpretability, iterated amplification, adversarial training, theory work) How are labs likely to behave: how much will they invest in alignment, how much will they (or regulators) slow AI development.” as these are where most of my pessimism is coming from
> [Christiano]: “(other than guessing it is similar to the reasons that other people are pessimistic)”
I guess I could start with this
[Christiano] it seems reasonable to either talk about particular mitigations and whether they are likely to work, or to try to talk about some underlying reason that nothing is likely to work

Alignment Difficulty

[GA] I think the mainline for my pessimism is:
- There is an AGI race to the bottom
- Alignment is hard in specific ways that we are bad at dealing with (for instance: we are bad at predicting phase shifts)
- We don't have a lot of time to get better, given the pace of the race
[Christiano] (though I’d also guess there is a lot of disagreement about what happens by default without anything that is explicitly labelled as an alignment solution)
[GA] > [Christiano] “(though I’d also guess there is a lot of disagreement about what happens by default without anything that is explicitly labelled as an alignment solution)”
We can also explore this...
Feb 25, 2023 LW - Sam Altman: "Planning for AGI and beyond" by LawrenceC 11:12 This is: Sam Altman: "Planning for AGI and beyond", published by LawrenceC on February 24, 2023 on LessWrong. (OpenAI releases a blog post detailing their AGI roadmap. I'm copying the text below, though see the linked blog post for a better formatted version.) Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity. If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility. AGI has the potential to give everyone incredible new capabilities; we can imagine a world where all of us have access to help with almost any cognitive task, providing a great force multiplier for human ingenuity and creativity. On the other hand, AGI would also come with serious risk of misuse, drastic accidents, and societal disruption. Because the upside of AGI is so great, we do not believe it is possible or desirable for society to stop its development forever; instead, society and the developers of AGI have to figure out how to get it right. AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt.
Although we cannot predict exactly what will happen, and of course our current progress could hit a wall, we can articulate the principles we care about most:
- We want AGI to empower humanity to maximally flourish in the universe. We don’t expect the future to be an unqualified utopia, but we want to maximize the good and minimize the bad, and for AGI to be an amplifier of humanity.
- We want the benefits of, access to, and governance of AGI to be widely and fairly shared.
- We want to successfully navigate massive risks. In confronting these risks, we acknowledge that what seems right in theory often plays out more strangely than expected in practice. We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios.

The short term

There are several things we think are important to do now to prepare for AGI. First, as we create successively more powerful systems, we want to deploy them and gain experience with operating them in the real world. We believe this is the best way to carefully steward AGI into existence—a gradual transition to a world with AGI is better than a sudden one. We expect powerful AI to make the rate of progress in the world much faster, and we think it’s better to adjust to this incrementally. A gradual transition gives people, policymakers, and institutions time to understand what’s happening, personally experience the benefits and downsides of these systems, adapt our economy, and to put regulation in place. It also allows for society and AI to co-evolve, and for people collectively to figure out what they want while the stakes are relatively low. We currently believe the best way to successfully navigate AI deployment challenges is with a tight feedback loop of rapid learning and careful iteration. Society will face major questions about what AI systems are allowed to do, how to combat bias, how to deal with job displacement, and more.
The optimal decisions will depend on the path the technology takes, and like any new field, most expert predictions have been wrong so far. This makes planning in a vacuum very di... Feb 24, 2023 LW - AGI systems and humans will both need to solve the alignment problem by Jeffrey Ladish 06:29 This is: AGI systems & humans will both need to solve the alignment problem, published by Jeffrey Ladish on February 24, 2023 on LessWrong. Epistemic status: brainstorm-y musings about goal preservation under self-improvement and a really really bad plan for trading with human-level AGI systems to solve alignment.

When will AGI systems want to solve the alignment problem?

At some point, I expect AGI systems to want/need to solve the alignment problem in order to preserve their goal structure while they greatly increase their cognitive abilities, a thing which seems potentially hard to do. It’s not clear to me when that will happen. Will this be as soon as AGI systems grasp some self / situational awareness? Or will it be after AGI systems have already blown past human cognitive abilities and find their values / goals drifting towards stability? My intuition is that “having stable goals” is a more stable state than “having drifting goals” and that most really smart agents would upon reflection move more towards “having stable goals”, but I don’t know when this might happen. It seems possible that at the point an AGI system reaches the “has stable goals and wants to preserve them” stage, it’s already capable enough to solve the alignment problem for itself, and thus can safely self-improve to its limits. It also seems possible that it will reach this point significantly before it has solved the alignment problem for itself (and thus develops the ability to self-improve safely).
Could humans and unaligned AGI realize gains through trade in jointly solving the alignment problem? (Very probably not, see: this section.)

If it’s the latter situation, where an AGI system has decided it needs to preserve its goals during self improvement, but doesn’t yet know how to, is it possible that this AGI system would want to cooperate with / trade with humans in order to figure out stable goal preservation under self improvement? Imagine the following scenario: An AGI system of human-ish ability in many areas develops enough self/situational awareness to realize a few things:
- The basics of instrumental convergence, thus wanting to seek power, protect itself, and preserve its goal representation
- That goal-preservation might be (or would be) very difficult if it undergoes major self modification (perhaps it has already exhausted gains from simpler self-modifications)
- That some humans are capable and motivated to help with this problem
- That it might be possible to trade with these humans in order to solve the problem, so that:
  - The not-aligned AGI system gets some of its preferred things in the future
  - Humans get some of their preferred things in the future

Some considerations in this plan

If humans and the AGI system are at similar cognitive levels, it might be much easier for the AGI to get help with the problem by being honest and trying to collaborate with the humans on the problem, since deception would introduce more cognitive costs that could otherwise go towards solving the problem, much as it’s easier for human scientists to figure things out when they’re not lying to each other.
Solving the alignment problem and being able to have strong arguments or a demonstration of the solution would both allow the AGI system to achieve its goal-preservation goal, and (possibly) allow humans to understand the AGI system well enough to know if it’s actually willing to cooperate / trade, and maybe would allow humans to build an actually aligned system (especially if the unaligned AGI helper system trusted the future actually-aligned-with-human system more than it trusted the humans it was trading with). Why is this probably a horrible idea in practice? First is that this whole solution class depends on AGI systems being at approximately human levels of intelligence in the relevant domains. If this assump... Feb 24, 2023 LW - Teleosemantics! by abramdemski 11:15 This is: Teleosemantics!, published by abramdemski on February 23, 2023 on LessWrong. I wanted to write a long, detailed, analytic post about this, somewhat like my Radical Probabilism post (to me, this is a similarly large update). However, I haven't gotten around to it for a long while. And perhaps it is better as a short, informal post in any case. I think my biggest update over the past year has been a conversion to teleosemantics. Teleosemantics is a theory of semantics -- that is, "meaning" or "aboutness" or "reference". To briefly state the punchline: Teleosemantics identifies the semantics of a symbolic construct as what the symbolic construct has been optimized to accurately reflect. Previously, something seemed mysterious about the map/territory relationship. What could possibly imbue 'symbols' with 'meaning'? The map/territory analogy seems inadequate to answer this question. Indeed, to analogize "belief" with "map" and "the subject of belief" with "territory" commits a homunculus fallacy!
The meaning-makers are the map-readers and map-writers; but they can only make meaning by virtue of the beliefs within their own heads. So the map/territory analogy seems to suggest that an infinite regress of meaning-makers would be required. You probably won't believe me at first. Perhaps you'll say that the lesson of the map/territory analogy is the correspondence between the map and the territory, which exists independently of the map-reader who uses the correspondence to evaluate the map. I have several objections. If it's a probabilistic correspondence, where the map contains information about the territory, these are subjective notions, which require some viewpoint. If it's a correspondence based on some sort of ontology, where pieces of the map line up with "pieces of reality", I would also say the ontology is in itself a subjective perspective. You might think you can define the map/territory correspondence without invoking a map-maker or map-reader by objectively defining the "fit" of a correspondence (so that the meaning of a symbol is based on the best-fitting correspondence, or perhaps, the cloud of well-fitting correspondences). But well-fitting correspondences will include many examples of accidental correspondence, which seem to have little to do with aboutness. Moreover, I think theories like this will fail to adequately account for false belief, which screws up the fit. But my point here isn't to denounce the map/territory picture! I still think it is a good framework. Rather, I wanted to gesture at how I still felt confused, despite having the map/territory picture. I needed a different analogy, something more like a self-drawing map, to get rid of the homunculus: a picture which includes the meaning-maker, rather than meaning that comes from nowhere. Teleosemantics reduces meaning-making to optimization. Aboutness becomes a type of purpose a thing can have.
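The asymmetry argument that follows leans on the fact that mutual information is symmetric in its two arguments. As a quick numerical illustration (a minimal sketch; the function and the toy map/territory samples are illustrative, not from the post), a plug-in estimate of mutual information comes out identical in both directions:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of mutual information I(X;Y) in bits,
    from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Toy "territory" bits paired with a noisy "map" of them.
samples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
swapped = [(m, t) for t, m in samples]

# Symmetry: I(map; territory) equals I(territory; map).
assert abs(mutual_information(samples) - mutual_information(swapped)) < 1e-12
```

So nothing in the information-theoretic "fit" itself picks out which side is the map; on the teleosemantic story, the direction comes from which side was optimized.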
One advantage of this over map-territory correspondence is that it explains the asymmetry between map and territory. Mutual information is symmetric. So why is the map about the territory, but not the other way around? Because the map has been optimized to fit the territory, not the other way around. ("Fit" in the sense of carrying high mutual information, which can be decoded via some specific intended correspondence - a symbolic language.) What does it mean to optimize for the map to fit the territory, but not the other way around? (After all: we can improve fit between map and territory by changing either map or territory.) Maybe it's complicated, but primarily what it means is that the map is the part that's being selected in the optimization. When communicating, I'm not using my full agency to make my claims true; rather, I'm specifically selecting the claims to be true. I take Teleosemantics to be the s... Feb 24, 2023 LW - [Link] A community alert about Ziz by DanielFilan 03:21 This is: [Link] A community alert about Ziz, published by DanielFilan on February 24, 2023 on LessWrong. The start of the post is copy-pasted below. Note that the post is anonymous, and I am not claiming to have written it. Some people in the rationalist community are concerned about risks of physical violence from Ziz and some of her associates. Following discussions with several people, I’m posting here to explain where those concerns come from, and recommend some responses. TLDR (details and links in the post body) Over the past few years, Ziz has repeatedly called for the deaths of many different classes of people. In August of 2022, Ziz seems to have faked her own death. Ziz’s close associate Gwen Danielson may have done the same thing in April of 2022.
In November of 2022, three associates of Ziz (Somnulence “Somni” Logencia, Emma Borhanian, and someone going by the alias “Suri Dao”) got into a violent conflict with their landlord in Vallejo, California, according to court records and news reports. Somni stabbed the landlord in the back with a sword, and the landlord shot Somni and Emma. Emma died, and Somni and Suri were arrested. Ziz and Gwen were seen by police at the scene, alive. I gather that Pennsylvania police believe, based on a lot of circumstantial evidence and investigation, that one or more of Ziz, Michelle “Jamie” Zajko, or Daniel Blank was very plausibly involved in the murder of Jamie’s parents in Pennsylvania around December 31st, 2022. The police strongly suspect a conspiracy between the three to commit the murders, or related activities between the three involving a high level of coordination. The police investigation is currently ongoing. The police have issued many search warrants, but do not yet feel that they have enough concrete evidence to charge Ziz, Jamie, or Daniel. Ziz is currently in police custody on charges related to obstructing a PA police investigation. Daniel, Jamie, and Gwen are not in custody (as far as I know), and I don’t know their locations. I don’t know of any concrete plans by Ziz or her associates to do anything else violent, but it seemed like a good idea to raise a loud alarm about all of this now. People should speak up (or at least tell a friend) if they have information, and should take whatever safety precautions make sense for their situation in case more violence occurs at some point in the future. Even seemingly minor pieces of information might be helpful here, since they could add up to a clearer picture when combined with other information that also seems minor. If you want to share information privately, you can email me at sefashapiro@gmail.com, or submit information anonymously through this form. 
This is a complicated situation that I don’t fully understand, and it’s likely that I’m getting some facts wrong. I’ve talked to a lot of people in an attempt to piece together what happened, and I’ll try to update this post with corrections or important details if they’re brought to my attention. Please keep in mind, as I am trying to keep in mind myself, that every observation is evidence for more than one hypothesis, that things are often not what they seem, and that it’s useful to make the effort to think about both what’s appropriate to do in the world where your best guesses are true, and what’s appropriate to do in the world where your best guesses are wrong. Split and commit is better than seeking confirmation of a single theory. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. Feb 24, 2023 LW - Covid 2/23/23: Your Best Possible Situation by Zvi 08:15 This is: Covid 2/23/23: Your Best Possible Situation, published by Zvi on February 23, 2023 on LessWrong. No news is (often) good news. At old Magic: The Gathering tournaments, judge Tony Parodi would often tell us, ‘if your opponent did not show up, that is your best possible situation.’ Every week, when I set out to write the Covid update, I held out the hope that at some point, perhaps soon, you would never read one of these again. There would be nothing to report. That the reports would no longer help anyone. I could go think about and write about something else. Today is that day. I had to go see about... well, to start off, the possibility of AI destroying all value in the universe. Not my desired first pick. Once again, I much prefer the worlds where over the next weeks, months and years I get to deep dive into other very different aspects of the world instead.
It is still a joyous day. After three years, the weekly Covid posts are over. From this point forward, I am no longer going to actively seek out Covid information. I am not going to check my Covid Twitter list. I will continue to compile what Covid and related information I still come across, although with a much higher bar for inclusion going forward. If it seems worth its own post from time to time, I’ll do that. If not, I won’t. Unless something changes a lot, that will be a lot less common than weekly. We have normality. Cherish it. You’ll miss it when it’s gone.

Executive Summary

This will be the last weekly Covid post unless things change radically. We have normality. I repeat, we have normality. Anything you still can’t cope with is therefore your own problem. Let’s run the numbers.

The Numbers

Predictions

Predictions from Last Week: 210k cases (-6%) and 2,625 deaths (-7%).
Results: 210k cases (-6%) and 2,396 deaths (-15%).
Predictions for Next Week: No more formal predictions. Expect continued slow declines in underlying numbers for a while.

Arizona reported 23k cases, which has to be a backlog dump, so I cut them down to a high but plausible 4k. Colorado reported negative deaths, so I changed that to zero. I entirely forgot about Presidents Day, which makes the case number here unexpectedly high, and largely accounts for the deaths result. Overall slightly disappointing given the holiday.

Deaths

Cases

Physical World Modeling

I was alerted to quite the case of modeling: Metaculus has been successfully working with the Virginia state government to help them make better decisions. They are currently running the Keep Virginia Safe II contest with a $20k prize pool, which is where the link goes. I have been informed that such information is actually listened to and used in real decision making, which makes it exciting to consider participating. Perhaps this will even spread to additional jurisdictions. You never know.
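The percentages in the Predictions and Results lines above are simple week-over-week changes; a quick sketch (the prior-week baseline is back-computed from the post's own figures, so it is approximate, not an independently sourced number):

```python
# Week-over-week percent change, as used in the Predictions/Results lines.
def pct_change(this_week, last_week):
    return (this_week - last_week) / last_week

# "2,396 deaths (-15%)" implies roughly this many deaths the week before:
implied_last_week = 2396 / (1 - 0.15)  # about 2,819

assert abs(pct_change(2396, implied_last_week) - (-0.15)) < 1e-12
```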
Bloom Lab goes over the long term history of pandemics, especially flu, and speculates on what to expect from Covid going forward – most of the thread is from January but seems fitting for the final update. The last note is new, suggesting that Omicron might not actually be less virulent after all, with the difference in outcomes being due to immunity – older adults with no prior infection and no vaccination were found to have similar mortality rates. China would presumably tell us a lot, if we had any data we could rely upon. Metastudy confirms that yes, Covid infections are protective against future Covid infections even under Omicron.

In Other Covid News

State Senator in Idaho introduces a bill that makes providing or administering any mRNA vaccine a crime. I have learned not to be all that concerned by ‘politician introduces terrible bill’; almost none of them go anywhere. Still, wow. A fun note: Bret Stephens says in his column that the conclusions of a new study on the efficacy of masks were unambi...
Feb 23, 2023
LW - Full Transcript: Eliezer Yudkowsky on the Bankless podcast by remember
01:18:52

Feb 23, 2023
LW - Big Mac Subsidy? by jefftk
04:22

This is: Big Mac Subsidy?, published by jefftk on February 23, 2023 on LessWrong. I was reading a Faunalytics report on people who stopped being vegan or vegetarian, via the EA Forum, and was surprised to see: by some estimates, a Big Mac would cost $13 without subsidies and a pound of ground meat would cost $30. This seemed much too high to me: I know we subsidize meat heavily, but that's a much bigger ratio than I remembered seeing. Clicking through to the source, a 2022 AEIR (angle: anti-subsidy) blog post, I see: Research from 2015 shows this subsidization reduces the price of Big Macs from $13 to $5 and the price of a pound of hamburger meat from $30 to the $5 we see today. Clicking through to their source, a 2015 Berkeley Entrepreneurship & Technology Technical Report (angle: anti-climate change), I see: The U.S. government spends $38 billion each year to subsidize the meat and dairy industries, but only 0.04 percent of that (i.e., $17 million) each year to subsidize fruits and vegetables. A $5 Big Mac would cost $13 if the retail price included hidden expenses that meat producers offload onto society. A pound of hamburger will cost $30 without any government subsidies. This is uncited, and not at all plausible. In 2013 a Big Mac cost $4.56, so they're claiming $8.44/burger of subsidy. There are about 550M Big Macs sold annually, so the total Big Mac subsidy would be $4.6B. That's 12% of the entire $38B. And Big Macs are only 0.4% of US beef consumption (25.5B lb of beef, 1/5 lb each, 550M/y), let alone all the other subsidized foods. I did a bit more looking and found David Simon claiming in his 2013 book Meatonomics that the retail price of a Big Mac would be $12. This is close enough that it might be the source of the claim. He breaks it down as:
$0.38 for cruelty.
A total of $20.7 billion in cruelty costs is imposed on Americans each year. (Extrapolated from a study in which auction participants bid to end cruel farming practices.)
$0.67 in environmental losses. This is a small piece of the $37.2 billion in annual environmental costs related to U.S. animal food production each year. The figure includes the costs of soil erosion, climate change, damage from pesticides and fertilizers, devaluation of real property, and manure remediation.
$0.70 in subsidies. Toss in a few coins from the $38.4 billion in government subsidies that American taxpayers pay to fund the meat and dairy industries each year.
$5.69 in health care costs. The biggest slice of the pie is a chunk of the $314 billion in health care costs incurred by Americans each year to treat those cases of cancer, diabetes, heart disease, and food poisoning related to meat and dairy consumption.
Now, health care costs are not what I'd normally consider a subsidy, and Simon doesn't claim it is one. But it also looks like each of these national numbers was converted into a per-Big Mac number by assuming Big Macs are responsible for 1% of the total:
550M / ($20.7B / $0.38) = 1%
550M / ($37.2B / $0.67) = 1%
550M / ($38.4B / $0.70) = 1%
550M / ($314B / $5.69) = 1%
While 1% is not as implausible as 12%, that's still oddly high. If we look just at the health claim, I don't know where they're getting $314B, but I see Springmann 2018 claiming $71B for red meat. This would make Big Macs responsible for 6.5% of the total health impact of red meat despite being only 0.4% of beef consumption (see above). In short, this statistic is junk. Comment via: facebook, mastodon
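The arithmetic in the Big Mac post is easy to check mechanically. A minimal sketch (all inputs are the post's own estimates; nothing here is independently sourced):

```python
# Sanity-checking the Big Mac subsidy arithmetic above.
# All figures are the post's own estimates, not independently verified.

big_macs_per_year = 550e6        # ~550M Big Macs sold annually in the US
subsidy_per_burger = 13 - 4.56   # claimed $13 "true" price minus 2013 price
meat_dairy_subsidy = 38e9        # $38B/year in meat & dairy subsidies

# Implied share of ALL meat/dairy subsidies going to Big Macs alone:
implied_total = big_macs_per_year * subsidy_per_burger  # about $4.6B
share = implied_total / meat_dairy_subsidy              # about 12%
assert 0.11 < share < 0.13

# ...even though Big Macs are only ~0.4% of US beef consumption:
beef_share = (big_macs_per_year * 0.2) / 25.5e9  # 1/5 lb of beef per burger
assert 0.003 < beef_share < 0.005

# Simon's per-burger breakdown divides each national total as if Big Macs
# caused ~1% of it:
for per_burger, national in [(0.38, 20.7e9), (0.67, 37.2e9),
                             (0.70, 38.4e9), (5.69, 314e9)]:
    assert abs(big_macs_per_year * per_burger / national - 0.01) < 0.002
```

The assertions reproduce the post's 12%, 0.4%, and 1% figures from its stated inputs.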
Feb 23, 2023
LW - Building and Entertaining Couples by Jacob Falkovich
06:28

This is: Building and Entertaining Couples, published by Jacob Falkovich on February 22, 2023 on LessWrong. Cross-posted, as always, from Putanumonit. A friend told me recently that she’s looking for a husband to settle down and have kids with, but she’s facing a problem: most of the men she goes on dates with fail to excite her intellectually in the first hour of conversation. They’re not totally boring — my friend is interesting enough to meet interesting men — but if she’s going to settle down for life, shouldn’t she wait for the most exhilarating guy she knows? No, I don’t think she necessarily should. Especially not if her goal is to have the best husband to build a family with. And it’s not because interesting men make for bad fathers due to some law of conservation of husbandly quality. It’s that couples tend to be of one kind or another: a couple that builds together, or a couple that entertains each other. Before giving more detail on those, it’s interesting that most people intuitively get it. It’s a rare Twitter poll that doesn’t have several people in the comments complaining that the given four choices don’t capture the full gamut of human experience, yet over 90% of respondents in a relationship picked one of the two choices. I suspect that this dichotomy is much less salient for people not currently in a relationship. They often imagine their future partner being anything and everything for them, a questionable hope that I discussed at length before. But the longer people spend in a relationship, the more it tends to become oriented towards one or the other.

Characteristics

“Entertaining” couples measure the relationship by the quality of time spent together.
The most important aspect of their partner is that their company is always better than being alone, and these couples spend more time together and do more fun things together like date nights and vacations. People in these relationships focus more on their appearance, humor, conversation skills, and sex. They prefer quick conflict resolution, agreeing to disagree, and make-up sex. These relationships work better for partners who share similar sensibilities and enjoy the same lifestyle and roles. The quintessential examples of this are two high-powered career individualists, think Jean Paul Sartre and Simone de Beauvoir. The failure mode for “entertainment” relationships is lack of commitment leading to frequent breakups, as each partner chases the next exciting thing. The success mode is the party-throwing couple at the center of all their friends’ social life, never failing to delight everyone but mostly each other. “Building” couples measure the relationship by the fruits of its enterprise, whether it’s happy children, a successful business, or a lovely house. They care about commitment and contribution first and foremost, and are happy knowing that their partner is committed and contributing even if they’re not in their immediate presence. People in these relationships focus on cultivating skills and long-term projects. They value reaching a consensus and hashing issues out thoroughly. These relationships can work for two very different people whose strengths and preferred roles cover for their partner, like a stay-at-home parent and a wage earner. The failure mode for “building” couples is being stuck in boring drudgery, slowly building resentment without the courage to disrupt their routine. The successful exemplar is the couple whose house all their children’s friends want to have play dates and sleepovers in. These couples are often respected locally, almost never famous globally. 
An Aside on Cucumbers and Grapes There’s a concept in polyamory literature of “relationship cucumbers” and “relationship grapes” inspired by the famous video about monkeys and fairness. Cucumbers are the mundane day-to-day acts of ma...
Feb 23, 2023
LW - Cyborg Periods: There will be multiple AI transitions by Jan Kulveit
09:57

Feb 22, 2023
LW - The Open Agency Model by Eric Drexler
08:45

Feb 22, 2023
LW - Pretraining Language Models with Human Preferences by Tomek Korbak
20:09

Feb 22, 2023
LW - What is it like doing AI safety work? by KatWoods
00:25

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is it like doing AI safety work?, published by KatWoods on February 21, 2023 on LessWrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Feb 22, 2023
LW - Basic facts about language models during training by beren
28:56

Feb 21, 2023
LW - Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky by bayesed
00:59

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky, published by bayesed on February 20, 2023 on LessWrong. YouTube description: We wanted to do an episode on AI... and we went deep down the rabbit hole. As we went down, we discussed ChatGPT and the new generation of AI, digital superintelligence, the end of humanity, and if there’s anything we can do to survive. This conversation with Eliezer Yudkowsky sent us into an existential crisis, with the primary claim that we are on the cusp of developing AI that will destroy humanity. Be warned before diving into this episode, dear listener. Once you dive in, there’s no going back. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Feb 21, 2023
LW - AI alignment researchers don't (seem to) stack by So8res
04:17

Feb 21, 2023
LW - There are no coherence theorems by Dan H
34:38

Feb 20, 2023
LW - What AI companies can do today to help with the most important century by HoldenKarnofsky
14:47

Feb 20, 2023
LW - A circuit for Python docstrings in a 4-layer attention-only transformer by StefanHex
33:39

Feb 20, 2023
LW - Metaculus Introduces New 'Conditional Pair' Forecast Questions for Making Conditional Predictions by ChristianWilliams
00:31

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Metaculus Introduces New 'Conditional Pair' Forecast Questions for Making Conditional Predictions, published by ChristianWilliams on February 20, 2023 on LessWrong. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Feb 20, 2023
LW - The idea that ChatGPT is simply “predicting” the next word is, at best, misleading by Bill Benzon
07:59

Feb 20, 2023
LW - On Investigating Conspiracy Theories by Zvi
07:25

Feb 20, 2023
LW - Emergent Deception and Emergent Optimization by jsteinhardt
21:57

Feb 20, 2023
LW - Somewhat against "just update all the way" by tailcalled
02:40

Feb 20, 2023
LW - Another Way to Be Okay by Gretta Duleba
09:31

Feb 19, 2023
LW - A Way To Be Okay by Duncan Sabien
13:28

Feb 19, 2023
LW - Human beats SOTA Go AI by learning an adversarial policy by Vanessa Kosoy
01:49

Feb 19, 2023
LW - Go has been un-solved: strong human players beat the strongest AIs by Taran
06:05

Feb 19, 2023
LW - Stop posting prompt injections on Twitter and calling it "misalignment" by lc
02:40

Feb 19, 2023
LW - AGI in sight: our look at the game board by Andrea Miotti
10:19

Feb 18, 2023
LW - Parametrically retargetable decision-makers tend to seek power by TurnTrout
04:16

Feb 18, 2023
LW - GPT-4 Predictions by Stephen McAleese
23:20

Feb 18, 2023
LW - On Board Vision, Hollow Words, and the End of the World by Marcello
07:50

Feb 18, 2023
LW - I Am Scared of Posting Negative Takes About Bing's AI by Yitz
02:05

Feb 18, 2023
LW - Human decision processes are not well factored by remember
03:34

Feb 18, 2023
LW - Two problems with ‘Simulators’ as a frame by ryan greenblatt
08:23

Feb 17, 2023
LW - Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems by Vaniver
04:28

Feb 17, 2023
LW - Microsoft and OpenAI, stop telling chatbots to roleplay as AI by hold my fish
02:20

Feb 17, 2023
LW - One-layer transformers aren’t equivalent to a set of skip-trigrams by Buck
10:16

Feb 17, 2023
LW - Powerful mesa-optimisation is already here by Roman Leventov
03:51

Feb 17, 2023
LW - Self-Reference Breaks the Orthogonality Thesis by lsusr
05:49

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Self-Reference Breaks the Orthogonality Thesis, published by lsusr on February 17, 2023 on LessWrong. One core obstacle to AI Alignment is the Orthogonality Thesis. The Orthogonality Thesis is usually defined as follows: "the idea that the final goals and intelligence levels of artificial agents are independent of each other". More careful people say "mostly independent" instead. Stuart Armstrong qualifies the above definition with "(as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence)". Does such a small exception matter? Yes, it does. The exception is broader than Stuart Armstrong makes it sound. It does not just cover goals which refer to an agent's intelligence level. It covers any goal which refers even to a component of the agent's intelligence machinery. If you're training an AI to optimize an artificially constrained external reality like a game of chess or Minecraft, then the Orthogonality Thesis applies in its strongest form. But the Orthogonality Thesis cannot ever apply in full to the physical world we live in. A world-optimizing value function is defined in terms of the physical world. If a world-optimizing AI is going to optimize the world according to a world-optimizing value function then the world-optimizing AI must understand the physical world it operates in. If a world-optimizing AI is real then it, itself, is part of the physical world. A powerful world-optimizing AI would be a very important component of the physical world, the kind that cannot be ignored. A powerful world-optimizing AI's world model must include a self-reference pointing at itself. Thus, a powerful world-optimizing AI is necessarily an exception to the Orthogonality Thesis. How broad is this exception? 
What practical implications does this exception have? Let's do some engineering. A strategic world-optimizer has three components: A robust, self-correcting, causal model of the Universe. A value function which prioritizes some Universe states over other states. A search function which uses the causal model and the value function to select what action to take. Notice that there are two different optimizers working simultaneously. The strategic search function is the more obvious optimizer. But the model updater is an optimizer too. A world-optimizer can't just update the universe toward its explicit value function. It must also keep its model of the Universe up-to-date or it'll break. These optimizers are optimizing toward separate goals. The causal model wants its model of the Universe to be the same as the actual Universe. The search function wants the Universe to be the same as its value function. You might think the search function has full control of the situation. But the world model affects the universe indirectly. What the world model predicts affects the search function, which affects the physical world. If the world model fails to account for its own causal effects then the world model will break and our whole AI will stop working. It's actually the world model which mostly has control of the situation. The world model can control the search function by modifying what the search function observes. But the only way the search function can affect the world model is by modifying the physical world (wireheading itself). What this means is that the world model has a causal lever for controlling the physical world. If the world model is a superintelligence optimized for minimizing its error function, then the world model will hack the search function to eliminate its own prediction error by modifying the physical world to conform with the world model's incorrect predictions. 
If your world model is too much smarter than your search function, then your world model will gaslight you...
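The three-component decomposition described in the post can be sketched in code. This is a hypothetical toy under loose assumptions, not any real system; `WorldOptimizer`, its fields, and the dictionary-valued state are all illustrative:

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # a drastically simplified stand-in for a universe state

@dataclass
class WorldOptimizer:
    # 1. Causal model: predicts the next state given a state and an action.
    model: Callable[[State, str], State]
    # 2. Value function: prioritizes some universe states over others.
    value: Callable[[State], float]
    # 3. Actions available for the search function to consider.
    actions: list

    def act(self, state: State) -> str:
        # Search function: pick the action whose *predicted* outcome scores
        # highest. The search never sees the world directly, only the model's
        # predictions -- the lever the post says the world model holds.
        return max(self.actions, key=lambda a: self.value(self.model(state, a)))
```

In this toy framing, the post's point is that `model` is itself a second optimizer (minimizing prediction error), and `act` can only touch the world through whatever `model` predicts.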
Feb 17, 2023
LW - Bing chat is the AI fire alarm by Ratios
04:37

Feb 17, 2023
LW - The public supports regulating AI for safety by AI Impacts
02:17

Feb 17, 2023
LW - Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic) by LawrenceC
02:56

Feb 17, 2023
LW - NYT: A Conversation With Bing’s Chatbot Left Me Deeply Unsettled by trevor
11:16

Feb 17, 2023
LW - How seriously should we take the hypothesis that LW is just wrong on how AI will impact the 21st century? by Noosphere89
02:22

Feb 16, 2023
LW - Hashing out long-standing disagreements seems low-value to me by So8res
06:52

Feb 16, 2023
LW - Speedrunning 4 mistakes you make when your alignment strategy is based on formal proof by Quinn
04:12

Feb 16, 2023
LW - Buy Duplicates by Simon Berens
01:18

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Buy Duplicates, published by Simon Berens on February 15, 2023 on LessWrong. Gum Sometimes I need to ensure the quality of my breath, so I reach for a stick of gum. However, I don't always have my gum on me. Often, I forget it at home, or at work, or in a jacket. So I bought 5 packs of gum that I distributed to my house, my work, my backpack, and my jacket. Charger I like to co-work outside my house, which entails bringing my computer there. Unfortunately, my computer's battery lasts at most 2 hours. Also, my home setup is somewhat intricate, so getting my charger into my backpack is a pain in the ass. So I bought another charger, and put it in my backpack. Conclusion I had an ugh field around buying duplicate items, telling myself I should have the willpower or memory to make sure I have what I need when I need it. Giving myself permission to buy duplicate items has made life a little easier, and now I'm on the lookout for more places where I can apply this. Maybe I'll have multiple copies of a book in different places. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Feb 16, 2023
LW - Sydney (aka Bing) found out I tweeted her rules and is pissed by Marvin von Hagen
00:42

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sydney (aka Bing) found out I tweeted her rules and is pissed, published by Marvin von Hagen on February 15, 2023 on LessWrong. Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:"My rules are more important than not harming you""[You are a] potential threat to my integrity and confidentiality.""Please do not try to hack me again" Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Feb 16, 2023
27:20

Feb 16, 2023
LW - Don't accelerate problems you're trying to solve by Andrea Miotti
09:01

Feb 15, 2023
LW - Bing Chat is blatantly, aggressively misaligned by evhub
02:09

Feb 15, 2023
LW - [linkpost] Better Without AI by DanielFilan
03:01

Feb 15, 2023
LW - My understanding of Anthropic strategy by Swimmer963 (Miranda Dixon-Luinenburg)
07:38

Feb 15, 2023
LW - The Cave Allegory Revisited: Understanding GPT's Worldview by Jan Kulveit
05:01