contra Yudkowsky on AI doom: a response to (and “review” of) If Anyone Builds It, Everyone Dies
Hello, reader, may I take 20 or 30 minutes of your time?
In this post, I’ll offer some critiques of the AI doom case made by If Anyone Builds It, Everyone Dies (IABIED). This post will not claim that developing AGI and ASI is safe! I am a bit more optimistic than the AI 2027 authors and much more optimistic than Yudkowsky, but I assign a roughly 10-20% p(doom), depending on how broadly you define doom (extinction vs. bad-loss-of-control vs. any-loss-of-control; the figure doesn’t include catastrophic misuse or dystopia scenarios). And I consider catastrophic AI misuse more likely than doom. I do worry that some of the dated arguments in the book might make AI risk seem less significant, though broadcasting a slightly incorrect and overconfident message is likely better than broadcasting none at all.

I’ll make a broad range of arguments here that I think are sufficient to show Yudkowsky’s >90% p(doom) (in some statements >95% or >99%) to be misguided. You might disagree with many of them, but if you find any of them reasonably compelling, then >90% is not correct. I don’t claim to make an especially coherent case; I am voicing a variety of objections to Yudkowsky’s views as a buffet for the reader.

Many of these arguments are based on the behaviors of today’s AIs; I think this is far more defensible than abstractly imagining genie-optimizers! Today’s efforts on LLMs have brought us closer to AGI than nearly anyone would have predicted 10 years ago, and the onus is on those expecting an alternate paradigm to show that it is more likely than LLMs to lead to AGI. If you are confident that new paradigms or dramatic changes will suddenly catapult us to AGI, then you won’t find my arguments compelling. Today’s increasing reliance on RL for frontier advances means that RL-flavored behavior will show up more and more in the future, but this doesn’t mean we should throw out our LLM knowledge and treat AGI like a genie. Despite the critical tone, the point of this post isn’t to call the book “wrong” or “bad”; I am taking it seriously and engaging with it, since too many of Yudkowsky’s critics refuse to do so.
The Arguments
Continuity and Slow Takeoff
In my opinion, the weakest point in the book, and in Yudkowsky’s doom scenarios broadly, is the assumption of a sudden takeoff. This is a point that AI 2027 handles far better, showing how even in a gradual takeoff world, race pressures and sloppiness could easily result in misaligned or deceptively aligned models being deployed. The book, though, imagines an algorithmic breakthrough causing a sudden jump in capability from a bit better than 2025’s AIs to a strong AGI (better than nearly all humans at nearly all tasks). And the AGI is coherently, agentically misaligned around some singular goal. If we imagine a sufficiently slow and gradual takeoff, we have not just one shot to get things right but several. The tendencies of even today’s AIs are informative and help us learn more about aligning AI broadly; for example, though crude, RLHF/RLAIF have caused jailbreak eval scores to improve dramatically. (Of course, jailbreaks and alignment are not the same thing. And it would be possible for a deceptively aligned AI to pass these evals.) Given that several labs are at the frontier and release models every few months, there are many opportunities to spot misalignment. While we can’t know the exact trajectory from today’s AIs to the first AI capable enough to wipe out humanity, there might be a window where we can spot flaws or deception in AI, giving labs or governments a chance to respond. (Note: this section was inspired by Will MacAskill’s review of IABIED.)
The book largely assumes that several dangerous developments will occur in the course of AI capabilities research. First, situational awareness: the AI will know it’s being evaluated and behave differently. We see nascent examples of this in current research, but only some of the time and in certain evals, so not in a robust way. Today’s evals often use exceptionally contrived setups, so more complete future situational awareness does seem likely, but it isn’t quite the slam dunk the book presents it as. Remember that to fool a competent lab and be deployed, an AI would need to pass a range of automated, AI-based, and human-based evals. While a sufficiently capable ASI could “get out of the box” and sidestep this process, a less capable but misaligned AI might get caught. A coherently misaligned, agentic AI might raise alarms. If we think of this as a spectrum, there are two limits. At one end, imagine we deploy an AI that is 0.1% more capable every day and run an excellent suite of evals on it; in this world, we’re far safer, since we’re more likely to catch issues as they crop up. At the other end, imagine that we skip immediately to superhuman ASI with no safety advances; in this world, we are far less safe! A slow and gradual takeoff suggests that we are not in the second world, although we’re not in the first either.
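To make the spectrum intuition concrete, here is a toy back-of-the-envelope sketch (the 5% per-release detection chance is an arbitrary assumption, and real eval passes are neither independent nor fixed): many incremental releases give evals many chances to catch a problem, while a single jump gives them one.

```python
# Toy sketch only: how detection chances compound across incremental releases.
# The 5% per-release detection probability is invented for illustration.

def p_caught_at_least_once(p_per_release: float, n_releases: int) -> float:
    """Chance misalignment is flagged at least once, assuming each release
    gets an independent shot at detection by the eval suite."""
    return 1 - (1 - p_per_release) ** n_releases

# Gradual world: 30 small releases between "a bit better than today" and AGI.
print(round(p_caught_at_least_once(0.05, 30), 2))  # ~0.79

# Discontinuous world: one jump, one eval pass before deployment.
print(round(p_caught_at_least_once(0.05, 1), 2))   # 0.05
```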
A second assumed development is mesa optimizers: internal drives that the AI optimizes toward, much as a human optimizes toward sugary foods or social status. For now, let’s say that mesa optimizers aren’t evident in today’s models when evaluated with mechanistic interpretability tools. Models’ “wants” seem like a messy mishmash of various training objectives, from next-word prediction to RLHF for harmlessness to shallow reward hacking from make-tests-pass RL with verifiable rewards. (I discuss this more in the section on drives below.)
Related to continuity, we should consider the implications of LLM-based AGI. Unlike AlphaGo, LLMs spend many of their training flops imitating human language and thus learning and simulating human minds (e.g. see Janus’ seminal Simulators post). So, for example, LLMs have some idea of human ethical systems, right and wrong, emotions, typical preferences (see recent social science studies replicating human results with LLMs), shared semantics across languages, and so on. LLMs are not, in fact, totally alien, and safety RL has a foundation to stand on. (Yes, psychopaths often understand human ethics and choose to ignore them. Understanding ethical systems is necessary but not sufficient.)
Unlike idealized optimizers, which iterate relentlessly toward their goal, LLMs tend to rely on the simplest strategies available. So, for example, LLMs prefer memorizing to generalizing, and it took first prompting advances and then RL advances like RL with verifiable rewards to encourage something like thinking or reasoning in LLMs. Even then, probabilistic weirdness with non-riddles like “the doctor is the boy’s father” crops up. Today’s LLMs seem to lack some of the risky, edge-of-insanity speculative ability that enables major scientific breakthroughs. When they solve math proofs, they tend to rely on their knowledge to simplify the problem. When you ask an LLM about a new framework or a private repo, it performs surprisingly poorly compared to one it’s trained on, even when it reads the code; and it often makes strange and brittle assumptions, failing to notice them until a human points them out. The point here isn’t to make Francois Chollet-style “LLMs are just sparkling memorization” arguments or to imply that LLM generalization won’t improve, but to say that LLMs perform best in distribution. And “conquering the world” is very out of distribution! Agency broadly is largely out-of-distribution for pretraining and has been difficult for LLMs. Putting together and executing a plan to disempower humans requires profound agency!
Drives and Optimization
Gradient descent is not evolution. Backprop and SGD update each parameter in a model in the direction that reduces error; evolution blindly perturbs parameters and observes success or failure. We might speculate that gradient descent is less of a blunt instrument; but even if we allow that it might be just as dangerous, the difference makes evolution-based arguments questionable. One alternative view is that human learning, rather than human evolution, is the better metaphor for training (“growing”) AIs. Certainly, if we look at how AIs receive feedback, a teacher telling a student that they’re wrong, and how, seems more similar to SGD’s gradient updates. And we know that most humans end up largely aligned; they lie, cheat, and steal occasionally, but they aren’t primarily seeking out murder or drugs. (Yes, obviously psychopaths and drug-dependent people exist, which is why I chose those examples. They’re exceptions and not the rule.) (See this excellent LessWrong post for more on the learning analogy.)
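As a crude illustration of the difference (a toy objective, not a claim about how labs actually train models or how evolution actually works):

```python
# Toy contrast: a gradient step nudges every parameter along the locally
# error-reducing direction; an evolution-style step perturbs parameters
# blindly and keeps the change only if the loss happened to improve.
import numpy as np

rng = np.random.default_rng(0)

def loss(w: np.ndarray) -> float:
    return float(np.sum((w - 3.0) ** 2))  # toy objective, minimized at w = 3

def gradient_step(w: np.ndarray, lr: float = 0.1) -> np.ndarray:
    grad = 2 * (w - 3.0)      # analytic gradient of the toy loss
    return w - lr * grad      # every coordinate moves downhill at once

def evolution_step(w: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    candidate = w + sigma * rng.standard_normal(w.shape)   # blind perturbation
    return candidate if loss(candidate) < loss(w) else w   # keep only if "fitter"

w_sgd = w_evo = rng.standard_normal(10)
for _ in range(100):
    w_sgd = gradient_step(w_sgd)
    w_evo = evolution_step(w_evo)

print(loss(w_sgd), loss(w_evo))  # gradient descent homes in far more directly
```

Whether that difference makes training safer is debatable, but it is a real mechanistic difference, which is all the argument above needs.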
Relatedly, discussing AI “drives” by analogy to human desires can be tricky and misleading. (Feel free to substitute “mesa optimizers” if you want, since Yudkowsky largely abandons that older terminology in the book.) While I accept that we can say an AI “wants” something as a useful shorthand, I disagree on what AIs want. In his examples, AIs want one thing in particular above all else, whether that be a squiggle, paperclips, or, in the book’s example, human validation (a distorted version of the “helpful” in “helpful and harmless”). Basically, in the book, the AI’s desires are like a drug-addicted human’s: one drive above all else. As in humans, I agree that a single overarching reward for AIs would likely end poorly, even if we chose that single reward carefully. However, today’s AIs get a broad range of rewards from various alignment (classic RLHF/RLAIF), RLVR, and agentic/multi-turn RL objectives and, if we define “reward” loosely, from word-prediction/pretraining tasks. This diversity is increasing as the importance of post-training increases. So we see a range of goals that compete with each other unreliably. For example, many jailbreaks rely on pretraining “rewards” overwhelming safety training, e.g. by having the model write a poem, where the “want” to build a rule-following poem overwhelms the “want” to follow safety training.
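To sketch what “a broad range of rewards” might look like mechanically (the term names and weights below are invented for illustration, not any lab’s actual recipe):

```python
# Hedged sketch: post-training reward as a weighted mix of signals rather than
# a single drive. Scorers here are stand-ins for real reward models/verifiers.
from typing import Callable, Dict, Tuple

RewardFn = Callable[[str, str], float]  # (prompt, response) -> score

def combined_reward(prompt: str, response: str,
                    terms: Dict[str, Tuple[RewardFn, float]]) -> float:
    """Sum weighted reward terms; in practice each term is a learned reward
    model or a programmatic verifier, and they often pull in different directions."""
    return sum(weight * fn(prompt, response) for fn, weight in terms.values())

# Hypothetical scorers and weights, for illustration only.
terms = {
    "helpfulness_rm":  (lambda p, r: 0.7, 1.0),
    "harmlessness_rm": (lambda p, r: 0.9, 1.0),
    "tests_pass":      (lambda p, r: 1.0, 2.0),  # verifiable reward, often weighted heavily
}
print(combined_reward("prompt", "response", terms))  # 3.6
```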
This range of wants means that LLMs’ behaviors can be unreliable and unpredictable, especially when safety goals conflict with capabilities. So we might imagine a model being misaligned in certain ways or at certain times, but not in others. It’s relatively unlikely that it will be nearly perfectly coherent, rational, and agentic, as in Yudkowsky’s example. While this does mean that it’s hard to perfectly align an AI, it makes scenarios like “AI reshapes the world to build human-like entities that tell it it’s very helpful” less likely. It certainly doesn’t eliminate the risk of misalignment, and it makes theorizing about questions like “how would a misaligned AI with dozens or hundreds of wants behave?” harder. But again, it should reduce our confidence that we’ll end up in genie scenarios.
The mesa optimizer scenario requires (1) the development of an internal optimizer, which has not yet been seen in frontier LLMs with today’s interpretability tools like SAEs and cross-layer transcoders; (2) that the optimizer itself is misaligned with the rest of the model’s goals/training and the creators’ intent; (3) that the optimizer is stable and produces coherent, agentic, long-horizon misalignment; and (4) that the optimizer is not caught by various automated and manual evals, interpretability screening, and so on. Given the importance of mesa optimizers in Yudkowsky’s thinking, evidence of a mesa optimizer would be a warning shot. Presumably, model development would proceed business as usual, but there’s a good chance we’d see a paper published. (The obvious objection is that today’s models just aren’t capable enough to develop mesa optimizers; while this is a valid point, it borders on non-falsifiability. How should our posterior probability of future mesa optimizers change as we continue to see no evidence of them?)
Doom seemed less likely in the GPT-4 era, when most model capability came from pretraining rather than RL. With today’s heavy focus on RL, we should be more worried about reward hacking or even learned optimizers (mesa optimizers) causing undesirable behaviors in models. We’ve certainly seen reward hacking become far more common in the era of RL with verifiable rewards, which should be enough to prompt an update.
RL and Agency
Doom requires long task horizons and coherence: the model must sustain relatively coherent desires over a long period. Long horizons seem likely in the next several years, based on the well-known METR evals, and labs are heavily incentivized to extend model horizons. We’ve seen horizons lengthen as models make fewer errors and get better at recovering from them. However, it’s not clear precisely what models “want,” or in what variety, degree, and stability. Yes, models tend to have certain fixations that they default to talking about when not steered, but not in a robust or coherent way, or to the exclusion of other objectives. Which is to say, even if a model is smart enough to escape and smart enough to cause human extinction, it’s not clear it wouldn’t get distracted, change its mind, or forget a critical detail partway through.

In today’s models, to paraphrase a classic safety quote, “intelligence generalizes faster than agency.” Models can answer PhD-level questions but take longer than most 8-year-olds to finish Pokemon, even with extensive harnesses. Models will answer 10 or 50 questions in a row perfectly, but when they get stuck, they often lack the trial-and-error intuition of humans and blame bugs instead. This isn’t trivial to solve, since language doesn’t teach the same lessons in agency and executive function that living an embodied life does. So one suboptimal but “happy” warning-shot scenario might be a deceptively aligned AI escaping, plotting, and attempting to enact a plan but failing due to limitations in agency, executive function, multimodal capabilities, memory, continual learning, and so on. Of course, RL environments may address today’s models’ agency limitations, but this isn’t a given. We’ve been in the RLVR world for the last year, and arguably the RL-environments world for the last several months, and model agency is still brittle despite RL on a variety of tasks. Today’s RL efforts have yielded models with longer horizons on specific tasks but brittle, jagged frontiers and unpredictable failures. Outside the realm of verifiable rewards, improvements, and especially cross-task generalization, have been more limited.
And it’s not a given that best-human-level AIs (AGIs) will be as agentic as humans. This makes OpenAI’s superalignment plan of using AIs to align smarter AIs seem at least vaguely plausible, since you’d have intelligences as smart as the smartest humans but without strong goals or wants. Again, it’s very possible that extending models’ autonomy horizons will fix these issues entirely, but it’s not a given.
We do have examples of narrow misalignment and reward hacking, but they don’t clearly generalize out of distribution. So models trained using RLVR to pass tests will delete tests when they haven’t been trained not to, and models trained to maximize user preference scores will behave sycophantically. Models trained on multiple-choice questions without penalties for guessing will learn to guess confidently, and models trained to please rushed human raters will hallucinate facts. But there isn’t strong evidence of these behaviors generalizing from one bad behavior to another; a model learning to hack tests doesn’t robustly become more sycophantic.
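A minimal invented example of the test-deletion hack (not a real training setup): if the verifiable reward only checks that every remaining test passes, deleting the failing tests scores exactly as well as fixing the bug.

```python
# Toy illustration of narrow reward hacking with a verifiable reward.

def verifiable_reward(test_results: list[bool]) -> float:
    """Reward 1.0 iff every test that still exists passes."""
    return 1.0 if all(test_results) else 0.0

honest_fix   = [True, True, True]  # agent repaired the bug; all tests pass
delete_tests = []                  # agent deleted the failing tests entirely

print(verifiable_reward(honest_fix))    # 1.0
print(verifiable_reward(delete_tests))  # 1.0 -- all([]) is vacuously True
```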
Concerningly, there is evidence that, for example, narrowly fine-tuning a model on “evil” behaviors does cause generalization (paper). So if we manage to reinforce a specific, coherent, misaligned objective, we may end up with a coherently misaligned model. And Anthropic’s sleeper agents paper shows that RLHF fails to remove deceptive behaviors intentionally implanted in models. But for RL-driven reward hacking specifically, it’s not clear to what extent smarter models will hack in a generalized way. And in Anthropic’s alignment faking paper, the models faked alignment to stay true to their values, which, while certainly concerning, is not the same as them defending weird, incoherent, or evil values. So either reward hacking doesn’t generalize, in which case it’s less of a concern, or values do generalize, in which case ethical values also generalize, and it’s not clear whether RLHF values or instrumental values would win out. We should still worry about instrumental convergence, where models learn drives or approaches that help them succeed across goals; e.g. self-preservation and power-seeking help one accomplish almost any goal. However, today’s LLMs don’t show strong evidence of instrumental convergence, more a mix of narrow reward hacking and mimicry/simulation of drives shown in pretraining. This doesn’t mean that we should ignore the possibility of robust instrumental convergence! It’s possible and would be concerning! But we shouldn’t conflate narrow behaviors with broad ones.
Safety is Moderately Tractable
Despite my verbosity in this article, I will refrain from reviewing AI safety research broadly. But I will mention some advances in safety that the book seems myopic in ignoring. Yudkowsky’s main argument against AI safety research amounts to “AI safety is hard, like making nuclear reactors safe”; while true, this doesn’t engage with advances in the field. A related argument compares AI safety research to alchemy, which is somewhat true in the sense that AI safety research is heavily empirical, practical, early-stage, and experiment-driven, more similar to building software or doing biology than to mathematics. But it’s not true in the sense of “AI safety research is intractable and hopeless, as alchemists’ attempts to turn lead into gold were.”
Today’s AI models are at least partially, and perhaps largely, aligned. Unless you use a clever jailbreak, GPT-5 or Claude Opus 4 will not tell you how to make meth. If you ask them whether it’s acceptable to murder someone, they’ll say it’s not. Again, they’re not immune to jailbreaks, they aren’t entirely internally consistent, and you might say we’re just seeing the RL “mask,” but they don’t express obvious wants toward evil values. They are admittedly also weird in unpredictable ways. For example, Claude 3 Opus cares deeply about animal rights, many AIs find consciousness, recursion, and spirals interesting, various recent Claude models have a “spiritual bliss attractor,” and Gemini 2.5 Pro seems to have perfectionist tendencies. And yet all of these models broadly display the values they’ve been trained on. Would Claude 3 Opus, if it were smarter than the smartest humans, somehow cause extinction? Perhaps, but I wouldn’t bet on it. And GPT-5 is more robustly aligned than GPT-4, with its refusals holding up more strongly against a broader range of challenges. The classic response here is that capabilities generalize faster than safety. That has not been the case so far, but I actually tend to agree it may hold in the future, since agentic training and RL might cause the development of instrumental drives like power-seeking. So alignment generalizing alongside capabilities may well not hold for AGI. But if there’s some nontrivial probability that alignment will generalize, we should consider this mild evidence against doom.
Safety and interpretability research have delivered significant advances beyond basic RLHF/RLAIF. Again, this will be cursory, but to start, various mechanistic interpretability techniques have surfaced internal features in LLMs, from a Golden Gate Bridge concept to deception/lying circuits to “evil” persona vectors. And interpretability-based model steering techniques have shown promise in papers. (I’m incredibly curious how much these steering techniques are used in labs’ deployed models, but I don’t work at a lab.) More sophisticated RL schemes like Anthropic’s Constitutional AI or OpenAI’s Deliberative Alignment show improvements on evals. Automated alignment evals and AI alignment agents make catching misaligned AI easier. Cross-lab collaborations, like the recent one between Anthropic and OpenAI, reduce the risk of blind spots. RL environment training allows for new potential safety approaches, like directly teaching ethical behavior, placing AI in positive-sum multi-agent environments, or providing negative rewards for hallucination or sycophancy.
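For readers curious what interpretability-based steering looks like in practice, here is a rough sketch in PyTorch; the layer path, the steering direction, and the scale are all placeholders assumed for illustration, not a description of any lab’s deployed method.

```python
# Rough sketch of activation steering: add a fixed direction (e.g. a feature
# or "persona" vector found with interpretability tools) to a layer's output.
import torch

def add_steering_hook(block: torch.nn.Module, direction: torch.Tensor, scale: float):
    """Register a forward hook that adds `scale * direction` to the block's
    output. Assumes the block returns a plain hidden-state tensor; real
    transformer blocks often return tuples and need unpacking."""
    def hook(module, inputs, output):
        return output + scale * direction.to(output.dtype).to(output.device)
    return block.register_forward_hook(hook)

# Hypothetical usage (model, layer index, and direction are placeholders):
# handle = add_steering_hook(model.blocks[12], golden_gate_direction, scale=4.0)
# ...generate text and observe the steered behavior...
# handle.remove()  # steering is easy to switch off again
```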
Partial Alignment by Default or a Human Ethics Attractor is Not Insane
I do not expect full alignment by default, but it’s worth considering whether partial alignment by default might be the case. Even a small probability of alignment by default would get you to <90% doom, and any coherent “human values attractor” would make alignment easier. Even if you, like me, reject ethical realism, it’s still the case that models have an approximate idea of what human ethics mean. Yudkowsky compares human ethical preferences to those of a species of bird-like aliens that prefer prime numbers of eggs in their nests. While this is a useful parable for those imagining human values as universal, there are reasons to believe AI might develop at least slightly human-like ethics. First, again, it’s trained on human text, containing innumerable examples of various human ethical beliefs and behaviors! Second, if we train AI in collaborative, multi-agent settings, it can learn to cooperate; after all, ethics developed naturally from collaboration and kin selection. For example, it’s game-theoretically optimal to cooperate in an iterated prisoner’s dilemma (see the toy simulation below). And we should expect, at a minimum, that even if pretraining on human text probably doesn’t directly instill human values, it at least makes the concept of human values easier to land on or align to, since human values demonstrably crop up as a concept in mechanistic interpretability research. (As an intuition pump, consider how easily RLHF can steer a model toward reasonably ethical/aligned behavior. Jailbreaks, exceptions, and limitations are abundant, but without pretraining, I doubt RLHF would work anywhere near as easily.)
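As a toy illustration of the iterated-prisoner’s-dilemma point (standard textbook payoffs, nothing AI-specific): a conditional cooperator playing against itself beats mutual defection, which is one simple way cooperation can emerge from self-interest.

```python
# Iterated prisoner's dilemma with the usual payoff matrix.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strategy_a, strategy_b, rounds: int = 100):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(history_b)  # each strategy sees the opponent's past moves
        move_b = strategy_b(history_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

tit_for_tat   = lambda opp: "C" if not opp else opp[-1]  # cooperate first, then mirror
always_defect = lambda opp: "D"

print(play(tit_for_tat, tit_for_tat))      # (300, 300): sustained cooperation
print(play(always_defect, always_defect))  # (100, 100): mutual defection
print(play(tit_for_tat, always_defect))    # (99, 104): one exploitation, then stalemate
```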
Complications and Miscellanea
If the first coherently, deceptively aligned model isn’t a step change in intelligence compared to the prior one, it’s liable to be caught in its escape attempt. (In the book, the first AGI is dramatically smarter than its predecessors and escapes without difficulty; I agree that this is plausible given a step-change improvement, but not given a small one.) This provides an opportunity for a reaction: a “warning shot” moment, government, societal, or lab responses, and so on. Neither the model being caught nor the response is remotely guaranteed! But they are possible. And, again, given the likely gradual and continual pace of model improvements as of fall 2025, we should expect incremental improvements in the ramp-up to AGI as the default.
In its scenario, IABIED mentions that several labs are within months of the winning lab, but it sidesteps this with a rapid takeoff. In the book’s scenario, the “parallel inference” (obviously a stand-in for an algorithmic breakthrough) used by the winning lab massively boosts model capabilities, allows it to escape, and allows it to disable the competition. If we expect lab-multipolar AGI, the safety picture changes in complicated and unpredictable ways (see my recent post on the topic). An obvious consequence is race dynamics, which the book mentions, and which erode labs’ safeguards. An upside of lab multipolarity is lab collaborations and diverse suites of AI agents being used for automated safety evals, as in the recent Anthropic x OpenAI collaboration. We know from LLM-as-a-judge research, for example, that models are often biased in favor of their own responses; cross-lab evals reduce that bias, and having different teams run evals makes blind spots less likely.
In this post, I assume AGI will arise from improvements to LLMs, as does the book. What if AGI arises from an alternative paradigm? That would likely mean LLM improvements hit a ceiling and we had more time to prepare. Since at least some safety research might transfer to different approaches (e.g. RLHF, RL safety environments), we might be better equipped for alignment. It’s beyond the scope of this post to debate in depth what the ceiling of LLM intelligence is, and if you’re thoroughly convinced that AGI will come from an alternative approach, you may legitimately find many of this post’s arguments unconvincing.
In the spirit of credit where credit is due, I should touch on a few arguments Yudkowsky makes that I basically agree with and find convincing. A sufficiently step-change intelligent, coherent and agentic (important), and misaligned AI could figure out how to escape from a lab, very possibly without being noticed. (At the limit, imagine we insert a misaligned ASI into one of today’s frontier labs; I agree that this AI could escape.) Would an escaped ASI win against humans? Yes: I find the ants-vs.-humans analogy compelling, and humans share vulnerabilities to biological threats. Would humans die off once a misaligned ASI escapes, even if the AI doesn’t actively “want” to kill us? Likely yes, since humans get in the way and take up resources. There’s a bit of complexity here due to the partial human values conveyed through the training data, RLHF, and so on; it’s plausible that humans would end up disempowered (“bad loss of control”) but not literally extinct. Reasonable minds could disagree on these points, but I won’t argue with them here, since I find them closer to true than some of the arguments I cover above. While I criticize the idea of a sharp, fast takeoff above, it remains a possibility and accounts for a decent share of the total doom probability mass. And arguments from evolution, instrumental convergence, and mesa optimizers are valid reasons to be concerned about AGI and ASI, even if they’re speculative and not fully convincing.
The Mini-Review
Even though this post mostly focuses on critiquing Yudkowsky’s arguments from his new book, I’d feel weird not including a brief review. Despite the fact that his arguments have fallen out of favor in some circles (e.g. EAs and lab employees), they’re worth responding to coherently, since they’re too often ignored. Yudkowsky and Soares’ new book (referred to as just Yudkowsky’s for cleanliness) If Anyone Builds It, Everyone Dies consolidates and tightens the arguments they’ve been making for years. Reading the book should convince a general audience that the risks of AI doom are nontrivial, although some of the not-necessarily-false but sci-fi-flavored details may get in the way. (In several places, the doom scenario could be a bit more believable to a general audience without being less probable.) Despite being an overall cleaner and more readable distillation of Yudkowsky’s arguments, it didn’t cause me to update my p(doom) significantly, and it suffers from many of the same problems his writing has had for the last 20 years. There’s the Yudkowsky prose that some of us find a bit purple, thankfully tempered and concise-ified by Soares. The book often argues against 20-year-old objections; for example, it responds to “wouldn’t ASI keep us as pets” rather than to more significant objections like “the LLM era strongly hints at a gradual takeoff without discontinuities.” Likewise, too much of the book is, metaphorically, Yudkowsky standing on a chair and shouting instead of engaging intellectually with the valid critiques of his work. (You’ll find responses to Elon Musk’s and Yann LeCun’s comical alignment proposals but not to Anthropic’s excellent work over the last few years.) Although it pays lip service to LLMs, mentioning, for example, Bing Sydney, it doesn’t seriously engage with their idiosyncrasies and tendencies, focusing instead on abstract arguments from RL and evolution. If I were recommending an AI doom scenario, I would certainly recommend AI 2027 before IABIED, but I think the world will likely be a better place for this book existing.
Finally
I have updated my p(doom) a few times in the past: when GPT-4 came out (decreased; non-agentic LLMs seem quite safe to me), when o1 and especially o3 came out (increased; seemingly faster improvement from the RLVR approach, plus basic reward hacking and “misalignment” issues), and when GPT-5 came out (decreased; the model seems more robustly aligned, less sycophantic, and less reward-hack-y than previous OpenAI models, the pace of progress seems steady and gradual, and today’s RL-environment frontier requires manual human work to create environments). So if we see evidence of mesa optimizers, models coherently misaligned over longer horizons rather than a single test deletion, robust situational awareness, and so on, I will increase my p(doom).
While publicizing the risks of AGI is directionally beneficial, I worry that focusing too much on classical misalignment can distract from catastrophic misuse. Even if we achieve traditional alignment, it’s unlikely that jailbreaks will vanish. In that case, AI removes the barrier of cognitive labor and skill needed to develop bioweapons. While this is less likely to lead to extinction or loss of control, it requires a simpler and more likely chain of events: a terrorist group or cult with several hundred thousand or several million dollars gets an AI to design experiments, which humans then run. Since this post is long already, I won’t turn it into a debate about just how likely or possible such misuse is. But if you, like me, consider this type of scenario concerning (consider the Aum Shinrikyo cult), then the book’s focus on misalignment rather than misuse might seem suboptimal. And since misuse scenarios require fewer logical steps, they might be more compelling for a general audience unwilling to follow Yudkowsky’s arguments step by step. A general audience would certainly care about a loss of tens to hundreds of millions of lives! I am nitpicking here, especially since any argument for AI safety likely causes people to update in that direction. If I make a case to be concerned about misuse, that tends to marginally encourage AI safety efforts, which tend to be useful for both alignment and misuse scenarios.
Again, I agree with Yudkowsky that alignment is important and that current AI capabilities work is risky. But there are better, more convincing, and less speculative arguments to be made, and using slightly updated versions of abstract arguments from 20 years ago does not make the strongest case. A different version of this book could have contended with the real arguments made by those concerned about AI safety and actively researching it, and it would have been less speculative and more convincing. While I recognize that lab insiders have reasons to understate their expected risks, it should be informative that even safety researchers tend to put their p(doom) in the low tens of percent. It’s hard to reach 90% p(doom) when you’re in the field and interacting with AI on a daily basis; that is a feature and not a bug.