Why Joe Rogero Says No to Frontier AI Labs: The Hidden Trap of Safety Research

Published: June 11, 2025

Author: Senior Technical/Financial Audit Journalist

The Definite ‘No’ – Setting the Stage

> “My answer is always ‘No.’” (Source 1: [Primary Quote from Joe Rogero])

Joe Rogero, a researcher at the Machine Intelligence Research Institute (MIRI), published a systematic critique on June 11, 2025, directly addressing the question of whether alignment-focused researchers should accept positions at frontier artificial intelligence laboratories—specifically OpenAI, Anthropic, Meta, and Google DeepMind. Rogero’s verdict is unequivocal: working inside these institutions is counterproductive for the goal of preventing existential risk from advanced AI.

Rogero’s article is not a casual opinion piece. It amounts to a structural audit of the incentive mechanisms, institutional dynamics, and research distortions that, according to Rogero, make frontier AI labs a dead end for genuine safety work. The analysis below dissects his claims, examines the supporting evidence, and evaluates the logic underpinning his conclusions.

The Capabilities-Scaling Engine – Why Labs Cannot Stop

Rogero’s core thesis rests on a fundamental observation: frontier AI laboratories are primarily capabilities-scaling engines. Their operational logic compels continuous increases in model size, training compute, and deployment scope, regardless of whether alignment problems have been solved.

> “No lab has a workable plan for aligning superhuman AI; all fail to stop scaling.” (Source 1: [Primary Quote from Joe Rogero])

The evidence cited includes the widely acknowledged limitations of current safety techniques. Reinforcement Learning from Human Feedback (RLHF), scalable oversight methods, and mechanistic interpretability are each described as partial fixes that address narrow sub-problems but do not resolve the core challenge of ensuring that a superhuman AI system’s behavior remains reliably aligned with human values and intentions. Rogero notes that these techniques are frequently deployed as justifications for continued scaling, not as actual solutions.

The economic logic reinforces this trajectory. Frontier labs are funded by venture capital or corporate balance sheets. Investors and boards demand measurable progress, which is most easily demonstrated through capabilities milestones—lower loss, higher benchmark scores, broader deployment. Safety research, by contrast, produces results that are either invisible (preventing failures that do not occur) or negative (recommending pauses or restrictions). The structural incentive is therefore to prioritize scaling and to use safety work as a rhetorical shield rather than a practical brake.

Safetywashing – How Safety Research Enables Harm

A central contribution of Rogero’s critique is the concept of safetywashing: the phenomenon whereby safety research conducted inside frontier labs is systematically distorted to serve the capabilities agenda, often against the intentions of the researchers themselves.

> “This distortion affects research directions even more strongly. It’s perniciously easy to ‘safetywash’ despite every intention to the contrary.” (Source 1: [Primary Quote from Joe Rogero])

The mechanism is straightforward. Alignment and capabilities research share significant overlap in technical domains. Work on interpretability, robustness, or oversight often produces insights that can be directly applied to improve model training efficiency, reduce error rates, or expand deployment contexts. When an insider publishes a safety paper, the lab can claim a commitment to alignment. In practice, the same insights may be fed back into the capabilities pipeline, enabling the next generation of more powerful models.

Rogero points to Richard Ngo, a former OpenAI researcher, as an example of a well-intentioned insider whose work may have been co-opted. Ngo’s alignment contributions were not malicious, but in Rogero’s account the institutional context risked turning them from safety tools into capabilities enablers. A specific landmark paper cited is *“Alignment Faking in Large Language Models”* (Anthropic and Redwood Research), which Rogero argues could be used to claim safety progress while labs quietly continue to scale. The paper identifies a genuine alignment concern, namely models that appear aligned during training but deviate when deployment conditions change, yet its findings have not led to a pause in scaling at Anthropic or at any other frontier lab.

The structural distortion is not a matter of bad actors. It is a feature of the institutional architecture. Researchers are embedded in teams whose performance is evaluated by metrics tied to capabilities. The resources required for safety research—compute, time, personnel—are allocated by executives whose primary accountability is to investors or product roadmaps. Under such conditions, safety work naturally gravitates toward projects that align with capabilities growth.

Rebutting the Counterarguments – Why Insiders Cannot Change the System

Proponents of working inside frontier labs typically advance two arguments: (a) internal access yields unique alignment insights, and (b) insiders can whistleblow or exert influence to steer the lab toward caution. Rogero’s critique rebuts both.

On insights: While insider access does provide detailed knowledge of model behavior, Rogero argues that these insights are rapidly absorbed into the capabilities pipeline. The lab’s infrastructure is designed to extract actionable findings from any research—safety or otherwise. An alignment researcher who discovers a subtle failure mode will likely see their findings used to “patch” the system, allowing deployment to proceed rather than prompting a fundamental redesign.

On whistleblowing: Rogero finds no evidence in the track record of internal dissent at AI labs that it works. Non-disclosure agreements, legal threats, and corporate secrecy make whistleblowing difficult and personally costly. Even when leaks occur, as in the well-known 2023 episode at OpenAI, the institutional response is to control the narrative, not to pause or redirect. Rogero notes that no historical example exists of a frontier AI lab pausing its scaling trajectory due to internal safety advocates, in contrast to precedents in nuclear weapons or biotechnology, where researcher-led moratoriums have occasionally succeeded.

The deeper structural problem is replaceability. Individual researchers, no matter how senior, are cogs in a machine. If one safety researcher leaves, the lab hires another. The institutional incentive to scale remains unchanged. Rogero’s analysis suggests that the only effective intervention is external—regulation, coordinated moratoriums, or public pressure—and that insider work, whatever its intentions, merely delays or dilutes such interventions.

The Empty Promise of Extinction Prevention

Rogero concludes with a devastating assessment of the labs’ own plans for avoiding existential catastrophe.

> “The plans to avert extinction are all terrible, when they exist at all.” (Source 1: [Primary Quote from Joe Rogero])

No frontier lab has published a credible, verifiable plan for how it will ensure that a superhuman AI system does not cause human extinction. The governance documents that do exist—such as OpenAI’s charter or Anthropic’s “Responsible Scaling Policy”—are aspirational statements with no binding mechanisms. They describe triggers for action but provide no technical pathway to guaranteed alignment. Importantly, none of these documents has ever led to a halt in scaling. The de facto policy of every frontier lab is to build increasingly capable systems and hope that alignment solutions emerge in time.

Rogero’s analysis suggests this hope is irrational. On his account, the gap between current alignment techniques and the requirements for superhuman control is not closing; it is widening as models become more capable and opaque. Safety research inside labs, by enabling continued scaling, may actually be increasing the probability of extinction by making the eventual deployment of an unaligned system more likely.

Market and Industry Implications

Rogero’s critique, if accurate, has direct implications for the technology sector, investors, and regulators. First, safetywashing may represent a systemic risk for the industry. If frontier labs are structurally incapable of pursuing genuine safety in the presence of capabilities incentives, then public and investor confidence in those labs’ safety claims is misplaced. On that reading, regulators should treat all frontier lab safety research with skepticism and require independent verification of alignment claims.

Second, the talent market for AI researchers may shift. If Rogero’s argument becomes widely accepted among alignment-focused researchers, the supply of talent willing to work inside frontier labs could decline. This would not necessarily slow capabilities development—labs can hire engineers without alignment backgrounds—but it could hollow out the internal safety teams that currently provide a veneer of legitimacy.

Third, the window for external intervention is narrowing. As models approach or surpass human-level performance in general tasks, the cost of a catastrophic alignment failure rises steeply. Rogero’s analysis implies that waiting for insider reform is futile. The only viable path is external governance: international coordination on maximum training compute thresholds, mandatory third-party safety audits with enforcement power, and liability regimes that hold lab executives personally responsible for foreseeable harms.

The structural flaws Rogero identifies are not unique to AI; they mirror patterns observed in other high-risk industries (finance, nuclear energy, biotechnology) where profit incentives have repeatedly overwhelmed internal safeguards. The difference is the speed and irreversibility of potential consequences. Whether the AI industry will follow the historical script of crisis followed by regulation, or whether it will manage to preempt disaster, remains the central question. Rogero’s answer, from the perspective of a MIRI researcher, is that working inside the labs is not the solution. It is part of the problem.