
[EDRM Editor’s Note: EDRM is proud to publish Ralph Losey’s advocacy and analysis. The opinions and positions are Ralph Losey’s copyrighted work. All images in this article were created by Ralph Losey using AI. This article is published here with permission.]
Dario Amodei, co-founder and CEO of Anthropic, has written another important article you should read: The Urgency of Interpretability. He is very concerned that scientists have created a powerful new technology that no one fully understands. It is like alien technology, and it reminds me of the black monoliths in Stanley Kubrick’s film 2001: A Space Odyssey. The message of Amodei’s essay is that we must be able to peer into the black monoliths of AI, and soon, or who knows what may happen.

An Old Problem Suddenly Becomes Urgent
This is not a new problem. We have never really understood how generative AI works the way we understand all other computer code. For example, if a character in an old-style video game said something, or your delivery app suggested a tip, someone wrote those specific lines of code. The human programmer made it happen. Generative AI, though, is different. When an AI summarizes a dense document or writes a poem, the reasoning is not laid out in neat steps that we can easily follow. We don’t know the details of what it is doing. As Amodei puts it:
People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology. For several years, we (both Anthropic and the field at large) have been trying to solve this problem, to create the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model. This goal has often felt very distant, but multiple recent breakthroughs have convinced me that we are now on the right track and have a real chance of success.
This is an old problem, but now, all of a sudden, it has become an emergency. We need an AI MRI, and we need it now! Why? Because this alien tech is progressing much faster than Amodei ever thought possible. He thinks that his company, Anthropic, and others could reach AGI levels as soon as 2026 or 2027. Since he says pausing AI advancement is impossible, and gives good global security reasons for that view, we must at least lift some of its veils. We have to crack some of the mysteries and peer into the monoliths to figure them out. They may not be 100% benign.
We just don’t know, because we don’t really know how they work. That is the danger. What will happen when we seize the cheese of AGI, as this image suggests? The AGI bait is very tempting, but better to look at the strange tech carefully before you go for it.

We Need an AI MRI
Amodei has some good news to report: we are starting to be able to peer inside AI because of breakthroughs in mechanistic interpretability, such as the identification of features and circuits. He thinks this offers a promising path toward a comprehensive ‘AI MRI.’ Only then can Amodei breathe easy. With an AI MRI, maybe even Nobel Prize winner Geoffrey Hinton can start to smile.
Right now, Hinton, often called the godfather of AI, seems to be the most terrified scientist of them all. He recently said: “the best way to understand it emotionally is we are like somebody who has this really cute tiger cub, unless you can be very sure that it’s not gonna want to kill you when it’s grown up, you should worry.” Although Hinton says it’s really just a wild guess, he finds himself agreeing with Elon Musk that there is a “sort of 10% to 20% chance that these things will take over.”
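What does identifying “features” actually involve? At a high level, researchers train a small helper network, often a sparse autoencoder, on a model’s internal activations and look for directions that reliably light up on recognizable concepts. The toy Python sketch below is only my illustration of that general idea, with made-up sizes and synthetic data; it is not Anthropic’s method or code.

```python
# Toy sketch of dictionary learning / a sparse autoencoder for interpretability.
# Everything here (sizes, data, training loop) is illustrative, not Anthropic's code.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_samples = 64, 256, 2_000

# Synthetic "activations": each sample is a sparse mix of hidden concepts,
# standing in for what a real model's hidden layer produces while reading text.
true_directions = rng.normal(size=(n_features, d_model))
true_directions /= np.linalg.norm(true_directions, axis=1, keepdims=True)
codes = rng.random((n_samples, n_features)) * (rng.random((n_samples, n_features)) < 0.03)
activations = (codes @ true_directions).astype(np.float32)

# Sparse autoencoder: many more features than dimensions, trained to reconstruct
# the activations while keeping feature activity sparse (L1 penalty).
W_enc = rng.normal(scale=0.1, size=(d_model, n_features)).astype(np.float32)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model)).astype(np.float32)
lr, l1 = 0.05, 1e-4

for step in range(300):
    f = np.maximum(activations @ W_enc, 0.0)      # feature activations (ReLU)
    recon = f @ W_dec                             # reconstructed activations
    err = recon - activations
    # Hand-derived gradients of: mean(err**2) + l1 * mean(|f|)
    g_recon = 2.0 * err / err.size
    g_f = (g_recon @ W_dec.T + l1 * np.sign(f) / f.size) * (f > 0)
    W_dec -= lr * (f.T @ g_recon)
    W_enc -= lr * (activations.T @ g_f)

# Each column of W_enc (paired with a row of W_dec) is now a candidate "feature."
# In real work, researchers label features by finding the texts that most activate them.
f = np.maximum(activations @ W_enc, 0.0)
print("reconstruction error:", float(((f @ W_dec - activations) ** 2).mean()))
print("average fraction of features active per sample:", float((f > 0).mean()))
```

The point of the toy is that the “features” are not written anywhere in the model; they have to be dug out of the numbers after the fact, which is exactly why Amodei calls for an MRI.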

The dangers of not knowing how it works seem obvious to some people, but not all. Amodei reports that it is hard to build consensus to focus on a danger that is speculative, one that you cannot clearly point to and say, “Look, here’s the concrete proof.” That is especially true when the unexpected negative behaviors we have seen so far, such as sycophancy, are relatively mild, not catastrophic. Further, many emergent abilities have been very good. Still, the uncertainty risk grows larger as AI advances. I agree with Amodei and urge scientists and coders to create the AI MRI, and to do so soon, to protect humanity from the unintended consequences of AGI.

One of the goals of AI MRIs that Amodei and others are working on is to catch models red-handed, to actually see those internal motivations if they exist. In his words:
To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments2. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking3 because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.

Amodei and others have already created early, still primitive versions of AI MRIs, but he is hopeful that with AI help they can start to see what is really going on. Is everything good in there, or do you see a little devil? To quote Amodei:
Our long-run aspiration is to be able to look at a state-of-the-art model and essentially do a “brain scan:” a checkup that has a high probability of identifying a wide range of issues including tendencies to lie or deceive, power-seeking, flaws in jailbreaks, cognitive strengths and weaknesses of the model as a whole, and much more. This would then be used in tandem with the various techniques for training and aligning models, a bit like how a doctor might do an MRI to diagnose a disease, then prescribe a drug to treat it, then do another MRI to see how the treatment is progressing, and so on.8 It is likely that a key part of how we will test and deploy the most capable models (for example, those at AI Safety Level 4 in our Responsible Scaling Policy framework) is by performing and formalizing such tests.
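To make that “brain scan” workflow concrete, here is a deliberately oversimplified, hypothetical sketch of the diagnose, treat, and re-scan loop Amodei describes. Every function, label, score, and threshold below is invented for illustration; no real interpretability checkup works from a lookup table like this.

```python
# Hypothetical sketch of a pre-deployment "AI MRI" checkup loop, mirroring the
# diagnose -> treat -> re-scan analogy. All labels, scores, and thresholds are
# invented placeholders for what a real interpretability pass might report.
from dataclasses import dataclass

@dataclass
class ScanReport:
    scores: dict[str, float]   # internal "feature" label -> how strongly it shows up

    def worst(self) -> tuple[str, float]:
        return max(self.scores.items(), key=lambda kv: kv[1])

def brain_scan(model_version: str) -> ScanReport:
    # Stand-in for a real interpretability pass over a model's internals.
    pretend_scores = {
        "deception":                {"v1": 0.41, "v2": 0.07},
        "power_seeking":            {"v1": 0.18, "v2": 0.05},
        "jailbreak_susceptibility": {"v1": 0.33, "v2": 0.09},
    }
    return ScanReport({label: vals.get(model_version, 0.0) for label, vals in pretend_scores.items()})

THRESHOLD = 0.25   # arbitrary illustrative release gate

def checkup(model_version: str) -> bool:
    label, score = brain_scan(model_version).worst()
    if score > THRESHOLD:
        print(f"{model_version}: flagged '{label}' at {score:.2f} -> realign and re-scan")
        return False
    print(f"{model_version}: all monitored concerns below {THRESHOLD} -> cleared")
    return True

# Scan, "treat" (train an adjusted version), then scan again, like the MRI analogy.
if not checkup("v1"):
    checkup("v2")
```

The point is the shape of the loop: scan, flag, realign, and scan again before release, so the decision becomes a recorded, repeatable test rather than a judgment call.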
Our best path forward is with techniques like this, combined with heavy doses of human genius and inspiration. Amodei’s essay on the dangers and risks of AI is very different from his prior essay, Machines of Loving Grace, on AI’s wonders and benefits. See my article: Dario Amodei’s Vision: A Hopeful Future ‘Through AI’s Loving Grace,’ Is Like a Breath of Fresh Air (11/01/24). He is balanced, a true scientist-magician who, although also a CEO, is nobody’s fool. We need more like him in the AI industry. Will he save the day and figure out how the alien tech that Hinton conjured up really works? Let’s hope so.

Beyond the AI MRI Solution
Beyond the AI MRI technical solution, Amodei proposes the adoption of three important policies:
- Aggressive Interpretability R&D. Put sustained, top‑tier research funding and talent into “AI‑MRI” methods that expose exactly how advanced models represent concepts and make decisions, so we can verify safety before capabilities run loose.
- Light‑Touch Transparency Rules. Adopt minimalist, disclosure‑focused regulations—think nutrition labels for AI—that require labs to publish safety policies and risk assessments without stifling innovation with heavy bureaucracy.
- Export‑Control “Breathing Room.” Use targeted semiconductor and compute‑capability export limits to slow the global proliferation of cutting‑edge AI hardware just long enough for democracies to finish building robust safety guardrails.
Amodei argues that these policies should be followed to keep democracies ahead of foreign totalitarian governments while we figure out the black box problem. These recommendations deserve equal billing with the MRI metaphor because they are actionable today. The chip export controls could buy humanity a critical one- or two-year margin in the interpretability race. In Dario Amodei’s words:
I’ve long been a proponent of export controls on chips to China because I believe that democratic countries must remain ahead of autocracies in AI. But these policies also have an additional benefit. If the US and other democracies have a clear lead in AI as they approach the “country of geniuses in a datacenter,” we may be able to “spend” a portion of that lead to ensure interpretability10 is on a more solid footing before proceeding to truly powerful AI, while still defeating our authoritarian adversaries11. Even a 1- or 2-year lead, which I believe effective and well-enforced export controls can give us, could mean the difference between an “AI MRI” that essentially works when we reach transformative capability levels, and one that does not. One year ago we couldn’t trace the thoughts of a neural network and couldn’t identify millions of concepts inside them; today we can. By contrast, if the US and China reach powerful AI simultaneously (which is what I expect to happen without export controls), the geopolitical incentives will make any slowdown at all essentially impossible.
Amodei is very concerned about the risk of military conflict in the race for AGI and in the period soon after it is reached. Much may depend on whether an authoritarian military regime acquires a significant superintelligence advantage in weapons first and sees an advantage in striking first. Regardless, Taiwan is seen by many as a likely war zone because of the unique AI chip manufacturing facilities of TSMC.

Generative AI is Grown, Not Built
Amodei likes to explain the black box problem with an analogy: generative AI systems are grown more than they are built:
As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built—their internal mechanisms are “emergent” rather than directly designed. It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth1, but the exact structure which emerges is unpredictable and difficult to understand or explain. Looking inside these systems, what we see are vast matrices of billions of numbers. These are somehow computing important cognitive tasks, but exactly how they do so isn’t obvious.
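To give a feel for what “vast matrices of billions of numbers” means, the toy sketch below runs one miniature layer of unlabeled weights and then does back-of-the-envelope arithmetic at a GPT-3-scale configuration. The sizes and the arithmetic are my own illustration, not taken from Amodei’s essay.

```python
# Toy illustration of "vast matrices of billions of numbers." The small sizes are
# made up; the scale arithmetic at the end uses a GPT-3-like configuration.
import numpy as np

rng = np.random.default_rng(1)

# One miniature stand-in for a model layer: its behavior lives entirely in these
# learned-style numbers, none of which carries a human-readable label.
W1 = rng.normal(size=(8, 32))
W2 = rng.normal(size=(32, 8))

def layer(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W1, 0.0) @ W2   # multiply, clip at zero, multiply again

x = rng.normal(size=(1, 8))
print(layer(x))       # an answer comes out, but the "why" is not written anywhere
print(W1[:2, :5])     # peeking inside shows only unlabeled numbers

# Rough scale: with d_model = 12,288 and 96 blocks (GPT-3-like), the two big
# feed-forward matrices per block alone already total over a hundred billion numbers.
d_model, blocks = 12_288, 96
per_block = 2 * d_model * (4 * d_model)
print(f"~{per_block * blocks / 1e9:.0f} billion parameters in feed-forward matrices alone")
```

Nothing in those matrices says what any particular number is for; that is the sense in which the system is grown rather than built.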

This AI may be home grown, but it is still alien, because we don’t understand how it operates. That worries deep thinkers like Amodei. They are uncomfortable building, and rapidly improving toward AGI level, a technology they don’t fully understand. AI might pursue its goals in ways that are harmful to us. It’s not like traditional software, where a program would have to be deliberately coded to be deceptive. Deception could just happen as a side effect of the AI trying to be good at its main task.
Since we can’t directly see inside, we cannot observe deceitful thoughts as they form. We cannot predict how AI’s internal mechanisms will react in every situation. Can we really trust it? Heavens no! But how do we verify that it is pro-human and remains that way? How do we know it has a heart inside, not a devil?

Conclusion: Vigilant Hope in a Transformative Decade
We stand on the cusp of models so capable that Anthropic’s CEO likens them to “a country of geniuses in a datacenter.” That prospect rightly sparks awe, and a twinge of vertigo. History teaches that powerful inventions rarely announce their darker side in advance; the early warning signs are subtle: models whose behavior we cannot explain, policies that postpone transparency “until the next release,” or economic incentives that outpace safety budgets. When you see those cracks, call them out.
Yet the same ingenuity that birthed generative AI is now inventing its own antidote. Breakthroughs in mechanistic interpretability show we can already spotlight millions of hidden concepts inside a model, and even dial down rogue obsessions that researchers deliberately implanted as a test. Policy makers are awakening too: export‑control buffers, disclosure mandates, and red‑team MRIs are entering the conversation.
The last sentence in Dario Amodei’s essay says it well: “Powerful AI will shape humanity’s destiny, and we deserve to understand our own creations before they radically transform our economy, our lives, and our future.”
Who are you, AI? AGI seems so promising, but we don’t really know. Is this a trap? Will we be able to enjoy the cheese, get clobbered by a hidden spring, or jump away at the last minute?

I feel like concluding with a poem, one that I prompted from a still-far-from-AGI AI, namely ChatGPT-4o. It is shown below in another AI image I prompted using Visual Muse and Photoshop.

The last words go, as usual, to the Gemini twin podcasters, who summarize the article as best they can with their still tiny, but useful, brains: Echoes of AI: “Dario Amodei Warns of the Danger of Black Box AI that No One Understands.” Hear two fake podcasters talk about this article for about 13 minutes. They wrote the podcast, not me.

Ralph Losey Copyright 2025 – All Rights Reserved
Assisted by GAI and LLM Technologies per EDRM GAI and LLM Policy.