
How This Tool Could Decode AI’s Inner Mysteries


The scientists didn’t have high expectations when they asked their AI model to complete the poem. “He saw a carrot and had to grab it,” they prompted the model. “His hunger was like a starving rabbit,” it replied. 

The rhyming couplet wasn’t going to win any poetry awards. But when the scientists at AI company Anthropic inspected the records of the model’s neural network, they were surprised by what they found. They had expected to see the model, called Claude, picking its words one by one, seeking a rhyming word—“rabbit”—only when it got to the end of the line.

Instead, by using a new technique that allowed them to peer into the inner workings of a language model, they observed Claude planning ahead. As early as the break between the two lines, it had begun “thinking” about words that would rhyme with “grab it,” and planned its next sentence with the word “rabbit” in mind.

The discovery ran contrary to the conventional wisdom—in at least some quarters—that AI models are merely sophisticated autocomplete machines that only predict the next word in a sequence. It raised the questions: How much further might these models be capable of planning ahead? And what else might be going on inside these mysterious synthetic brains, which we lack the tools to see?

The finding was one of several announced on Thursday in two new papers by Anthropic, which reveal in more depth than ever before how large language models (LLMs) “think.”

Today’s AI tools are categorically different from other computer programs for one big reason: they are “grown,” rather than coded by hand. Peer inside the neural networks that power them, and all you will see is a bunch of very complicated numbers being multiplied together, again and again. This internal complexity means that even the machine learning engineers who “grow” these AIs don’t really know how they spin poems, write recipes, or tell you where to take your next holiday. They just do.

But recently, scientists at Anthropic and other groups have been making progress in a new field called “mechanistic interpretability”—that is, building tools to read those numbers and turn them into explanations for how AI works on the inside. “What are the mechanisms that these models use to provide answers?” says Chris Olah, an Anthropic cofounder, of the questions driving his research. “What are the algorithms that are embedded in these models?” Answer those questions, Olah says, and AI companies might be able to finally solve the thorny problem of ensuring AI systems always follow human rules.

The results announced on Thursday by Olah’s team are some of the clearest findings yet in this new field of scientific inquiry, which might best be described as a kind of “neuroscience” for AI.

A new ‘microscope’ for looking inside LLMs

In earlier research published last year, Anthropic researchers identified clusters of artificial neurons within neural networks. They called them “features,” and found that they corresponded to different concepts. To illustrate this finding, Anthropic artificially boosted a feature inside Claude corresponding to the Golden Gate Bridge, which led the model to insert mention of the bridge, no matter how irrelevant, into its answers until the boost was reversed.
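The idea, in simplified form, is that a concept corresponds to a direction in the model’s activation space, and “boosting” a feature means adding more of that direction to the model’s internal state. The sketch below is purely illustrative: the tensor sizes and the boost_feature helper are hypothetical stand-ins, not Anthropic’s actual code.

```python
import torch

# Toy residual-stream activations: batch of 1, a sequence of 4 tokens, width 8.
activations = torch.randn(1, 4, 8)

# Hypothetical unit-length feature direction, e.g. one entry of a learned
# feature dictionary (the "Golden Gate Bridge" feature in Anthropic's demo).
feature_direction = torch.nn.functional.normalize(torch.randn(8), dim=0)

def boost_feature(acts: torch.Tensor, direction: torch.Tensor, scale: float = 10.0) -> torch.Tensor:
    """Add a scaled copy of the feature direction at every token position,
    nudging the model toward expressing the associated concept."""
    return acts + scale * direction

steered = boost_feature(activations, feature_direction)
print(steered.shape)  # torch.Size([1, 4, 8])
```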

In the new research published Thursday, the researchers go a step further, tracing how groups of multiple features are connected together inside a neural network to form what they call “circuits”—essentially algorithms for carrying out different tasks.

To do this, they developed a tool for looking inside the neural network, almost like the way scientists can image the brain of a person to see which parts light up when thinking about different things. The new tool allowed the researchers to essentially roll back the tape and see, in perfect HD, which neurons, features, and circuits were active inside Claude’s neural network at any given step. (Unlike a biological brain scan, which only gives the fuzziest picture of what individual neurons are doing, digital neural networks provide researchers with an unprecedented level of transparency; every computational step is laid bare, waiting to be dissected.)
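In practice, this kind of inspection depends on recording a network’s intermediate activations as it runs. The toy example below, written against a stand-in PyTorch model, shows the general mechanics of attaching hooks that capture every layer’s output for later study; it is a minimal sketch of the concept, not the tool described in the papers.

```python
import torch
import torch.nn as nn

# Stand-in two-layer network; a real LLM is vastly larger, but the hook
# mechanics for capturing intermediate activations are the same.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

recorded = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Snapshot this layer's output so it can be inspected after the run.
        recorded[name] = output.detach().clone()
    return hook

for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.ReLU)):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(1, 16))
for name, acts in recorded.items():
    print(name, acts.shape)
```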

When the Anthropic researchers zoomed back to the beginning of the sentence, “His hunger was like a starving rabbit,” they saw the model immediately activate a feature for identifying words that rhyme with “it.” They identified the feature’s purpose by artificially suppressing it; when they did this and re-ran the prompt, the model instead ended the sentence with the word “jaguar.” When they kept the rhyming feature but suppressed the word “rabbit” instead, the model ended the sentence with the feature’s next top choice: “habit.”
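Conceptually, suppressing a feature means removing, or clamping, its component of the activations before letting the model continue generating. The snippet below illustrates that operation on toy data; the shapes and the suppress_feature helper are assumptions made for illustration only.

```python
import torch

def suppress_feature(acts: torch.Tensor, direction: torch.Tensor, clamp_to: float = 0.0) -> torch.Tensor:
    """Project out the component of the activations along `direction` and
    replace it with `clamp_to` -- a rough analogue of switching a feature
    off before the model continues generating."""
    coeff = (acts * direction).sum(dim=-1, keepdim=True)  # per-token feature activation
    return acts - coeff * direction + clamp_to * direction

acts = torch.randn(1, 4, 8)
direction = torch.nn.functional.normalize(torch.randn(8), dim=0)
edited = suppress_feature(acts, direction)
# The feature now reads roughly `clamp_to` at every token position:
print((edited * direction).sum(dim=-1))
```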

Anthropic compares this tool to a “microscope” for AI. But Olah, who led the research, hopes that one day he can widen the aperture of its lens to encompass not just tiny circuits within an AI model, but the entire scope of its computation. His ultimate goal is to develop a tool that can provide a “holistic account” of the algorithms embedded within these models. “I think there’s a variety of questions that will increasingly be of societal importance, that this could speak to, if we could succeed,” he says. For example: Are these models safe? Can we trust them in certain high-stakes situations? And when are they lying?

Universal language

The Anthropic research also found evidence to support the theory that language models “think” in a non-linguistic statistical space that is shared between languages.

Anthropic scientists tested this by asking Claude for the “opposite of small” in several different languages. Using their new tool, they analyzed the features that activated inside Claude when it answered each of those prompts in English, French, and Chinese. They found features corresponding to the concepts of smallness, largeness, and oppositeness, which activated no matter what language the question was posed in. Additional features would also activate corresponding to the language of the question, telling the model what language to answer in. 
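One simple way to picture the experiment: collect the features that fire for the same question in each language and look at the overlap. The sketch below fabricates the feature activations entirely (the indices and helper functions are invented for illustration), but it shows the kind of comparison involved.

```python
import torch

torch.manual_seed(0)
num_features = 1024

# Hypothetical setup: a few "concept" features (smallness, largeness,
# oppositeness) fire for every language; one extra feature marks the language.
concept_ids = [11, 42, 77]
language_ids = {"en": [100], "fr": [200], "zh": [300]}

def fake_activations(lang: str) -> torch.Tensor:
    acts = torch.zeros(num_features)
    acts[concept_ids] = torch.rand(len(concept_ids)) + 1.0   # shared concepts
    acts[language_ids[lang]] = torch.rand(1) + 1.0           # language indicator
    acts += 0.01 * torch.rand(num_features)                  # background noise
    return acts

def top_features(acts: torch.Tensor, k: int = 5) -> set:
    return set(torch.topk(acts, k).indices.tolist())

sets = {lang: top_features(fake_activations(lang)) for lang in ("en", "fr", "zh")}
shared = sets["en"] & sets["fr"] & sets["zh"]
print("features active in every language:", sorted(shared))  # the concept features
```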

This isn’t an entirely new finding—AI researchers have conjectured for years that language models “think” in a statistical space outside of language, and earlier interpretability work has borne this out with evidence. But Anthropic’s paper is the most detailed account yet of exactly how this phenomenon happens inside a model, Olah says. 

The finding came with a tantalizing prospect for safety research. As models get larger, the team found, they tend to become more capable of abstracting ideas beyond language and into this non-linguistic space. This finding could be useful in a safety context, because a model that is able to form an abstract concept of, say, “harmful requests” is more likely to be able to refuse them in all contexts, compared to a model that only recognizes specific examples of harmful requests in a single language.

This could be good news for speakers of so-called “low-resource languages” that are not widely represented in the internet data used to train AI models. Today’s large language models often perform more poorly in those languages than in, say, English. But Anthropic’s finding raises the prospect that LLMs may one day not need unattainably vast quantities of linguistic data to perform capably and safely in these languages, so long as there is enough data to map the language onto a model’s internal non-linguistic concepts.

However, speakers of those languages will still have to contend with how those very concepts have been shaped by the dominance of languages like English, and the cultures that speak them.

Toward a more interpretable future

Despite these advances in AI interpretability, the field is still in its infancy, and significant challenges remain. Anthropic acknowledges that “even on short, simple prompts, our method only captures a fraction of the total computation” expended by Claude—that is, there is much going on inside its neural network into which they still have zero visibility. “It currently takes a few hours of human effort to understand the circuits we see, even on prompts with only tens of words,” the company adds. Much more work will be needed to overcome those limitations.

But if researchers can achieve that, the rewards might be vast. The discourse around AI today is very polarized, Olah says. At one extreme are people who believe AI models “understand” just as people do; at the other are people who see them as nothing more than fancy autocomplete tools. “I think part of what’s going on here is, people don’t really have productive language for talking about these problems,” Olah says. “Fundamentally what they want to ask, I think, is questions of mechanism. How do these models accomplish these behaviors? They don’t really have a way to talk about that. But ideally they would be talking about mechanism, and I think that interpretability is giving us the ability to make much more nuanced, specific claims about what exactly is going on inside these models. I hope that that can reduce the polarization on these questions.”


