
AI black boxes just got a little less mysterious | GuyWhoKnowsThings


One of the strangest and most baffling things about today's major AI systems is that no one (not even the people who build them) really knows how they work.

This is because large language models, the type of artificial intelligence systems that power ChatGPT and other popular chatbots, are not programmed line by line by human engineers, as conventional computer programs are.

Instead, these systems essentially learn on their own, ingesting massive amounts of data and identifying patterns and relationships in language, and then using that knowledge to predict the next words in a sequence.
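To make that idea concrete, here is a minimal, purely illustrative sketch of next-word prediction: a model assigns a score to every candidate word and converts those scores into probabilities, then samples from them. The tiny vocabulary and the scores below are invented for the example; a real model works over tens of thousands of tokens and billions of learned parameters.

# A toy sketch (not any production system) of the core idea behind a
# language model: score every word in a small vocabulary, turn the scores
# into probabilities, and treat those as the odds of each next word.
import math

vocab = ["Tokyo", "New Orleans", "Chicago", "pizza"]   # made-up vocabulary
logits = [2.1, 3.5, 3.2, 0.4]                          # hypothetical scores from a trained model

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for word, p in sorted(zip(vocab, probs), key=lambda x: -x[1]):
    print(f"{word}: {p:.2f}")   # the model samples its next word from this distribution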

One consequence of building AI systems this way is that it is difficult to reverse engineer them or troubleshoot problems by identifying specific bugs in the code. Right now, if a user types “Which American city has the best food?” and a chatbot responds with “Tokyo,” there's no real way to understand why the model made that mistake, or why the next person who asks may receive a different answer.

And when large language models misbehave or go off the rails, no one can really explain why. (I ran into this problem last year, when a Bing chatbot behaved erratically during an interaction with me, and not even senior executives at Microsoft could tell me with certainty what had gone wrong.)

The inscrutability of large language models is not just an annoyance; it is one of the main reasons some researchers fear that powerful AI systems could eventually become a threat to humanity.

After all, if we can't understand what happens inside these models, how will we know if they can be used to create new biological weapons, spread political propaganda, or write malicious computer code for cyber attacks? If powerful AI systems start disobeying or deceiving us, how can we stop them if we can't understand what is causing that behavior in the first place?

To address these issues, a small subfield of AI research known as “mechanistic interpretability” has spent years trying to peer into the guts of AI language models. Work has been slow and progress has been incremental.

There has also been growing tension over how seriously AI companies are taking those risks. Last week, two senior safety researchers at OpenAI, the maker of ChatGPT, left the company amid a conflict with executives over whether it was doing enough to make its products safe.

But this week, a team of researchers at the AI company Anthropic announced what they characterized as a breakthrough, one they hope will make it possible to better understand how AI language models actually work and perhaps to prevent them from becoming harmful.

The team summarized their findings this week in a blog post called “Mapping the Mind of a Large Language Model.”

The researchers looked inside one of Anthropic's AI models (Claude 3 Sonnet, a version of the company's Claude 3 language model) and used a technique known as “dictionary learning” to uncover patterns in how combinations of neurons, the mathematical units inside the AI model, were activated when Claude was prompted to talk about certain topics. They identified roughly 10 million of these patterns, which they call “features.”
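Anthropic's post describes the technique as dictionary learning. One common way researchers implement that idea is to train a small "sparse autoencoder" on activations recorded from the language model, so that each learned direction (a "feature") tends to fire for a recognizable concept. The sketch below is a simplified, hypothetical illustration in PyTorch; the class name, layer sizes, random stand-in data, and loss weights are assumptions made for the example, not Anthropic's actual code.

# Simplified sketch of dictionary learning over model activations, in the
# spirit of a sparse autoencoder. Everything here is illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, n_features)  # activations -> feature strengths
        self.decoder = nn.Linear(n_features, activation_dim)  # features -> reconstructed activations

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))      # non-negative, mostly-zero "features"
        reconstruction = self.decoder(features)
        return features, reconstruction

# Toy training loop on random stand-in data; real work uses activations
# recorded from the language model while it reads large amounts of text.
model = SparseAutoencoder(activation_dim=512, n_features=4096)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
activations = torch.randn(1024, 512)

for step in range(200):
    features, reconstruction = model(activations)
    reconstruction_loss = ((reconstruction - activations) ** 2).mean()
    sparsity_loss = features.abs().mean()                      # L1 penalty keeps most features off
    loss = reconstruction_loss + 1e-3 * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()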

They found that one feature, for example, was active whenever Claude was asked to talk about San Francisco. Other features were active whenever topics such as immunology or specific scientific terms, such as the chemical element lithium, were mentioned. And some features were linked to more abstract concepts, such as deception or gender bias.

They also found that manually turning certain features on or off could change how the AI system behaved, or could even cause the system to break its own rules.

For example, they found that if they forced a feature linked to the concept of flattery to activate more strongly, Claude would respond with flowery, exaggerated praise for the user, even in situations where flattery was inappropriate.
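Conceptually, "turning a feature up" amounts to clamping one learned feature to a high value and decoding the result back into the model's activation space. The snippet below reuses the hypothetical autoencoder from the earlier sketch purely to illustrate that idea; the feature index and scale are invented, and this is not Anthropic's actual steering code.

# Hypothetical illustration of amplifying one learned feature and decoding
# the result back into activations that could be patched into the model.
import torch

def steer(activations, sae, feature_index, scale=10.0):
    with torch.no_grad():
        features, _ = sae(activations)
        features[:, feature_index] = scale     # clamp the chosen feature to a high value
        return sae.decoder(features)           # decoded activations to feed back into the model

# Example usage with the toy objects defined above (index chosen arbitrarily):
# steered = steer(activations, model, feature_index=1234)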

Chris Olah, who led the Anthropic interpretability research team, said in an interview that these findings could allow AI companies to control their models more effectively.

“We are uncovering features that may shed light on concerns about bias, security risks and autonomy,” he said. “I'm really excited that we can turn these controversial issues that people argue about into things that we can actually have more productive discourse about.”

Other researchers have found similar phenomena in small and medium-sized language models. But the Anthropic team is among the first to apply these techniques to a full-size model.

Jacob Andreas, an associate professor of computer science at MIT, who reviewed a summary of the Anthropic research, characterized it as a hopeful sign that large-scale interpretability might be possible.

“In the same way that understanding basic aspects of how people work has helped us cure diseases, understanding how these models work will allow us to recognize when things are about to go wrong and create better tools to control them,” he said.

Olah, who led the Anthropic interpretability research, cautioned that while the new findings represent important progress, AI interpretability is still far from a solved problem.

For starters, he said, the largest AI models likely contain billions of features representing different concepts, many more than the roughly 10 million features the Anthropic team claims to have discovered. Finding them all would require enormous amounts of computing power and would be too expensive for all but the richest AI companies.

Even if researchers identified every feature in a large AI model, they would still need more information to understand the entire inner workings of the model. There is also no guarantee that AI companies will act to make their systems more secure.

Still, Olah said, even opening up these AI black boxes a little could allow companies, regulators and the general public to feel more confident that these systems can be controlled.

“There are many other challenges ahead, but what seemed most terrifying no longer seems like an obstacle,” he said.

