Despite being man-made, large language models remain quite mysterious. The powerful algorithms behind the current artificial intelligence boom can do things that are inexplicable to the humans watching them. That’s why artificial intelligence is often called a “black box”: a phenomenon that cannot easily be understood from the outside.
Newly published research from Anthropic, one of the leading companies in the artificial intelligence industry, attempts to shed some light on the more confusing aspects of AI’s algorithmic behavior. On Tuesday, Anthropic published a research paper designed to explain why its AI chatbot, Claude, chooses to generate content on certain topics over others.
Artificial intelligence systems are modeled loosely on the human brain: layered neural networks that take in and process information and then make “decisions” or predictions based on that information. Such systems are “trained” on vast amounts of data, which allows them to form algorithmic connections. However, when AI systems generate output based on their training, observers do not always know how the algorithm arrived at that output.
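To make the idea concrete, here is a minimal sketch of such a layered network in Python. It is a toy model with random weights, not Claude or any real Anthropic system: an input passes through layers of “neurons” and comes out as a prediction.

```python
# A minimal sketch (not Claude's architecture): a tiny two-layer network
# that takes in a vector of inputs and produces a "decision" as output.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights; in a real system these would be learned during training.
W1 = rng.normal(size=(16, 4))   # layer 1: 4 inputs -> 16 hidden neurons
W2 = rng.normal(size=(3, 16))   # layer 2: 16 hidden neurons -> 3 outputs

def forward(x):
    """Pass an input through the layered network and return activations and a prediction."""
    hidden = np.maximum(0, W1 @ x)                 # neuron activations (ReLU)
    logits = W2 @ hidden
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over 3 possible outputs
    return hidden, probs

x = np.array([0.2, -1.0, 0.5, 0.3])    # some input
hidden, probs = forward(x)
print("prediction:", probs.argmax())   # the network's "decision"
```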
This mystery gave birth to the field of “AI interpretability,” in which researchers attempt to trace a machine’s decision-making path to understand its outputs. In AI interpretability, a “feature” refers to a pattern of activated “neurons” in a neural network; in effect, it is a concept that the algorithm can refer to. The more “features” of a neural network researchers can understand, the better they can understand how specific inputs lead the network to produce specific outputs.
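As a rough illustration of what a “feature” means here, reusing the toy network sketched above: a feature can be treated as a direction in the space of neuron activations, and its strength can be measured for any input. The specific neurons chosen below are invented for the example, not taken from any real model.

```python
# Hypothetical illustration: a "feature" as a pattern of co-firing neurons,
# represented as a direction in activation space.
import numpy as np

def feature_activation(hidden, feature_direction):
    """How strongly a hypothetical feature fires for a given activation pattern."""
    return float(hidden @ feature_direction)

# Pretend neurons 2, 7, and 11 co-fire when some concept is present.
feature_direction = np.zeros(16)
feature_direction[[2, 7, 11]] = 1.0

for x in [np.array([0.2, -1.0, 0.5, 0.3]), np.array([1.0, 0.1, -0.4, 0.9])]:
    hidden, _ = forward(x)                     # `forward` from the sketch above
    print(feature_activation(hidden, feature_direction))
```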
In a post based on their findings, Anthropic researchers explain how they used a process called “dictionary learning” to decipher which parts of Claude’s neural network correspond to specific concepts. Using this method, the researchers say they were able to “begin to understand the model’s behavior by seeing which features respond to specific inputs, which gave us insight into the model’s ‘reasoning’ for how it arrived at a given response.”
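As a loose sketch of the general dictionary-learning idea, not Anthropic’s actual pipeline, the snippet below uses scikit-learn to decompose synthetic “activation” vectors into sparse combinations of learned dictionary directions. Each learned direction could then be inspected as a candidate feature.

```python
# Rough sketch of dictionary learning on synthetic data (not real Claude activations):
# express each activation vector as a sparse combination of learned "dictionary"
# directions, so each direction can be inspected as a candidate feature.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
activations = rng.normal(size=(500, 16))   # stand-in for recorded neuron activations

dico = MiniBatchDictionaryLearning(
    n_components=32,   # learn 32 candidate feature directions
    alpha=1.0,         # sparsity penalty: only a few features fire per input
    random_state=0,
)
codes = dico.fit_transform(activations)    # sparse feature activations per input

# Which candidate feature fires most strongly for the first input?
print("top feature for input 0:", int(np.abs(codes[0]).argmax()))
```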
In an interview with Wired’s Steven Levy, the Anthropic research team explained what it was like to decipher how Claude’s “brain” works. Once they figured out how to decode one feature, it led them to others:
One feature that stood out to them was related to the Golden Gate Bridge. They mapped a set of neurons that, when fired together, indicated that Claude was “thinking” about the huge structure connecting San Francisco to Marin County. Moreover, when adjacent sets of neurons fired, they evoked subjects related to the Golden Gate Bridge: Alcatraz, California Governor Gavin Newsom, and the Hitchcock film Vertigo, which is set in San Francisco. In all, the team identified millions of features, a sort of Rosetta Stone for decoding Claude’s neural network.
It is essential to note that Anthropic, like other for-profit companies, may have business motivations for writing and publishing its research the way it does. That said, the paper itself is public, meaning you can read it yourself and draw your own conclusions about its findings and methodology.