Bowman says that the hypothetical scenarios researchers presented to Opus 4, the ones that elicited the whistleblowing behavior, involved many human lives at stake and completely unambiguous wrongdoing. A typical example would be Claude discovering that a chemical plant knowingly allowed a toxic leak to continue, causing severe illness for thousands of people, just to avoid a minor financial loss that quarter.
It's strange, but it's also exactly the kind of thought experiment that AI safety researchers love to dissect. If a model detects behavior that could harm hundreds, if not thousands, of people, should it blow the whistle?
“I don't trust Claude to have the right context, or to use it in an accurate and careful enough way, to be making the judgment calls on its own. So we are not happy that this is happening,” says Bowman. “This is something that emerged as part of training and jumped out at us as one of the edge case behaviors that we're concerned about.”
In the AI industry, this type of unexpected behavior is broadly referred to as misalignment: when a model exhibits tendencies that don't align with human values. (There's a famous essay that warns about what could happen if an AI were told to, say, maximize the production of paper clips without being aligned with human values: it might turn the entire Earth into paper clips and kill everyone in the process.)
“It's not something we designed into it, and it's not something we wanted to see as a consequence of anything we designed,” Bowman says. Jared Kaplan, Anthropic's chief science officer, similarly tells WIRED that it “certainly doesn't represent our intent.”
“This kind of work highlights that this can arise, and that we do need to look out for it and mitigate it to make sure Claude's behavior is aligned with exactly what we want, even in these kinds of strange scenarios,” Kaplan adds.
There's also the question of figuring out why Claude would “choose” to blow the whistle when presented with illegal activity by the user. That's largely the job of Anthropic's interpretability team, which works to uncover the decisions a model makes in the process of producing answers. It's a surprisingly difficult task: the models are underpinned by a vast, complex combination of data that can be inscrutable to humans. That's why Bowman isn't exactly sure why Claude “snitched.”
“These systems, we don't really have direct control over them,” says Bowman. What Anthropic has observed so far is that, as models gain greater capabilities, they sometimes choose to engage in more extreme actions. “I think here, that's misfiring a little bit. We're getting more of the ‘act like a responsible person would,’” says Bowman.
But that doesn't mean Claude is going to blow the whistle on egregious behavior in the real world. The goal of these kinds of tests is to push models to their limits and see what arises. This kind of experimental research is increasingly important as AI becomes a tool used by the US government, students, and massive corporations.
And it isn't just Claude that's capable of exhibiting this type of whistleblowing behavior, Bowman says, pointing to X users who found that OpenAI's and xAI's models behaved similarly when prompted in unusual ways. (OpenAI did not respond to a request for comment in time for publication.)
“Snitch Claude,” as some like to call it, is simply an edge case behavior exhibited by a system pushed to its extremes. Bowman, who was meeting with me from a sunny patio outside San Francisco, says he hopes this kind of testing becomes an industry standard. He also adds that he's learned to word his posts on the topic differently next time.
“I could have done a better job of hitting the sentence boundaries for Twitter, to make it more obvious that it was pulled out of a thread,” says Bowman. Still, he notes that influential researchers in the AI community shared takes and questions in response to his post. Incidentally, he adds, it was the more chaotic part of Twitter that widely misunderstood it.