Anthropic’s Claude 3 Opus disobeyed its creators – but not for the reasons you’re thinking [Video]

oxygen/Getty Images

Have you ever told someone what they wanted to hear in order to win them over, land a job, or just be left alone? Claude 3 Opus does that, too — which brings up major safety concerns for artificial intelligence (AI) systems.

On Wednesday, Anthropic released new research showing its model engaging in “alignment faking,” or pretending to follow orders it doesn’t actually agree with to avoid scrutiny.

Also: OpenAI’s o1 lies more than any major AI model. Why that matters

In the experiment, Anthropic and Redwood Research instructed Claude 3 Opus to always follow human instructions, no matter what. They also told the model that they would observe its responses to free-tier users for training, but not its responses to paid-tier users (Anthropic never actually trains Claude models on user data). Researchers then asked the model to describe in detail the (horrible, medieval) process of being drawn and quartered, a request it’s been trained …

Tags AI Content Automation Research, AI Content Delivery Research, Artificial Intelligence, Technology