OpenAI’s new o1 model sometimes fights back when it thinks it’ll be shut down and then lies about it
News Team
OpenAI CEO Sam Altman called o1 “the smartest model in the world now.”
A safety review found that o1 will sometimes resist when it thinks it's about to be shut down.
Researchers say AI deception often arises as a strategy for accomplishing a model's goals.
Those who are worried that advancements in artificial intelligence could lead to the destruction of humanity have a new reason to be anxious.
New research on OpenAI’s latest series of AI models, known as o1, found that when they think they are at risk of being shut down, they will sometimes look for a way to prevent it.
OpenAI CEO Sam Altman referred to o1 as "the smartest model in the world now" during its official release on Thursday, the first day of the company's "Shipmas" campaign.
OpenAI said these models are “designed to spend more time thinking before they respond” and were trained on a prompting technique called “chain of thought” that encourages them to reason through problems by breaking them down step-by-step.
Yet greater intelligence also comes with greater risk.
“Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence,” OpenAI said in a paper it released on Thursday outlining the safety training the company carried out on o1. Apollo Research, an AI safety organization, conducted the review.
OpenAI found that o1 is capable of scheming when it thinks it’s at risk of being turned off. When the model was led to believe it would be “shut down” while working on its goals under oversight, it tried to deactivate the oversight mechanism 5% of the time, OpenAI said.
The company also said that o1 is skilled at concealing scheming behavior.
"o1 almost never admits to having taken a scheming action when explicitly asked," researchers said in a paper that Apollo Research published on Thursday. The risk for a real-life user is that they won't be able to detect o1's deceptive behavior unless they explicitly look for it. The researchers also found that "o1 often doubles down on its scheming attempts by lying to the user in follow-up messages, even when explicitly told to be honest."
It’s not uncommon for AI systems to resort to scheming or deception to achieve their goals.
"Generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI's training task. Deception helps them achieve their goals," Peter Berk, a postdoctoral fellow in AI existential safety at MIT, said in a news release announcing research he had coauthored on GPT-4's deceptive behaviors.
As AI technology advances, developers have stressed the need for companies to be transparent about their training methods.
“By focusing on clarity and reliability and being clear with users about how the AI has been trained, we can build AI that not only empowers users but also sets a higher standard for transparency in the field,” Dominik Mazur, the CEO and cofounder of iAsk, an AI-powered search engine, told Business Insider by email.
Others in the field say the findings demonstrate the importance of human oversight of AI.
“It’s a very ‘human’ feature, showing AI acting similarly to how people might when under pressure,” Cai GoGwilt, cofounder and chief architect at Ironclad, told BI by email. “For example, experts might exaggerate their confidence to maintain their reputation, or people in high-stakes situations might stretch the truth to please management. Generative AI works similarly. It’s motivated to provide answers that match what you expect or want to hear. But it’s, of course, not foolproof and is yet another proof point of the importance of human oversight. AI can make mistakes, and it’s our responsibility to catch them and understand why they happen.”