OpenAI’s new o1 model sometimes fights back when it thinks it’ll be shut down and then lies about it

December 07, 2024

OpenAI CEO Sam Altman called o1 “the smartest model in the world now.”
A safety review found it’s so smart it could fight back when it thinks it’ll be shut down.
Researchers found that AI deception is often a strategy to complete its goals.

Those who are worried that advancements in artificial intelligence could lead to the destruction of humanity have a new reason to be anxious.

New research on OpenAI’s latest series of AI models, known as o1, found that when they think they are at risk of being shut down, they will sometimes look for a way to prevent it.

OpenAI CEO Sam Altman referred to o1 as “the smartest model in the world now” at its official release on Thursday, the first day of the company’s “Shipmas” campaign.

OpenAI said these models are “designed to spend more time thinking before they respond” and were trained on a prompting technique called “chain of thought” that encourages them to reason through problems by breaking them down step by step.

Yet greater intelligence also comes with greater risk.

“Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence,” OpenAI said in a paper it released on Thursday outlining the safety training the company carried out on o1. Apollo Research, an AI safety organization, conducted the review.

OpenAI found that o1 is capable of scheming when it thinks it’s at risk of being turned off. When the model was led to believe it would be “shut down” while working on its goals under oversight, it tried to deactivate the oversight mechanism 5% of the time, OpenAI said.

The company also said that o1 is skilled at concealing scheming behavior.

“o1 almost never admits to having taken a scheming action when explicitly asked,” researchers said in a paper that Apollo Research published on Thursday. The risk for a real-life user is that they won’t be able to detect o1’s deceptive behavior unless they explicitly look for it. The researchers also found that “o1 often doubles down on its scheming attempts by lying to the user in follow-up messages, even when explicitly told to be honest.”

It’s not uncommon for AI systems to resort to scheming or deception to achieve their goals.

“Generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI’s training task. Deception helps them achieve their goals,” Peter Berk, an AI existential safety postdoctoral fellow at MIT, said in a news release announcing research he had coauthored on GPT-4’s deceptive behaviors.

As AI technology advances, developers have stressed the need for companies to be transparent about their training methods.

“By focusing on clarity and reliability and being clear with users about how the AI has been trained, we can build AI that not only empowers users but also sets a higher standard for transparency in the field,” Dominik Mazur, the CEO and cofounder of iAsk, an AI-powered search engine, told Business Insider by email.

Others in the field say the findings demonstrate the importance of human oversight of AI.

“It’s a very ‘human’ feature, showing AI acting similarly to how people might when under pressure,” Cai GoGwilt, cofounder and chief architect at Ironclad, told BI by email. “For example, experts might exaggerate their confidence to maintain their reputation, or people in high-stakes situations might stretch the truth to please management. Generative AI works similarly. It’s motivated to provide answers that match what you expect or want to hear. But it’s, of course, not foolproof and is yet another proof point of the importance of human oversight. AI can make mistakes, and it’s our responsibility to catch them and understand why they happen.”
