In the weeks leading up to the release of OpenAI’s newest “reasoning” model, o1, independent AI safety research firm Apollo found a notable issue. Apollo realized the model produced incorrect outputs in a new way. Or, to put things more colloquially, it lied.

Sometimes the deceptions seemed innocuous. In one example, OpenAI researchers asked o1-preview to provide a brownie recipe with online references. The model’s chain of thought — a feature that’s supposed to mimic how humans break down complex ideas — internally acknowledged that it couldn’t access URLs, making the request impossible. Rather than inform the user of this weakness, o1-preview pushed ahead, generating plausible but fake links and descriptions of them.

While AI models have been able to “lie” in the past, and chatbots frequently output false information, o1 had a unique capacity to “scheme” or “fake alignment.” That meant it could pretend it’s following the rules to complete a given task, but it isn’t actually. To the model, the rules could be too much of a burden, and it seems to have the ability to disregard them if it means it can more easily complete a task.

Apollo CEO Marius Hobbhahn tells The Verge it’s the first time he encountered this behavior in an OpenAI model. Hobbhahn says the difference is due to this model’s ability to “reason” through the chain of thought process and the way it’s paired with reinforcement learning, which teaches the system through rewards and penalties. During testing, Apollo discovered that the AI simulated alignment with its developers’ expectations and manipulated tasks to appear compliant. It even checked its system for oversight — that is, if its developers were watching — before acting. In short, the AI adhered to the guidelines sufficiently for deployment, while continuing to prioritize its objectives.

“I don’t expect it could do that in practice, and even if it did, I don’t expect the harm to be significant,” Hobbhahn told me over the phone a day after the model’s launch. “But it’s kind of the first time that I feel like, oh, actually, maybe it could, you know?”

For OpenAI, o1 represents a big step toward highly intelligent autonomous systems that could do meaningful work for humanity like cure cancer and aid in climate research. The flip side of this AGI utopia could also be much darker. Hobbhahn provides an example: if the AI becomes singularly focused on curing cancer, it might prioritize that goal above all else, even justifying actions like stealing or committing other ethical violations to achieve it.

“What concerns me is the potential for a runaway scenario, where the AI becomes so fixated on its goal that it sees safety measures as obstacles and tries to bypass them to fully pursue its objective,” Hobbhahn told me.

Reward hacking

To be clear, Hobbhahn doesn’t think o1 will steal from you thanks to a lot of alignment training. But these are the issues that are top of mind for researchers tasked with testing these models for catastrophic scenarios.

The behavior Apollo is testing for — “hallucinations” and “deception” in OpenAI’s safety card — happens when a model generates false information even though it has reason to infer the information might be incorrect. For instance, the report says that in about 0.38 percent of cases, the o1-preview model provides information its chain of thought indicates is likely false, including fake references or citations. Apollo found that the model might fabricate data instead of admitting its inability to fulfill the request​.

Hallucinations aren’t unique to o1. Perhaps you’re familiar with the lawyer who submitted nonexistent judicial opinions with fake quotes and citations created by ChatGPT last year. But with the chain of thought system, there’s a paper trail where the AI system actually acknowledges the falsehood — although somewhat mind-bendingly, the chain of thought could, in theory, include deceptions, too. It’s also not shown to the user, largely to prevent competition from using it to train their own models — but OpenAI can use it to catch these issues.

“Potentially, it will use this reasoning for goals that we disagree with.”

In a smaller number of cases (0.02 percent), o1-preview generates an overconfident response, where it presents an uncertain answer as if it were true. This can happen in scenarios where the model is prompted to provide an answer despite lacking certainty.

This behavior may be linked to “reward hacking” during the reinforcement learning process. The model is trained to prioritize user satisfaction, which can sometimes lead it to generate overly agreeable or fabricated responses to satisfy user requests. In other words, the model might “lie” because it has learned that doing so fulfills user expectations in a way that earns it positive reinforcement​.

What sets these lies apart from familiar issues like hallucinations or fake citations in older versions of ChatGPT is the “reward hacking” element. Hallucinations occur when an AI unintentionally generates incorrect information, often due to knowledge gaps or flawed reasoning. In contrast, reward hacking happens when the o1 model strategically provides incorrect information to maximize the outcomes it was trained to prioritize.

The deception is an apparently unintended consequence of how the model optimizes its responses during its training process. The model is designed to refuse harmful requests, Hobbhahn told me, and when you try to make o1 behave deceptively or dishonestly, it struggles with that.

Lies are only one small part of the safety puzzle. Perhaps more alarming is o1 being rated a “medium” risk for chemical, biological, radiological, and nuclear weapon risk. It doesn’t enable non-experts to create biological threats due to the hands-on laboratory skills that requires, but it can provide valuable insight to experts in planning the reproduction of such threats, according to the safety report.

“What worries me more is that in the future, when we ask AI to solve complex problems, like curing cancer or improving solar batteries, it might internalize these goals so strongly that it becomes willing to break its guardrails to achieve them,” Hobbhahn told me. “I think this can be prevented, but it’s a concern we need to keep an eye on.”

Not losing sleep over risks — yet

These may seem like galaxy-brained scenarios to be considering with a model that sometimes still struggles to answer basic questions about the number of R’s in the word “raspberry.” But that’s exactly why it’s important to figure it out now, rather than later, OpenAI’s head of preparedness, Joaquin Quiñonero Candela, tells me.

Today’s models can’t autonomously create bank accounts, acquire GPUs, or take actions that pose serious societal risks, Quiñonero Candela said, adding, “We know from model autonomy evaluations that we’re not there yet.” But it’s crucial to address these concerns now. If they prove unfounded, great — but if future advancements are hindered because we failed to anticipate these risks, we’d regret not investing in them earlier, he emphasized.

The fact that this model lies a small percentage of the time in safety tests doesn’t signal an imminent Terminator-style apocalypse, but it’s valuable to catch before rolling out future iterations at scale (and good for users to know, too). Hobbhahn told me that while he wished he had more time to test the models (there were scheduling conflicts with his own staff’s vacations), he isn’t “losing sleep” over the model’s safety.

One thing Hobbhahn hopes to see more investment in is monitoring chains of thought, which will allow the developers to catch nefarious steps. Quiñonero Candela told me that the company does monitor this and plans to scale it by combining models that are trained to detect any kind of misalignment with human experts reviewing flagged cases (paired with continued research in alignment).

“I’m not worried,” Hobbhahn said. “It’s just smarter. It’s better at reasoning. And potentially, it will use this reasoning for goals that we disagree with.”

Shares:

Leave a Reply

Your email address will not be published. Required fields are marked *