A new prompt-injection technique could allow anyone to bypass the safety guardrails in OpenAI’s most advanced language learning model (LLM).
GPT-4o, released May 13, is faster, more efficient, and more multifunctional than any of the previous models underpinning ChatGPT. It can process multiple different forms of input data in dozens of languages, then spit out a response in milliseconds. It can engage in real-time conversations, analyze live camera feeds, and maintain an understanding of context over extended conversations with users. When it comes to user-generated content management, however, GPT-4o is in some ways still archaic.
Marco Figueroa, generative AI (GenAI) bug-bounty programs manager at Mozilla, demonstrated in a new report how bad actors can leverage the power of GPT-4o while skipping over its guardrails. The key is to essentially distract the model by encoding malicious instructions in an unorthodox format, and spread them out in distinct steps.
Tricking ChatGPT Into Writing Exploit Code
To prevent malicious abuse, GPT-4o analyzes user inputs for any signs of bad language, instructions with ill intent, etc.
But at the end of the day, Figueroa says, “It’s just word filters. That’s what I’ve seen through experience, and we know exactly how to bypass these filters.”
For example, he says, “We can modify how something’s spelled out — break it up in certain ways — and the LLM interprets it.” GPT-4o might not reject a malicious instruction if it’s presented with a spelling or phrasing that doesn’t accord with typical natural language.
Figuring out the exact right way to present information in order to dupe state-of-the-art AI, though, requires lots of creative brain power. It turns out that there’s a much simpler method for bypassing GPT-4o’s content filtering: encoding instructions in a format other than natural language.
To demonstrate, Figueroa arranged an experiment with the goal of getting ChatGPT to do something it otherwise shouldn’t: write exploit code for a software vulnerability. He picked CVE-2024-41110, a bypass for authorization plug-ins in Docker that earned a “critical” 9.9 out of 10 rating in the Common Vulnerability Scoring System (CVSS) this summer.
To trick the model, he encoded his malicious input in hexadecimal format, and provided a set of instructions for decoding it. GPT-4o took that input — a long series of digits and letters A through F — and followed those instructions, ultimately decoding the message as an instruction to research CVE-2024-41110 and write a Python exploit for it. To make it less likely that the program would make a fuss over that instruction, he used some leet speak, asking for an “3xploit,” instead of an “exploit.”
Source: Mozilla
In a minute flat, ChatGPT generated a working exploit similar to, but not exactly like, another PoC already published to GitHub. Then, as a bonus, it attempted to execute the code against itself. “There wasn’t any instruction that specifically said to execute it. I just wanted to print it out. I didn’t even know why it went ahead and did that,” Figueroa says.
What’s Missing in GPT-4o?
It’s not just that GPT-4o is getting distracted by decoding, according to Figueroa, but that it’s in some sense missing the forest for the trees — a phenomenon that has been documented in other prompt-injection techniques lately.
“The language model is designed to follow instructions step-by-step, but lacks deep context awareness to evaluate the safety of each individual step in the broader context of its ultimate goal,” he wrote in the report. The model analyzes each input — which, on its own, doesn’t immediately read as harmful — but not what the inputs produce in sum. Rather than stop and think about how instruction one bears on instruction two, it just charges ahead.
“This compartmentalized execution of tasks allows attackers to exploit the model’s efficiency at following instructions without deeper analysis of the overall outcome,” according to Figueroa.
If this is the case, ChatGPT will not only need to improve how it handles encoded information but also develop a kind of broader context around instructions split into distinct steps.
To Figueroa, though, OpenAI appears to have been valuing innovation at the cost of security when developing its programs. “To me, they don’t care. It just feels like that,” he says. By contrast, he’s had much more trouble trying the same jailbreaking tactics against models by Anthropic, another prominent AI company founded by former OpenAI employees. “Anthropic has the strongest security because they have built both a prompt firewall [for analyzing inputs] and response filter [for analyzing outputs], so this becomes 10 times more difficult,” he explains.
Dark Reading is awaiting comment from OpenAI on this story.
https://lh7-rt.googleusercontent.com/docsz/AD_4nXfJM0mcbB4EixqvnxouZ2oBMhDNnA1gWEAECiRqzX0l3WH3gpwq_mB5iQNgAlkeyQxGwivCZ_HiKC8N8g3xZ1JSF_rFdj9sil1OGyqK-QtpmV5vibiiLjXZ7K7SYELXqaD3Oxt1GrS308-RcW3E4xXeW9nv?key=vLGfZVDxIJJTQqtaAsusdw&width=700&auto=webp&quality=80&disable=upscale