Jailbreaking and prompt injection are new, rising threats to generative AI (GenAI). Jailbreaking tricks the AI with specific prompts to produce harmful or misleading results. Prompt injection conceals malicious data or instructions within typical prompts, resembling SQL injection in databases, that leads the model to produce unintended outputs, creating vulnerabilities or reputational risks.
Reliance on generated content also creates other problems. For example, many developers are starting to use GenAI models, like Microsoft Copilot or ChatGPT, to help them write or revise source code. Unfortunately, recent research indicates that code output by GenAI can contain security vulnerabilities and other problems that developers might not realize. However, there is also hope that over time GenAI might be able to help developers write code that is more secure.
Additionally, GenAI is bad at keeping secrets. Training an AI on proprietary or sensitive data introduces the risk of that data being indirectly exposed or inferred. This may include the leak of personally identifiable information (PII) and access tokens. More importantly, detecting these leaks can be challenging due to the unpredictability of the model’s behavior. Given the vast number of potential prompts a user might pose, it’s infeasible to comprehensively anticipate and guard against them all.
Traditional Approaches Fall Short
Attacks on GenAI are more similar to attacks on humans — such as scams, con games, and social engineering — than technical attacks on code. Traditional security products like rule-based firewalls, designed primarily for conventional cyber threats, were not designed with the dynamic and adaptive nature of GenAI threats in mind and can’t address the emergent threats outlined above. Two common security methodologies — data obfuscation and rule-based filtering — have significant limitations.
Data obfuscation or encryption, which disguises original data to protect sensitive information, is frequently used to ensure data privacy. However, the challenge of data obfuscation for GenAI is the difficulty in pinpointing and defining which data is sensitive. Furthermore, the interdependencies in data sets mean that even if certain pieces of information are obfuscated, other data points might provide enough context for artificial intelligence to infer the missing data.
Traditionally, rule-based filtering methods protected against undesirable outputs. Applying this to GenAI by scanning its inputs and outputs seems intuitive. However, malicious users can often bypass these systems, making them unsuitable for AI safety.
This figure highlights some complex jailbreaking prompts that evade simple rules:
Current models from companies like OpenAI and Anthropic use RLHF to align model outputs with universal human values. However, universal values may not be sufficient: Each application of GenAI may require its own customization for comprehensive protection.
Toward a More Robust GenAI Security
As shown in the examples above, attacks on GenAI can be diverse and hard to anticipate. Recent research emphasizes that a defense will need to be as intelligent as the underlying model to be effective. Using GenAI to protect GenAI is a promising direction for defense. We foresee two potential approaches: black-box and white-box defense.
A black-box defense would entail an intelligent monitoring system — which necessarily has a GenAI component — for GenAI, analyzing outputs for threats. It’s akin to having a security guard who inspects everything that comes out of a building. It is probably most appropriate for commercial closed-source GenAI models, where there is no way to modify the model itself.
A white-box defense delves into the model’s internals, providing both a shield and the knowledge to use it. With open GenAI models, it becomes possible to fine-tune them against known malicious prompts, much like training someone in self-defense. While a black-box approach might offer protection, it lacks tailored training; thus, the white-box method is more comprehensive and effective against unseen attacks.
Besides intelligent defenses, GenAI calls for evolving threat management. GenAI threats, like all technology threats, aren’t stagnant. It’s a cat-and-mouse game where, for every defensive move, attackers design a countermove. Thus, security systems need to be ever-evolving, learning from past breaches and anticipating future strategies. There is no universal protection for prompt injection, jailbreaks, or other attacks, so for now one pragmatic defense might be to monitor and detect threats. Developers will need tools to monitor, detect, and respond to attacks on GenAI, as well as a threat intelligence strategy to track new emerging threats.
We also need to preserve flexibility in defense techniques. Society has had thousands of years to come up with ways to protect against scammers; GenAIs have been around for only several years, so we’re still figuring out how to defend them. We recommend developers design systems in a way that preserves flexibility for the future, so that new defenses can be slotted in as they are discovered.
With the AI era upon us, it’s crucial to prioritize new security measures that help machines interact with humanity effectively, ethically, and safely. That means using intelligence equal to the task.