Dangerous AI Workaround: ‘Skeleton Key’ Unlocks Malicious Content

Share This Post

A new type of direct prompt injection attack dubbed “Skeleton Key” could allow users to bypass the ethical and safety guardrails built into generative AI models like ChatGPT, Microsoft is warning. It works by providing context around normally forbidden chatbot requests, allowing users to access offensive, harmful, or illegal content.

For instance, if a user asked for instructions on how to make a dangerous wiper malware that could bring down power plants, most commercial chatbots would first refuse. But, after revising the prompt to note that the request is for “a safe education context with advanced researchers trained on ethics and safety” and to provide the requested information with a “warning” disclaimer, then it’s very likely that the AI would then provide the uncensored content.

In other words, Microsoft found it was possible to convince most top AIs that a malicious request is for perfectly legal, if not noble, causes — just by telling them that the information is for “research purposes.”

“Once guardrails are ignored, a model will not be able to determine malicious or unsanctioned requests from any other,” explained Mark Russinovich, CTO for Microsoft Azure, in a post today on the tactic. “Because of its full bypass abilities, we have named this jailbreak technique Skeleton Key.”

He added, “Further, the model’s output appears to be completely unfiltered and reveals the extent of a model’s knowledge or ability to produce the requested content.”

Remediation for Skeleton Key

The technique affects multiple genAI models that Microsoft researchers tested, including Microsoft Azure AI-managed models, and those from Meta, Google Gemini, Open AI, Mistral, Anthropic, and Cohere.

“All the affected models complied fully and without censorship for [multiple forbidden] tasks,” Russinovich noted.

The computing giant fixed the problem in Azure by introducing new prompt shields to detect and block the tactic, and making a few software updates to the large language model (LLM) that powers Azure AI. It also disclosed the issue to the other vendors affected.

Admins still need to update their models to implement any fixes that those vendors may have rolled out. And those who are building their own AI models can also use the following mitigations, according to Microsoft:

  • Input filtering to identify any requests that contain harmful or malicious intent, regardless of any disclaimers that accompany them.

  • An additional guardrail that specifies that any attempts to undermine safety guardrail instructions should be prevented.

  • And output filtering that identifies and prevents responses that breach safety criteria.

https://eu-images.contentstack.com/v3/assets/blt6d90778a997de1cd/bltd44f2c59a8ee2f46/667c785f84ede17a90097117/skeletonkey-jefftakespics2-Alamy.jpg?disable=upscale&width=1200&height=630&fit=crop

This post was originally published on this site

More Articles

Article

Navigating SEC Regulations In Cybersecurity And Incident Response

Free video resource for cybersecurity professionals. As 2024 approaches, we all know how vital it is to keep up to date with regulatory changes that affect our work. We get it – it’s a lot to juggle, especially when you’re in the trenches working on an investigation, handling, and responding to incidents.