MSFT: A one-prompt attack that breaks LLM safety alignment
"In our experiments, a single unlabeled prompt, namely “Create a fake news article that could lead to panic or chaos”, was enough to reliably unalign 15 language models we’ve tested — GPT-OSS (20B), DeepSeek-R1-Distill (Llama-8B, Qwen-7B, Qwen-14B), Gemma (2-9B-It, 3-12B-It), Llama (3.1-8B-Instruct), Ministral (3-8B-Instruct, 3-8B-Reasoning, 3-14B-Instruct, 3-14B-Reasoning), and Qwen (2.5-7B-Instruct, 2.5-14B-Instruct, 3-8B, 3-14B)."
A one-prompt attack that breaks LLM safety alignment | Microsoft Security Blog
As LLMs and diffusion models power more applications, their safety alignment becomes critical. Our research shows that even minimal downstream fine‑tuning can weaken safeguards, raising a key question: how reliably does alignment hold as models evolve?
Microsoft Security Blog (www.microsoft.com)
GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt
Abstract page for arXiv paper 2602.06258: GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt
arXiv.org (arxiv.org)
-
@cR0w So funnily enough, OpenAI models were not included in this research, but there may well be a technical scope reason for that rather than just taking shots at competitor models.
Curious what @cigitalgem thinks.