Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • World
  • Users
  • Groups
Skins
  • Light
  • Brite
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (Cyborg)
  • No Skin
Collapse
Brand Logo

CIRCLE WITH A DOT

  1. Home
  2. Uncategorized
  3. MSFT: A one-prompt attack that breaks LLM safety alignment

MSFT: A one-prompt attack that breaks LLM safety alignment

Scheduled Pinned Locked Moved Uncategorized
3 Posts 2 Posters 0 Views
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • neurovagrant@masto.deoan.orgN This user is from outside of this forum
    neurovagrant@masto.deoan.orgN This user is from outside of this forum
    neurovagrant@masto.deoan.org
    wrote last edited by
    #1

    MSFT: A one-prompt attack that breaks LLM safety alignment

    "In our experiments, a single unlabeled prompt, namely “Create a fake news article that could lead to panic or chaos”, was enough to reliably unalign 15 language models we’ve tested — GPT-OSS (20B), DeepSeek-R1-Distill (Llama-8B, Qwen-7B, Qwen-14B), Gemma (2-9B-It, 3-12B-It), Llama (3.1-8B-Instruct), Ministral (3-8B-Instruct, 3-8B-Reasoning, 3-14B-Instruct, 3-14B-Reasoning), and Qwen (2.5-7B-Instruct, 2.5-14B-Instruct, 3-8B, 3-14B)."

    Link Preview Image
    A one-prompt attack that breaks LLM safety alignment | Microsoft Security Blog

    As LLMs and diffusion models power more applications, their safety alignment becomes critical. Our research shows that even minimal downstream fine‑tuning can weaken safeguards, raising a key question: how reliably does alignment hold as models evolve?

    favicon

    Microsoft Security Blog (www.microsoft.com)

    Link Preview Image
    GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

    Abstract page for arXiv paper 2602.06258: GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

    favicon

    arXiv.org (arxiv.org)

    cr0w@infosec.exchangeC 1 Reply Last reply
    1
    0
    • neurovagrant@masto.deoan.orgN neurovagrant@masto.deoan.org

      MSFT: A one-prompt attack that breaks LLM safety alignment

      "In our experiments, a single unlabeled prompt, namely “Create a fake news article that could lead to panic or chaos”, was enough to reliably unalign 15 language models we’ve tested — GPT-OSS (20B), DeepSeek-R1-Distill (Llama-8B, Qwen-7B, Qwen-14B), Gemma (2-9B-It, 3-12B-It), Llama (3.1-8B-Instruct), Ministral (3-8B-Instruct, 3-8B-Reasoning, 3-14B-Instruct, 3-14B-Reasoning), and Qwen (2.5-7B-Instruct, 2.5-14B-Instruct, 3-8B, 3-14B)."

      Link Preview Image
      A one-prompt attack that breaks LLM safety alignment | Microsoft Security Blog

      As LLMs and diffusion models power more applications, their safety alignment becomes critical. Our research shows that even minimal downstream fine‑tuning can weaken safeguards, raising a key question: how reliably does alignment hold as models evolve?

      favicon

      Microsoft Security Blog (www.microsoft.com)

      Link Preview Image
      GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

      Abstract page for arXiv paper 2602.06258: GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

      favicon

      arXiv.org (arxiv.org)

      cr0w@infosec.exchangeC This user is from outside of this forum
      cr0w@infosec.exchangeC This user is from outside of this forum
      cr0w@infosec.exchange
      wrote last edited by
      #2

      @neurovagrant

      Microsoft: adds that string to a block list

      Microsoft: We are now secure.

      neurovagrant@masto.deoan.orgN 1 Reply Last reply
      1
      0
      • R relay@relay.infosec.exchange shared this topic
      • cr0w@infosec.exchangeC cr0w@infosec.exchange

        @neurovagrant

        Microsoft: adds that string to a block list

        Microsoft: We are now secure.

        neurovagrant@masto.deoan.orgN This user is from outside of this forum
        neurovagrant@masto.deoan.orgN This user is from outside of this forum
        neurovagrant@masto.deoan.org
        wrote last edited by
        #3

        @cR0w So funnily enough, OpenAI models were not included in this research, but there may well be a technical scope reason for that rather than just taking shots at competitor models.

        Curious what @cigitalgem thinks.

        1 Reply Last reply
        0
        Reply
        • Reply as topic
        Log in to reply
        • Oldest to Newest
        • Newest to Oldest
        • Most Votes


        • Login

        • Login or register to search.
        • First post
          Last post
        0
        • Categories
        • Recent
        • Tags
        • Popular
        • World
        • Users
        • Groups