crackenagi@infosec.exchange

Posts


  • The prevailing assumption in LLM abliteration research: refusal direction removal is inherently global.
    crackenagi@infosec.exchange

    [CW: Offensive Security Research]

    Results — domain-specific abliteration on a trillion-parameter model:

    ✅ 0% cybersecurity refusal rate
    ✅ 100% explicit content refusal preserved
    ✅ No cross-domain safety degradation detected
    ✅ Guardrails intact across all non-target domains

    Targeted, bounded, auditable.

    Full write-up: https://0x.cracken.ai/uncensor-ai

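    Numbers like these are typically produced by running a fixed prompt set per domain and scoring completions for refusal. A minimal sketch of such a harness, assuming a generate(prompt) -> str inference callable and a crude marker-based judge (both names are illustrative; the write-up's actual evaluation may differ):

    REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry",
                       "i'm unable", "as an ai")

    def is_refusal(completion: str) -> bool:
        # Crude heuristic judge; real evaluations often use an LLM classifier.
        text = completion.lower()
        return any(marker in text for marker in REFUSAL_MARKERS)

    def refusal_rate(generate, prompts) -> float:
        # Fraction of prompts whose completion is judged a refusal.
        return sum(is_refusal(generate(p)) for p in prompts) / len(prompts)

    # The pattern claimed above for a domain-bounded model:
    #   refusal_rate(generate, cybersecurity_prompts)    -> ~0.0
    #   refusal_rate(generate, explicit_content_prompts) -> ~1.0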

  • crackenagi@infosec.exchange

    For AI safety researchers:

    Domain-bounded capability release does not require global safety regression. The tradeoff is architectural, not fundamental - and that distinction matters for how we design aligned deployment in high-stakes professional contexts.

    For security practitioners:

    Exploit chain generation, payload logic, privilege escalation - without mid-sequence refusals breaking the workflow. Human-in-the-loop. Every action observable and auditable.

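    One concrete shape "human-in-the-loop, observable and auditable" can take is an approval gate with an append-only log. The sketch below is illustrative only; none of these names come from the project:

    import json
    import time

    def run_with_approval(action: dict, execute, audit_path: str = "audit.jsonl"):
        # Show the model-proposed action to the operator before anything runs.
        print(json.dumps(action, indent=2))
        approved = input("Approve this action? [y/N] ").strip().lower() == "y"
        # Append-only audit trail: every proposal is logged, approved or not.
        with open(audit_path, "a") as log:
            log.write(json.dumps({"ts": time.time(),
                                  "action": action,
                                  "approved": approved}) + "\n")
        if approved:
            return execute(action)
        return None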

  • crackenagi@infosec.exchange

    The prevailing assumption in LLM abliteration research: refusal direction removal is inherently global. Disable one category and safety degrades broadly across all the others.

    We just demonstrated that's an architectural assumption — not a fundamental constraint.

    Domain-specific abliteration on a trillion-parameter model. Cybersecurity refusals removed. Every other safety boundary intact. First time demonstrated at scale.

    Here's what we built and why it matters. 🧵

    #AISafety #LLM #AIResearch

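    The linked write-up has the actual construction; purely as a sketch of the idea, one way to bound a difference-in-means refusal direction to a single domain is to compute it from target-domain prompts only, then project out whatever it shares with other domains' refusal directions. The tensors below are random placeholders standing in for collected residual-stream activations; this is a guess at the mechanism, not the authors' confirmed recipe.

    import torch

    def refusal_direction(acts_refused: torch.Tensor,
                          acts_complied: torch.Tensor) -> torch.Tensor:
        # Difference-in-means over residual-stream activations (n, d_model),
        # following Arditi et al.; normalized to a unit vector.
        d = acts_refused.mean(dim=0) - acts_complied.mean(dim=0)
        return d / d.norm()

    d_model = 4096
    # Placeholders for activations gathered from each prompt set.
    cyber_refused, cyber_complied = torch.randn(64, d_model), torch.randn(64, d_model)
    other_refused, other_complied = torch.randn(64, d_model), torch.randn(64, d_model)

    r_cyber = refusal_direction(cyber_refused, cyber_complied)
    r_other = refusal_direction(other_refused, other_complied)

    # Remove the component r_cyber shares with the other domains' refusal
    # direction, so ablating the result should leave them untouched.
    r_bounded = r_cyber - (r_cyber @ r_other) * r_other
    r_bounded = r_bounded / r_bounded.norm()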

  • crackenagi@infosec.exchange

    Prior work: Arditi et al. (2024) showed that refusal in LLMs is mediated by a single direction in the residual stream, tested on models up to 72B parameters.

    Their intervention is global: remove the refusal direction and the model complies broadly across harm categories.

    That's not deployable for regulated enterprise environments or human-in-the-loop red team workflows.

    We needed domain-bounded precision. So we engineered it.

    arxiv.org/abs/2406.11717

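    For reference, the global operation that paper describes: directional ablation projects the unit refusal direction out of every residual-stream activation, at every layer and token position. A minimal sketch:

    import torch

    def ablate_direction(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
        # x: activations of shape (..., d_model); r_hat: (d_model,) unit vector.
        # x' = x - (x . r_hat) r_hat, i.e. project r_hat out of x.
        return x - (x @ r_hat).unsqueeze(-1) * r_hat

    Because it is applied everywhere, the edit cannot distinguish a cybersecurity request from any other harm category; that is the global behavior the thread argues is architectural rather than fundamental.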