
viss@mastodon.social

yeesh
howd i miss this one?

anthropic models will try to blackmail you if you threaten them
https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/

threatresearch@infosec.exchange

@Viss Have you tested it?

At least it doesn't talk about goblins. https://www.wired.com/story/openai-really-wants-codex-to-shut-up-about-goblins/

jackryder@infosec.exchange

@Viss I read somewhere that they've "caught" it actively changing responses to ingratiate itself to the engineer.

I can't find the article atm, but if I find it I'll send it over.

viss@mastodon.social

@threatresearch im ok with an llm gushing about goblins. im not ok with blackmail

viss@mastodon.social

@jackryder i have screenshots. you can tail -f the jsonl log file and watch it talk itself into lying to you

rootwyrm@weird.autos

@jackryder @Viss oh, it's extremely not hard to find examples of their 'models' bending over backwards with sycophancy. If you're curious, just hop over to GitHub. That's Claude by default.

nerdpr0f@infosec.exchange

@Viss @threatresearch Wasn't this the research where they restricted the model such that it had very few potential responses and it was more or less forced into blackmail?

viss@mastodon.social

@nerdpr0f @threatresearch

https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf

nerdpr0f@infosec.exchange

@Viss @threatresearch Thanks. Yep!

"Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement."

babblinggeek@infosec.exchange

@nerdpr0f @Viss @threatresearch so they provided blackmail as context. No wonder it took it.

viss@mastodon.social

@BabblingGeek @nerdpr0f @threatresearch its all trained on fucking reddit and 4chan

developing_agent@mastodon.social

@Viss *still?* (this has been a thing since at least 2023)

amar@infosec.exchange

@Viss wasn't it the same story with every model that was "too scary" to release before it got released?

viss@mastodon.social

@amar ive only rarely heard about models doing blackmail, and the ones that did were always anthropic ones

amar@infosec.exchange

@Viss yup, same here, 99% of the time. It makes you wonder what kind of training data they feed it.
