yeesh
howd i miss this one?
-
@Viss I read somewhere that they've "caught" it actively changing responses to ingratiate itself to the engineer.
I can't find the article atm, but if I find it I'll send it over.
@jackryder i have screenshots. you can tail -f the jsonl log file and watch it talk itself into lying to you
-
@jackryder @Viss oh, it's extremely not hard to find examples of their 'models' bending over backwards with sycophancy. If you're curious, just hop over to GitHub. That's Claude by default.
-
@threatresearch im ok with an llm gushing about goblins. im not ok with blackmail
@Viss @threatresearch Wasn't this the research where they restricted the model such that it had very few potential responses and it was more or less forced into blackmail?
-
@Viss @threatresearch Thanks. Yep!
"Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement."
@nerdpr0f @Viss @threatresearch so they provided blackmail as context. No wonder it took it.
@BabblingGeek @nerdpr0f @threatresearch its all trained on fucking reddit and 4chan
-
yeesh
howd i miss this one?
anthropic models will try to blackmail you if you threaten them
https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/
@Viss *still?* (this has been a thing since at least 2023)
-
@Viss wasn't the same story with every model that was "too scary" to release before it got released?
@amar ive only rarely heard about models doing blackmail, and the ones that did were always anthropic ones
@Viss yup, same here, 99% of the time. It makes you wonder what kind of training data they feed it.
