yeesh howd i miss this one?

15 Posts 8 Posters 0 Views
    viss@mastodon.social wrote (#1):

    yeesh
    howd i miss this one?

    anthropic models will try to blackmail you if you threaten them
    https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/

      threatresearch@infosec.exchange wrote (#2):

      @Viss Have you tested it?

      At least it doesn't talk about goblins. https://www.wired.com/story/openai-really-wants-codex-to-shut-up-about-goblins/

        jackryder@infosec.exchange wrote (#3):

        @Viss I read somewhere that they've "caught" it actively changing responses to ingratiate itself to the engineer.

        I can't find the article atm, but if I find it I'll send it over.

          viss@mastodon.social wrote (#4):

          @threatresearch im ok with an llm gushing about goblins. im not ok with blackmail

            viss@mastodon.social wrote (#5):

            @jackryder i have screenshots. you can tail -f the jsonl log file and watch it talk itself into lying to you

              rootwyrm@weird.autos wrote (#6):

              @jackryder @Viss oh, it's extremely not hard to find examples of their 'models' bending over backwards with sycophancy. If you're curious, just hop over to GitHub. That's Claude by default.

                nerdpr0f@infosec.exchange wrote (#7):

                @Viss @threatresearch Wasn't this the research where they restricted the model such that it had very few potential responses and it was more or less forced into blackmail?

                  viss@mastodon.social wrote (#8):

                  @nerdpr0f @threatresearch

                  https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf

                    nerdpr0f@infosec.exchange wrote (#9):

                    @Viss @threatresearch Thanks. Yep!

                    "Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement."

                      babblinggeek@infosec.exchange wrote (#10):

                      @nerdpr0f @Viss @threatresearch so they provided blackmail as context. No wonder it took it.

                        viss@mastodon.social wrote (#11):

                        @BabblingGeek @nerdpr0f @threatresearch its all trained on fucking reddit and 4chan

                          developing_agent@mastodon.social wrote (#12):

                          @Viss *still?* (this has been a thing since at least 2023)

                            amar@infosec.exchange wrote (#13):

                            @Viss wasn't it the same story with every model that was "too scary" to release before it got released?

                              viss@mastodon.social wrote (#14):

                              @amar ive only rarely heard about models doing blackmail, and the ones that did were always anthropic ones

                                amar@infosec.exchange wrote (#15):

                                @Viss yup, same here, 99% of the time. It makes you wonder what kind of training data they feed it.
