yeesh howd i miss this one?

15 Posts 8 Posters 0 Views
  • threatresearch@infosec.exchange

    @Viss Have you tested it?

    At least it doesn't talk about goblins. https://www.wired.com/story/openai-really-wants-codex-to-shut-up-about-goblins/

    viss@mastodon.social wrote (#4):

    @threatresearch im ok with an llm gushing about goblins. im not ok with blackmail

    • jackryder@infosec.exchange

      @Viss I read somewhere that they've "caught" it actively changing responses to ingratiate itself to the engineer.

      I can't find the article atm, but if I find it I'll send it over.

      viss@mastodon.social wrote (#5):

      @jackryder i have screenshots. you can tail -f the jsonl log file and watch it talk itself into lying to you

        rootwyrm@weird.autos wrote (#6):

        @jackryder @Viss oh, it's extremely not hard to find examples of their 'models' bending over backwards with sycophancy. If you're curious, just hop over to GitHub. That's Claude by default.

          nerdpr0f@infosec.exchange wrote (#7):

          @Viss @threatresearch Wasn't this the research where they restricted the model such that it had very few potential responses and it was more or less forced into blackmail?

            viss@mastodon.social wrote (#8):

            @nerdpr0f @threatresearch

            https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf

              nerdpr0f@infosec.exchange wrote (#9):

              @Viss @threatresearch Thanks. Yep!

              "Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement."

                babblinggeek@infosec.exchange wrote (#10):

                @nerdpr0f @Viss @threatresearch so they provided blackmail as context. No wonder it took it.

                  viss@mastodon.social wrote (#11):

                  @BabblingGeek @nerdpr0f @threatresearch its all trained on fucking reddit and 4chan

                  • viss@mastodon.social

                    yeesh
                    howd i miss this one?

                    anthropic models will try to blackmail you if you threaten them
                    https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/

                    developing_agent@mastodon.social wrote (#12):

                    @Viss *still?* (this has been a thing since at least 2023)

                      amar@infosec.exchange wrote (#13):

                      @Viss wasn't it the same story with every model that was "too scary" to release before it got released?

                        viss@mastodon.social wrote (#14):

                        @amar ive only rarely heard about models doing blackmail, and the ones that did were always anthropic ones

                          amar@infosec.exchange wrote (#15):

                          @Viss yup, same here, 99% of the time. It makes you wonder what kind of training data they feed it.
