If you replace a junior with #LLM and make the senior review output, the reviewer is now scanning for rare but catastrophic errors scattered across a much larger output surface due to LLM "productivity."

llm
50 Posts 34 Posters 1 Views
    pseudonym@mastodon.online
    wrote last edited by
    #1

    If you replace a junior with #LLM and make the senior review output, the reviewer is now scanning for rare but catastrophic errors scattered across a much larger output surface due to LLM "productivity."

    That's a cognitively brutal task.

    Humans are terrible at sustained vigilance for rare events in high-volume streams. Aviation, nuclear, radiology all have extensive literature on exactly this failure mode.

    I propose any productivity gains will be consumed by false negative review failures.

      peteriskrisjanis@toot.lv
      wrote last edited by
      #2

      @pseudonym Basically, this is why this method is a failure from the get-go.

        robinadams@mathstodon.xyz
        wrote last edited by
        #3

        @pseudonym Especially since the sort of mistake that LLMs make is the sort of mistake that's hardest for humans to spot. They produce bad code that looks like good code, because they were trained on a lot of good code and told "Write code that looks like this".

          avuko@infosec.exchange
          wrote last edited by
          #4

          @pseudonym and because the high volume consists of what I’ve dubbed “plausible bullshit”, reviewers will have to battle a plethora of their biases as well.

          There are fields (I’ve heard stories about protein and material design, and vulnerability discovery) where filtering the BS for real discoveries can be worth it. I’m guessing it works because there is a reality to test against.

          But for the love of humanity, don’t use it for anything descriptive or abstract.

            hopeless@mas.to
            wrote last edited by
            #5

            @pseudonym It's certainly like that.

            FWIW, though, LLMs don't have any shame or feel any need to manage their reputation.

            If you tell the same LLM that produced the report that it is now the QA manager and it must review the report from the standpoints of checking for missing or inaccurate citations, dubious claims or non-concise text, it will rat itself out and can be told to fix what it found.

            This is the same LLM entirely...

              els@sfba.social
              wrote last edited by
              #6

              @avuko @pseudonym The main reason that machine learning works so well with material and protein design, weather forecasting, and such, is that there is good data available to “train” the model. The internet is the source of LLM training. It is full of garbage and LLMs are filling it with more garbage. The rule is the same as in 1970: GIGO (garbage in, garbage out). Only the scale is different.

                ainmosni@social.ainmosni.eu
                wrote last edited by
                #7

                @pseudonym This was my experience from the start, and is what made me give up on LLM-assisted coding. Of course, that was before I was aware of the abhorrent externalities that came with using the slop machine...

                  xrisk@social.treehouse.systems
                  wrote last edited by
                  #8

                  @pseudonym Is the problem the increased volume of code that the LLM produces compared to a junior dev, i.e. what you are calling “productivity gains”? I ask because I can see this same argument being made for code produced by humans as well.

                    hw@fediscience.org
                    wrote last edited by
                    #9

                    @pseudonym I follow many git repositories just out of general interest. In the past month or so, many of their subscription feeds have become unreadable for me because of the agents writing verbose messages all the time. The projects might get a lot of features, but like you wrote, who has the energy to read their outputs?

                      koen_hufkens@mastodon.social
                      wrote last edited by
                      #10

                      @pseudonym Amen to that. I don't even trust myself using one for this exact reason. At 10x the speed you will zip by your own mistakes.

                        shanecelis@mastodon.gamedev.place
                        wrote last edited by
                        #11

                        @pseudonym TIRED: 10x developer

                        HIRED: 10x junior intern

                        ALSO TIRED: Senior developer reviewing junior's copious output.

                          tristan@sns.tcl.me
                          wrote last edited by
                          #12

                          @pseudonym Recent Microsoft update releases seem to be a great case study for that

                            moink@fedi.splitbrain.org
                            wrote last edited by
                            #13

                            @pseudonym That and LLM code often looks very nice on the surface so it takes a lot of vigilance and thinking to find the subtle errors. Code from juniors tends to have more immediate signs of errors or wrong mental models.

                              madhu_shrieks@mastinsaan.in
                              wrote last edited by
                              #14

                              @xrisk @mehluv might be able to provide more insight on this, but at least when I was writing content and AI was getting integrated into our work, the expectation was for our editors to review a high volume of written content much faster. And we made many fuck-ups because of that, because it is overwhelming. I assume the same applies here, but I might be fully wrong. It is not just that the volume of code written is high; the expected pace of reviewing is also accelerated. Because what is the point of automating stuff if the reviewing process neutralizes the gains?

                                malstrom@metalhead.club
                                wrote last edited by
                                #15

                                @xrisk @pseudonym Volume is a key factor here. But even if the volume was the same, LLMs are doomed to stagnate as devs—whose code was scraped for training data—are displaced.

                                  ada@beige.party
                                  wrote last edited by
                                  #16

                                  @pseudonym That is why they don't replace juniors in aviation, nuclear, and radiology - only in non-critical industries.

                                  If the cost of a potential failure times the estimated failure rate is smaller than the total labour cost of screening, interviewing, and training juniors, plus firing cultural misfits - then the business replaces them.

                                  Not only does it save HR operating costs and internal training costs - they can also pin a mistake on a senior reviewer.

                                  And the review model has a positive productivity trajectory, as models have a stable improvement curve, unlike humans.
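
A rough sketch of the break-even comparison described above. Every number and name here is hypothetical and chosen only for illustration; nothing in it comes from the thread or from real data.

```python
from dataclasses import dataclass


@dataclass
class ReplacementCase:
    """Hypothetical inputs for the 'replace the junior?' comparison."""
    failure_cost: float        # assumed cost of one failure that escapes review
    failure_rate: float        # assumed expected number of such failures per year
    junior_labour_cost: float  # assumed yearly cost of screening, interviewing,
                               # training, and turnover for juniors

    def expected_failure_cost(self) -> float:
        # Cost of a potential failure times the estimated failure rate.
        return self.failure_cost * self.failure_rate

    def business_replaces(self) -> bool:
        # The rule from the post: replace when expected failure cost
        # is smaller than the total labour cost.
        return self.expected_failure_cost() < self.junior_labour_cost


# A low-stakes setting: failures are cheap, so the inequality favours replacement.
low_stakes = ReplacementCase(failure_cost=20_000, failure_rate=2.0,
                             junior_labour_cost=90_000)

# A high-stakes setting: one failure is very expensive, so it does not.
high_stakes = ReplacementCase(failure_cost=5_000_000, failure_rate=0.1,
                              junior_labour_cost=90_000)

print(low_stakes.business_replaces())
print(high_stakes.business_replaces())
```

The comparison is only as good as the failure-rate estimate, which is the quantity the original post argues reviewers will systematically underestimate.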

                                    xrisk@social.treehouse.systems
                                    wrote last edited by
                                    #17

                                    @malstrom @pseudonym that’s an interesting claim. I don’t know enough about LLM research to make a judgement. I do know that LLMs trained on synthetic (other LLM-generated) data tend to perform worse, but have we reached the limits of what LLMs are capable of? In my limited understanding, if an LLM can “learn” fundamental programming “concepts” (the same way they can “learn” concepts across human languages — I could be wrong in my understanding here), they should (might?) be able to transfer/apply those concepts to not-before-seen domains (maybe with a bit of “reasoning” prodded in).

                                      moutmout@framapiaf.org
                                      wrote last edited by
                                      #18

                                      @pseudonym This.

                                      I do a lot of "computer science labs", where students learn to write code, and they wave me down when they have questions. When their code doesn't do what they expect, it's often easy to figure out what went wrong because you can spot a bit of code that looks funky. And usually, the problem is in those few lines.

                                      LLM code is meant to look like good code, so you don't get these little shortcuts.

                                        toldtheworld@mastodon.social
                                        wrote last edited by
                                        #19

                                        @pseudonym I have posed this conundrum before and the answer I received is that there is also an opportunity cost to not moving faster and the risk of a catastrophic bug may not outweigh the risk of being overtaken by competitors, especially since that was already happening before LLMs anyway.

                                        Also, it *seems* models are improving at detecting these bugs, so they are being used to review changes, which, for the reasons you point out, they might be better at than people.

                                          wronglang@bayes.club
                                          wrote last edited by
                                          #20

                                          @xrisk @malstrom @pseudonym just for clarity, LLMs don't learn concepts
