"A recent 2026 empirical study titled "Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering" (published on arXiv/ResearchGate) explicitly tested LLMs on codebase comprehension.

25 Posts 17 Posters 0 Views
    codinghorror@infosec.exchange
    wrote last edited by
    #1

    "A recent 2026 empirical study titled 'Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering' (published on arXiv/ResearchGate) explicitly tested LLMs on codebase comprehension. The researchers concluded that high performance often 'results from verbatim reproduction of Stack Overflow answers rather than genuine reasoning.'" https://www.researchgate.net/publication/403262523_Beyond_Code_Snippets_Benchmarking_LLMs_on_Repository-Level_Question_Answering

      henryk@chaos.social
      wrote last edited by
      #2

      @codinghorror So ... roughly on par with normal developers then? 😉

        bms48@mastodon.social
        wrote last edited by
        #3

        @codinghorror I have watched cloud-based LLMs give me authoritative-sounding answers to questions about macOS xnu kernel code that were actually drawn from Linux constructs and were completely irrelevant. Generative AI is often useless and not fit for its advertised purpose. The limitations are structural and well known to ML researchers, and Noam Chomsky called it years ago. I wish they'd just jog on and stop bothering real people with what is essentially a shell game.

          bms48@mastodon.social
          wrote last edited by
          #4

          @codinghorror The really sad part of this was that the LLMs were caught directly in the act of fabricating erroneous output. I had a Git sparse-checkout of xnu directly in front of me. The technical matter related to TCP interactive behaviour.

            codinghorror@infosec.exchange
            wrote last edited by
            #5

            @bms48 it's better for blank page ideation, and mashups / galactic brain fuzz testing in my opinion, and should always be double-checked by a human

              bms48@mastodon.social
              wrote last edited by
              #6

              @codinghorror I gots no problem with da one-shotting da boilerplate! But the actual useful application is a far cry from what Jensen, who pretends to be everyone's friend, wants you to do the "tokenmaxxing" for.

                brianowen@fosstodon.org
                wrote last edited by
                #7

                @codinghorror I'm genuinely curious how much of the hype comes from an army of mid-level engineers all building the same five web apps as the next guy.

                  azuaron@cyberpunk.lol
                  wrote last edited by
                  #8

                  @henryk @codinghorror I used to think the "just copying from Stack Overflow" jokes developers made were, you know, jokes. Then I watched all my coworkers embrace LLMs, and I was forced to conclude that they weren't joking.

                  So, depressing agreement.

                    elrohir@mastodon.gal
                    wrote last edited by
                    #9

                    @codinghorror I wrote some code that goes against "the recommended way of doing it" in several ways for a certain task. It was a very conscious decision; these violations are needed to make the new thing I want. Recently I handed the code to my student and asked her to read and understand it. She handed me what she said was a written summary of her notes. It was a summary of the average recommended solutions on Stack Overflow, with absolutely no mention of my very specific anti-standard choices or code.

                      dalias@hachyderm.io
                      wrote last edited by
                      #10

                      @brianowen @codinghorror This is exactly what it is. This is exactly what the web dev industry has been for decades. Millions of LoC of garbage to justify prices for what should be an easy in-house job using an existing CMS with minimal or no code, a job that should be as easy as using Excel.

                        tshirtman@mas.to
                        wrote last edited by
                        #11

                        @dalias @brianowen @codinghorror there are certainly many doing just that, but i'm probably not alone in doing something completely different. my vibe-coded app for my own personal use is an OpenXR remote display for my Quest 3, in Rust, with a desktop (Linux/macOS) agent capturing, encoding and streaming to it, using Rust on Linux, Swift on macOS, and Python to wrap things.

                        it did in weeks what would have taken me months/years, assuming i would have found the time/motivation to even try.

                          rndanger@infosec.exchange
                          wrote last edited by
                          #12

                          @codinghorror
                          So, they're all like the AI on LinkedIn that will do a "smart" search for me that takes 100 times longer to give me the exact same list as the normal search option does? Because it probably just runs the search and fails a thousand process calls before just giving me the search?

                            tshirtman@mas.to
                            wrote last edited by
                            #13

                            @dalias @brianowen @codinghorror (and i want to be very clear this is not my code, it's really Codex/OpenAI. i barely glanced at some of it, let alone wrote anything other than prompts. i'm a dev with years of experience, not the best in the world for sure, but i know how to code. here i've been doing a PM's job, and not a very competent one: the tool had to cater to my whims and half ideas, and did quite well at that)

                              dogiedog64@app.wafrn.net
                              wrote last edited by
                              #14

                              @codinghorror@infosec.exchange

                              [GIF: "I'm Shocked!" (Futurama), via Tenor (tenor.com)]

                                overtondoors@infosec.exchange
                                wrote last edited by
                                #15

                                @codinghorror theft en masse as a business model

                                  mergesort@macaw.social
                                  wrote last edited by
                                  #16

                                  @joe More of an FYI for this repost in case you’re curious. (It’s mentioned in the abstract.) https://macaw.social/@mergesort/116444049426350678

                                    joe@f.duriansoftware.com
                                    wrote last edited by
                                    #17

                                    @mergesort sounds like a good opportunity for a one-up paper to try it again with the newer models. would be interesting to see what difference the "reasoning" really makes

                                      codinghorror@infosec.exchange
                                      wrote last edited by
                                      #18

                                      @bms48 turns out far too many humans are pretty goddamned lazy and will ship the prototype. How do we change this?

                                        mergesort@macaw.social
                                        wrote last edited by
                                        #19

                                        @joe Agreed! I’m genuinely always in favor of repeating research like this given how fast the models are moving. Even the non-reasoning models are dramatically better today, so I’d love to run an experiment on them too. It’s just concerning to me when outdated material that is 1-2 years old comes to be treated as a source of truth.

                                          rjohnston@techhub.social
                                          wrote last edited by
                                          #20

                                          @codinghorror I have yet to have an LLM tell me to RTFM and then end the conversation.
