"A recent 2026 empirical study titled "Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering" (published on arXiv/ResearchGate) explicitly tested LLMs on codebase comprehension.

tshirtman@mas.to

@dalias @brianowen @codinghorror there are certainly many doing just that, but i'm probably not alone in doing something completely different, my vibe coded app for my own personal use is an openXR remote display for my quest3, in rust, with a desktop (linux/macos) agent capturing, encoding and streaming to it, using rust on linux, swift on macos, and python to wrap things.

it was done in weeks what would have taken me months/years, assuming i would have found the time/motivation to even try.

rndanger@infosec.exchange

@codinghorror
So, they're all like the AI on LinkedIn that will do a "smart" search for me that takes 100 times longer to give me the exact same list as the normal search option does? Because it probably just runs the search and fails a thousand process calls before just giving me the search?

tshirtman@mas.to

@dalias @brianowen @codinghorror (and i want to be very clear this is not my code, it's really codex/openai, i barely glanced at some of it, let alone write anything other than prompts, i'm a dev with years of experience, not the best in the world for sure, but i know how to code, here i've been doing a PM's job, and not a very competent one, the tool had to cater to my whims and half ideas, and did quite well at that)

dogiedog64@app.wafrn.net

@codinghorror@infosec.exchange

https://tenor.com/view/shocker-shocked-futurama-im-shocked-gif-5296191

overtondoors@infosec.exchange

@codinghorror theft en masse as a business model

mergesort@macaw.social

@joe More of an FYI for this repost in case you’re curious. (It’s mentioned in the abstract.) https://macaw.social/@mergesort/116444049426350678

joe@f.duriansoftware.com

@mergesort sounds like a good opportunity for a one-up paper to try it again with the newer models. would be interesting to see what difference the "reasoning" really makes

codinghorror@infosec.exchange

@bms48 turns out far too many humans are pretty goddamned lazy and will ship the prototype. How do we change this?

mergesort@macaw.social

@joe Agreed! I’m genuinely always in favor of repeating research like this given how fast the models are moving. Even the non-reasoning models are dramatically better today so I’d love to run an experiment on them too, it’s just concerning to me when 1-2 year old outdated material becomes considered a source of truth.

rjohnston@techhub.social

@codinghorror I have yet to have an LLM tell me to RTFM and then end the conversation.

chris@social.lane-jayasinha.com

@codinghorror @bms48 change incentives to be for long term not quarterly. Give people doing work more autonomy to set their own standards. Possibly UBI will enable this shift in perspective from eeking out a paycheck to professional/citizen/human responsibility/opportunity.

codinghorror@infosec.exchange

@rjohnston I've never had that happen to me, personally, but I have pretty good resting bitch face to be fair.

jesstheunstill@infosec.exchange

@dalias @brianowen @codinghorror The number of billion dollar valuation security industry products that amount to a shiny web UI over a few FOSS tools ...

doragasu@mastodon.sdf.org

@codinghorror 0 surprise there.

codinghorror@infosec.exchange

@slyecho feel free to evaluate yourself using whatever tools you prefer

CIRCLE WITH A DOT

"A recent 2026 empirical study titled "Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering" (published on arXiv/ResearchGate) explicitly tested LLMs on codebase comprehension.