#Mythos finds a #curl vulnerability

david_chisnall@infosec.exchange

AI powered code analyzers are significantly better at finding security flaws and mistakes in source code than any traditional code analyzers did in the past

I’m not sure this follows from what you’ve said in the rest of the post. Static analysers and fuzzers also made it very easy for people to find vulnerabilities and typically found a lot when they were deployed for the first time. And both were a lot cheaper to run than something like Mythos.

They aren’t finding as many vulnerabilities now because projects that are critical for security are integrating them into their CI flows.

And this is what always happens with new techniques such as valgrind, Coverity, sanitisers, and fuzzers: they’re released, they find a load of bugs that existing techniques failed to find, people fix them, they get integrated into regular CI runs, and the kinds of bugs that those tools find never make it into the tree.

Syzkaller, for example, has found a lot more bugs in the Linux kernel than any Anthropic tools. And that’s just one fuzzing tool.

bagder@mastodon.social

@david_chisnall I think it makes sense for everyone to run the "easy" and cheap tools first, and once they all find no more problems, then you bring out the bigger cannons like AI analyzers. So yeah, which is "best"? It probably depends.

http_error_418@hachyderm.io

@bagder @david_chisnall I'm not going to advocate actually doing this because it's expensive and I'm not a fan of the environmental impacts, but I am curious what it would find if you pointed it at the codebase from a time before the other precursor tools like fuzzers were in use. How many bugs can it find that you know with hindsight are there to be found?

pozorvlak@mathstodon.xyz

@http_error_418 I agree, this would be a very interesting experiment - and potentially informative for other teams deciding where to spend limited developer time. @bagder @david_chisnall

oots@infosec.exchange

@bagder
In terms of evidence to the contrary:
Check out
https://social.security.plumbing/@freddy/116549451049357174 / the blog post:
https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/

>270 vulnerabilities found by Mythos fixed in a single Firefox release.

That's just one data point, but interestingly far off from yours.

lascapi@social.tchncs.de

I love it:

"The AI reviews are used in addition to the human reviews. They help us, they don’t replace us."

@bagder

gnirre@mastodon.social

@bagder How do you explain that Mythos found 271 bugs in Firefox, and counting, and only 1 in cURL? Is the Firefox code base 271 times larger?

synlogic4242@social.vivaldi.net

@bagder b-b-b-but curl is not in Rust!

bagder@mastodon.social

@gnirre I do not explain that at all because I don't have enough knowledge to do so.

david_chisnall@infosec.exchange

@http_error_418 @bagder

The original Coverity paper claimed, as I recall, 300 CVEs. I'm not sure what the severity distribution was, but that seems a lot more than Mythos, and they probably used less compute than a single Mythos query.

The problem with any static analyser, whether it's based on formal reasoning or pattern recognition, is that it will be unsound (i.e. it will have false positives, in contrast with dynamic analyses that are incomplete and have false negatives). The LLM-based tools are no different in this respect. From a Claude 'comprehensive code review' of one of my projects, the only serious bug in the top ten that it found was one that already had an open PR to fix, and two were not only not bugs, they were intentional design choices and doing it the other way would have caused serious performance regressions (and not fixed bugs).

The thing that does make Mythos different is that it tries to build a PoC exploit. This will reduce the false positive rate, at the expense of creating false negatives (if it can't produce a PoC, you ignore it).

When I've used Coverity on a large project, it's found tens of thousands of bugs, and most of them are false positives, so it requires a lot of effort to find the ones that are actually important bugs. Something that produces PoCs automatically would help this a lot.
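As a rough illustration of the trade-off described above, here is a minimal sketch in Python. Everything in it is hypothetical: the `Report` type, the `triage` helper, and the five findings are invented for the example and are not output from Mythos, Coverity, or any other tool. It assumes a PoC is only ever produced for a real bug, but that some real bugs never get one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One analyser finding. All fields are made up for illustration."""
    name: str
    is_real_bug: bool  # ground truth, normally unknown in advance
    has_poc: bool      # did the tool manage to build a working exploit?


def triage(reports, require_poc):
    """Return (kept, false_positives, false_negatives) for a triage policy."""
    kept = [r for r in reports if r.has_poc or not require_poc]
    false_positives = [r for r in kept if not r.is_real_bug]
    false_negatives = [r for r in reports if r.is_real_bug and r not in kept]
    return kept, false_positives, false_negatives


# Hypothetical findings: one real bug with a PoC, one real bug without,
# and three reports that are not bugs at all.
REPORTS = [
    Report("overflow in parser", True, True),
    Report("use after free in cache", True, False),
    Report("unchecked return value", False, False),
    Report("intentional fallthrough", False, False),
    Report("possible null deref", False, False),
]

for require_poc in (False, True):
    kept, fps, fns = triage(REPORTS, require_poc)
    print(f"require_poc={require_poc}: kept={len(kept)} "
          f"false_positives={len(fps)} false_negatives={len(fns)}")
```

With this invented data, requiring a PoC removes all three false positives but also drops the real bug for which no exploit could be built.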

The baseline data point I'd really like to see is something that integrates the clang analyser with libFuzzer. For each report the analyser finds, insert profiling points at the branches on the control flow chain that it recommends, then automatically drive the fuzzer to try to trigger the code paths that the analyser reported as potential issues.
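The idea of steering a fuzzer towards an analyser report can be sketched with a toy in Python. This is not the clang analyser or libFuzzer API: the `target` function, the hand-numbered branch IDs, and `REPORTED_PATH` are all hypothetical stand-ins, and the search is a deterministic byte-by-byte hill climb rather than a real mutation engine. It only shows the feedback loop: keep any mutation that reaches more of the branches on the reported path.

```python
def target(data: bytes, hit: set) -> None:
    """Toy program under test; records which numbered branches were taken."""
    if len(data) >= 3:
        hit.add(0)
        if data[0] == ord("b"):
            hit.add(1)
            if data[1] == ord("u"):
                hit.add(2)
                if data[2] == ord("g"):
                    hit.add(3)
                    raise RuntimeError("reported defect reached")


# Hypothetical analyser report: the chain of branches leading to the defect.
REPORTED_PATH = [0, 1, 2, 3]


def score(data: bytes) -> tuple:
    """Run the target; return (number of reported branches hit, crashed?)."""
    hit = set()
    crashed = False
    try:
        target(data, hit)
    except RuntimeError:
        crashed = True
    return sum(1 for b in REPORTED_PATH if b in hit), crashed


def directed_search(seed: bytes, max_rounds: int = 8):
    """Mutate one byte at a time, keeping mutations that get further along the path."""
    best = bytearray(seed)
    best_score, crashed = score(bytes(best))
    for _ in range(max_rounds):
        if crashed:
            break
        improved = False
        for pos in range(len(best)):
            for value in range(256):
                candidate = bytearray(best)
                candidate[pos] = value
                s, c = score(bytes(candidate))
                if s > best_score:
                    best, best_score, crashed, improved = candidate, s, c, True
                    break
            if crashed:
                break
        if not improved:
            break  # no mutation made progress; give up on this report
    return bytes(best) if crashed else None


print(directed_search(b"\x00\x00\x00"))
```

An input returned by `directed_search` plays the role of a PoC for that report. A `None` result means the search could not confirm the report, which is not the same as showing it is a false positive.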

The default settings for the clang analyser are compilation-unit-at-a-time and with reduced bounds on loop iteration counts to avoid using enormous amounts of memory. If you're willing to spend as much money as it costs to operate the LLM-based tools, you can use the cross-compilation-unit approaches and bump the state up a lot. Running it configured to use a comparable amount of RAM to the GPUs that the Anthropic models run on would let you do a lot of symbolic execution.

doragasu@mastodon.sdf.org

@bagder In line with what this blog post stated shortly after it was announced: the model is nothing special and much cheaper models can find the same bugs. Marketing BS turned up to 11. https://www.flyingpenguin.com/the-boy-that-cried-mythos-verification-is-collapsing-trust-in-anthropic/

gnirre@mastodon.social

@bagder Did Anthropic know that you finally had gotten access to Mythos?

bagder@mastodon.social

@gnirre no idea, probably not

spitfire@mastodon.de

@bagder one? wow, that really was worth burning the planet's resources.

kleisli@mastodon.social

@quinn my current opinion: for security scans and reviews, AI tools are and will be useful, but not to generate code. @bagder

gnirre@mastodon.social

@bagder Maybe my question should have been whether Alpha Omega knew? Your access was "unofficial"?

bagder@mastodon.social

@gnirre I don't know how much they asked or told Anthropic about when this was done. It's not "my" access; someone else has the access and ran the analysis.

quinn@social.circl.lu

@kleisli @bagder
If it's something like 10,000 euros a pop, it might not be worth it for security scans and reviews, except for governmental clients.

frankgevaerts@mastodon.social

@synlogic4242 @bagder Yes, someone really needs to get on to that rewriting thing. Just a pity there hasn't been a weekend in *years* so nobody had the chance!

0x0@hachyderm.io

@quinn

Especially if it's subscription-based, as these models seem to be good at finding only specific sets of problems and then dry up, but even 10k per use is really gov or big corpo territory.

@kleisli @bagder
