@bms48 @downey
-
The thing that’s hidden when projects get reports from Anthropic is how much human triage is needed.
I had someone send me a code review of one of my projects done with Claude 4.6 (which, apparently, is as good as Mythos at finding bugs but less good at producing PoC exploits). Of the top ten bugs, most were not bugs (e.g. missing null checks on things where the API contract requires non-null arguments). Two were intentional design choices and the proposed changes would have made things slower. One was a bug that needed fixing, but there was already an open PR to fix it before Claude looked at the project.
The signal-to-noise ratio is worse than Coverity, and FreeBSD hasn’t had the resources to triage / fix all of the issues the free Coverity scans found in 15 or so years of having access to it.
@david_chisnall @bms48 @downey
I have not been impressed by Coverity in the Varnish/Vinyl context.
-
@downey @david_chisnall @bms48 I have the same fear. On the other hand, Firefox 150 apparently fixes 251 bugs, and I wonder how they did it.
For a hardened target, just one such bug would have been red-alert in 2025, and so many at once makes you stop to wonder whether it’s even possible to keep up.
-
I did some triaging of libc reports in FreeBSD from Coverity about ten years ago. The false positive rate was very high, but lower than I’ve seen for Claude.
We use the clang analyser in CI for CHERIoT RTOS (and our clang has a growing number of CHERI-specific analyses). It isn’t as good as Coverity in general but it has found real bugs in my code prior to merging PRs.
It’s much easier to use for a new project than an established one. Each time we turn on a new analysis there is a period of checking each report and adding comments to silence it if it’s a false positive, but it’s fairly short. A project that’s already millions of lines of code, going from nothing to all of the analyses, just has a huge pile of things to wade through.
-
@downey @david_chisnall @bms48
a code review of one of my projects done with Claude 4.6 (which, apparently, is as good as Mythos at finding bugs but less good at producing PoC exploits)
There is a huge difference there, though: a pipeline producing actual PoC exploits implicitly filters out all reports that are not actionable, so it produces far fewer false positives (if any at all, depending on the internal validation done on the exploits).
-
@david_chisnall @bsdphk @downey clang-cfi made it onto my C++ tools list yesterday morning, while I was digging further into the state of the art in hardware capability enforcement.
-
It’s not clear how many of those were serious and how much they were triaged before being handed to the Firefox team.
To put that number in perspective, there was a paper on FFI bugs a few years ago that found around 300 vulnerabilities in Chromium’s DOM to JavaScript bindings. That’s a single bug class in a single subsystem of a browser. Chromium has since moved to more machine-generated code for this boundary and eliminated most of that bug class by construction.
-
@david_chisnall @lapo @downey The claim persists that many of the bugs Mythos reportedly found in Firefox required taking down the sandbox first.
-
Yes, that’s kind-of true, but context matters. The reports I saw were for a library, so the context is callers of the library. One report, for example, was in a function that is called by compiler-generated code. It would crash if passed a null pointer, but the compiler will never pass it a null pointer. A fuzzing harness could easily generate a test case that passed it a null pointer. Adding a null check there would have impacted performance of the hottest code path in the entire library.
The Firefox reports that Anthropic made public weren’t in Firefox, they were in SpiderMonkey running in a test harness. How many bugs were reachable by the test harness but not by Firefox? Especially since SpiderMonkey runs in the sandboxed child process in Firefox, which is assumed according to the threat model to be compromised.
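To make the null-pointer case above concrete, here is a rough sketch of the pattern in C++17. The names `bump` and `bump_checked` are made up for illustration and are not taken from the library in question; the assumption is simply a hot-path function whose contract says the pointer is never null, plus a separate checked entry point for callers that cannot promise that.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical hot-path routine. Contract: `counter` is never null.
// The contract is only enforced by assert, which compiles to nothing
// when NDEBUG is defined, so release builds pay no cost for it.
inline std::size_t bump(std::size_t *counter) noexcept
{
    assert(counter != nullptr && "contract: counter must be non-null");
    return ++*counter;
}

// Hypothetical checked wrapper for callers outside the contract
// (for example a fuzzing harness or other untrusted input).
// Returns std::nullopt instead of dereferencing a null pointer.
inline std::optional<std::size_t> bump_checked(std::size_t *counter) noexcept
{
    if (counter == nullptr)
    {
        return std::nullopt;
    }
    return bump(counter);
}
```

A tool that only looks at `bump` in isolation will report the missing null check. Whether that is a bug depends on who is allowed to call it: code that always upholds the contract can call `bump` directly, and anything else can go through `bump_checked` and pay for the branch there.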
-
@david_chisnall @lapo @downey Looks like we just synchronously converged on the same latter point... as for the former I'm looking at defensive use of modern C++ nullptr in concepts, contracts and other mechanisms. I still have the last High Integrity C++ draft to review but it is looking rather dated just now.
-
@david_chisnall @lapo @downey Sweedack. Modules still strike me as knitting yoghurt. Kenton Varda's standing advice to eliminate singletons on sight (for capability reasons) is very sound. https://kentonshouse.com/singletons
-
@david_chisnall @lapo @downey @ludicity "Sweedack" is an oblique reference to John Brunner's seminal science fiction novel "The Shockwave Rider" which reads very differently in the now, and clearly inspired someone (swr) who worked at Entercept on their form of Domain & Type Enforcement (DTE), just before I interviewed there in 2001 as the dot-com crash was about to happen, when I'd had the mess of eTrust to deal with inside JPMorganChase as 3rd line security, adding net promisc logs to Solaris.
-
@david_chisnall @lapo @downey @ludicity That would be the individual who assumed swr as his nom-de-plume: https://phrack.org/issues/56/4