Introduction #
In a previous post, I gave an overview of various ‘AI Security Engineer’ products on the market, which purport to find real vulnerabilities and bugs in codebases with nothing but their source code (i.e. static analysis). I detailed my experience testing most of the products on the market against both known-vulnerable (or malicious) code and various open-source codebases. That post blew up, especially in the security and open-source communities, in part due to my calculated decision to test the AI scanners on the curl codebase, whose maintainers have famously denounced the so-called “AI slop” bug bounty reports they regularly receive. curl’s maintainer Daniel Stenberg described me as a “clever human” (what a compliment), blogged about his experience with the raw results from my scans, and even remarked that “A good tool in the hands of a competent person is a powerful combination” (again, what a compliment). I also discussed my experience with using these scanners on the Open Source Security podcast, where I even revealed to the world that I’ve been working full-time at a cryptocurrency firm since January.
Approximately 98% of the bugs reported and fixed in the curl codebase were discovered by ZeroPath. At the moment, the curl project has fixed around 150 of the bugs detected by ZeroPath, and I believe that by the end of next week it will be around 200. That’s kind of amazing, and it certainly proves to me that AI static analyzers can actually provide real value, especially in large-scale codebases. With all of that said and done, I now have more to discuss based on the things I’ve learnt, seen, and done since the original post.
In this post, I will discuss some of the other scanners that I learned about, some that I gained access to and tested, and some more thoughts on AI, security, and the world itself.
AI Security Scanners #
In my previous post, I mentioned that by far the best scanner I tested was ZeroPath; I was blown away by the results of scanning both private code and open-source code. Since that post, I have been sent a few other products which I wanted to detail as well.
- Fraim, which is an open-source product (although it seems to just be a hobby project).
- LAST, which is an open-source product developed by Latio Tech (it’s really just a proof of concept, not a real product).
- metis, which is an open-source product developed by Arm.
- Ghost Security, which has one of the .. cheesiest websites (and general marketing/company scheme) I’ve ever seen.
- Aisle, which was just released a few days ago.
- Endor Labs, which isn’t actually an AI scanner per se, but .. uses AI to find vulnerabilities in AI-generated code and in actual AI systems (as well as for dependency vulnerability checking à la SCA and reachability analysis of vulnerable CVEs).
- HackerOne Code, which is a rebranding of HackerOne’s pullrequest.com.
- TerraSecurity, which .. well, it’s hard to say, because their website doesn’t actually say what they do. But it seems they are some type of external “AI-powered” security scanner (similar to XBOW).
- DryRun Security, which is not new at all and which I even mentioned in my original post, but I finally got access to test it.
I have not tested all of these products for various reasons (time constraints, setup requirements, etc.), but I will note all of them.
Fraim ##
Fraim seems to be a hobby project which isn’t necessarily trying to be the best on the market – and I commend them for that. Unfortunately, when I tried to set Fraim up, it didn’t work for me at all. Following the documentation did not get the script to actually run, and other attempts just resulted in different errors. Nonetheless, it seems like an interesting small project that probably does something when it actually works – it’s just that the initial setup is difficult.
LAST ##
I didn’t test LAST at all, but speaking with its developer, it seems development on it is more or less “finished”. As he said, “I realized to do more I’d have to build an actual SAST and didn’t want to”. Reading the source code, it seems this basically just funnels files into a commercial LLM, with some pre-defined “responses” used for few-shot prompting.
metis ##
This one is probably the most interesting of the three open-source products, simply because it’s developed by Arm. I also didn’t try this one out, but reading the code, it seems to work as follows:
- It indexes the codebase into some type of storage (like ChromaDB) using chunking.
- It retrieves code context for each file (or change, if it’s a diff/PR scan) using RAG.
- It prompts some LLM with language-specific templated prompts that look for specific vulnerabilities.
- It reports findings.
There doesn’t seem to be a false positive detection stage in metis, but that’s fine for this type of open-source, non-security-company product. If you’re a small product security/appsec team that wants something to run completely locally, or doesn’t want to spend a measly $40-50/developer on a commercial scanner, this is likely your best bet. While it certainly won’t be able to compare with commercial products, it does seem like quite a nice small project. At the moment, it only supports scanning of C, C++, Python, and Rust code.
Ghost Security ##
At first, I thought Ghost Security was not a real product, because their homepage talks about “exorcists”, “ghosts”, “scary”, and “freakish vulns”, and their whole motif revolves around ghosts. While in a way that’s kind of .. cute, I sort of just rolled my eyes at it all and just wanted to know what they are actually doing and what their product is good for. Also, their homepage has no link to the actual application or login page.
They are a real product though, and I tested it out. The UI was a bit clunky (and there was a 60-minute delay to add repositories), but eventually I got some scans going.
I was told “we are designed specifically for evaluating vulns on only web-applications. We train our agents on specific web app frameworks to get higher precision and lower FPs”, so I scanned some React project, and .. all I got was false positives; and really bad ones, too. For example, the scanner said that an XSS was found due to the code `<Link href={var}>`. Anybody with some React experience knows that this is not vulnerable to XSS at all. In response to this, the Ghost Security team said that “we just added frontend agents in the last few days – clearly we have some tuning work to do there.” As far as I can tell, they pulled React support when I told them about these false positives, which actually .. I appreciate.
When I asked whether I could export the findings, I was simply given a link to the API – not so helpful. Afterwards, I was linked to the project gregcmartin/ghost-sarif on GitHub, which the Ghost Security people claimed was a “quick API client -> SARIF tool you can try out”. When I tried the tool, it .. did not work. It had been vibe-coded and seemingly not tested, with hallucinations about API endpoints and parameters scattered throughout the source code. I sent a few patches which actually made it work, but this was what I would call .. not cool :).
I continued testing the scanner with another web application (Java with Spring Boot), and there were way, way, way too many false positives for me to spend any sort of time digging for the real results.
I got the impression that this product may one day be not so bad, especially if they focus on specific frameworks and avoid falling into the “bad at everything” rather than “good at one thing” category. Right now, however, my tests didn’t give me confidence in their product. Their lame/tacky/annoying “ghost” motif was also distracting.
Aisle ##
I have not tested Aisle, mainly because they seemingly only launched a few days ago. They have some impressive names working there behind the scenes, though.
Endor Labs ##
I have not tested Endor Labs, but I like the idea of deliberately scanning AI-generated code for the vulnerabilities that AI code generators commonly write (and, at the end of the day: what a huge waste of money and electricity, but whatever).
They also seem to be heavily focused on scanning source code for vulnerable dependencies, and determining whether the (normally absolutely horrible, spammy) public vulnerabilities/CVEs in those dependencies actually add any risk to your codebase in the real world (they very, very rarely do). Basically, AI-based triaging of public vulnerabilities in your codebase’s dependencies.
HackerOne Code ##
HackerOne Code comes from an acquisition of pullrequest.com, and is a pull request vulnerability scanner, analyzing changes made in codebases before they land. The idea is that the system scans pull requests to your codebase and then .. as I understand it .. basically offloads false positive detection to real humans. My experience with HackerOne (and all of the bug bounty platforms) has been absolutely abysmal (because the model incentivizes low-impact, high-noise vulnerability reporting; see Goodhart’s law), so it’s difficult to tell how this system works in practice. How can a random human triager, at the click of a button, understand the intricate details, invariants, constructs, and functionality of a large codebase in order to verify issues independently? Sure, they might be able to say “yes, definitely this is vulnerable to SQL injection”, but what about anything that isn’t .. one of the typical vulnerabilities that everybody knows? What about the actually important stuff that requires a brain to think about: design flaws, logic flaws, and issues that are specific to the very codebase at hand? Anyways, I didn’t test it, but it was interesting to hear that it exists. I have no other opinions about this system.
TerraSecurity ##
As noted, it’s not really obvious what TerraSecurity actually does. Their homepage states that “Agentic-AI powered, continuous web app penetration testing. Terra’s AI agents are supervised by human expert testers, with unparalleled coverage, full business context and real-time adaptability”. What does “supervised by human expert testers” mean? What does “unparalleled coverage” mean? What does “full business context and real-time adaptability” mean? Their website is also kind of broken, with certain logos not loading in their “Testimonials” section.
Although I didn’t get to test their product (or even really find out what it is), I wanted to write it down anyways. I highly suspect this is some kind of DAST which runs scans on a live website, with a human triaging findings or pointing the system towards certain sections of a website. I am not so bullish on DASTs; I think that type of dynamic testing is like playing the lottery or the slot machines, or waiting for monkeys to type out Shakespeare.
DryRun Security ##
I mentioned DryRun in my previous post, and that I wasn’t able to test their product due to an annoying licensing issue (it required a first-month-free, cancel-any-time contract that I couldn’t be bothered to deal with). Following my post, the founder gave me access to try it out.
As suspected, the scanner was pretty good. However, they currently only scan PRs, so I couldn’t scan a full repository like curl. Based on the scanning I did, I would say that DryRun’s scanner fits somewhere around 2nd or 3rd place on the leaderboard that I previously compiled, i.e. just behind ZeroPath (1st). Apparently they’ll be adding a full-repo scanner soon, so I hope to try it out.
Using AI source scanners in practice #
Although these products market themselves as AI security scanners, as I’ve shown previously, they can be used to find critical bugs in codebases too, when prompted. I have continued to use ZeroPath to scan some codebases (essentially because it’s the only product that left such a good impression on me: it really works, it actually does something, does it well, and does exactly what I want), and I have been reporting more issues to projects like curl, openssl, and some others. In a way, it’s all very addictive, like a gambling slot machine: you click a button, you wait a minute, and you get the reward (the scan results) back (“just a few more LLM tokens bro, I promise, last time”).
What’s also interesting about these scanners is that the good ones have false positive detection, which really seems to work quite well. From the testing with curl, Daniel reported that around 20% of the issues found by ZeroPath were false positives.
It’s funny. LLMs seem to have this problem where they try to show that something is “real” or “true” as much as they can, effectively lying to the user, so false positive detection seems like it should be a difficult problem for AI source code scanners. Yet it has seemingly been dealt with quite well: the problem certainly remains for ChatGPT, but not so much for these scanners:
(Image: ChatGPT vs. StackOverflow)
What’s also interesting is the whole RAG and codebase-compression approach, which some of these tools use to analyze the source code. It’s kind of counter-intuitive, but compressing the codebase into vectors seems to give higher quality results than just working with the large codebase in its raw form. This reminds me of my thesis in applied mathematics, where I looked at how applying machine learning techniques to trends in the data could actually yield better and more accurate results than working with the raw data (e.g. instead of using machine learning techniques on the raw data, you use them on an approximation represented by some polynomial). That was a long time ago, though.
curl bug reports ##
Some of the bugs reported in curl are really amazing, and I wanted to highlight some of them. The bugs aren’t necessarily critical, but the fact that they were discovered with an AI scanner is fascinating. If you’re interested in viewing all of the curl issues, they can be found here and here; 98% of them were found using ZeroPath.
Of these results, Daniel even publicly stated that he was “almost blown away by the quality of some of [the findings from ZeroPath]”, with some “actually truly awesome findings”. Indeed, after being so publicly vocal against AI-generated bug reports (or more specifically “AI slop”, i.e. invalid AI-generated bug reports), it’s kind of amazing that he has seemingly come around to the idea that AI can actually provide value here.
Anyways, I’ve compiled a list of some interesting (in my opinion) findings.
Developer Intent vs. Code ###
rustls ####
In curl’s rustls integration, if a file size was a multiple of 256 bytes, curl would error out.
SASL ####
In curl’s SASL implementation, incorrect bit manipulation failed to disable certain SASL mechanisms if they were advertised by the server but not supported by curl. The scanner noticed that the developer intent expressed in the comment was different from the code as written:
/* Remove the offending mechanism from the supported list */
sasl->authmechs ^= sasl->authused;
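To see why this is wrong, here is a minimal standalone sketch (with hypothetical mechanism bits, not curl’s actual definitions): `^=` toggles bits, so a mechanism whose bit wasn’t set gets turned *on* rather than removed, while `&= ~` actually clears it.

```c
#include <stdio.h>

/* Hypothetical mechanism bits, purely for illustration. */
#define MECH_PLAIN   (1 << 0)
#define MECH_LOGIN   (1 << 1)
#define MECH_XOAUTH2 (1 << 2)

int main(void)
{
  unsigned int authmechs = MECH_PLAIN | MECH_LOGIN; /* mechanisms we support */
  unsigned int authused  = MECH_XOAUTH2;            /* offending mechanism to remove */

  /* Buggy: XOR toggles bits, so a bit that wasn't set becomes *enabled*. */
  unsigned int buggy = authmechs ^ authused;

  /* Intended: clear the offending mechanism's bit. */
  unsigned int fixed = authmechs & ~authused;

  printf("buggy: 0x%x, fixed: 0x%x\n", buggy, fixed); /* prints 0x7 vs 0x3 */
  return 0;
}
```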
HTTP/3 ####
In curl’s HTTP/3 integration, the idle timeout checking code was incorrectly using nanoseconds to calculate whether a connection had timed out. In addition, the scanner also worked out that the documented optional “no-timeout” feature was completely ignored if used.
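As a rough illustration of this class of bug (a hedged sketch with made-up names, not curl’s actual code): mixing nanoseconds and milliseconds in the comparison makes the timeout fire roughly a million times too early (or, with the mismatch in the other direction, effectively never).

```c
#include <stdint.h>
#include <stdbool.h>

/* All names here are hypothetical, purely to illustrate a unit mismatch. */
#define IDLE_TIMEOUT_MS 30000 /* intended: 30 seconds */

/* Buggy: 'elapsed_ns' is in nanoseconds but the threshold is in milliseconds,
   so this returns true after 30000 ns (0.03 ms) instead of after 30 s. */
static bool idle_expired_buggy(uint64_t elapsed_ns)
{
  return elapsed_ns > IDLE_TIMEOUT_MS;
}

/* Correct: convert to a common unit before comparing. */
static bool idle_expired_fixed(uint64_t elapsed_ns)
{
  return (elapsed_ns / 1000000) > IDLE_TIMEOUT_MS;
}

int main(void)
{
  uint64_t elapsed_ns = 5ULL * 1000000000ULL; /* only 5 seconds of idleness */
  /* buggy version already claims a timeout; fixed version does not */
  return (idle_expired_buggy(elapsed_ns) && !idle_expired_fixed(elapsed_ns)) ? 0 : 1;
}
```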
SSH/SFTP ####
The scanner worked out that if a connection was resumed while uploading a file over SSH/SFTP, the file would be truncated with no error reported, meaning an incomplete file would be uploaded.
RFC Violations ###
This is the part I found the coolest during my expedition into these tools. No human code reviewer (in a high-level review) is going to go off and read and understand the RFC for every single component of a tool like curl. But the scanner did!
Telnet ####
In curl’s Telnet implementation, sub-negotiation payloads were written without escaping IAC (0xFF) symbols. If any user-controlled value contained 0xFF, it could be parsed as a command midstream and break the negotiation. The fix here was to just refuse any IAC symbols, as there’s little legitimate real-world usage of this symbol.
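For context, Telnet’s escaping rule (RFC 854) is that a literal 0xFF data byte must be sent doubled so the peer doesn’t treat it as the start of a command. The sketch below shows what such escaping would look like; note that curl’s actual fix was to refuse IAC bytes outright rather than escape them.

```c
#include <stddef.h>

#define TELNET_IAC 0xFF

/* Copy 'in' to 'out', doubling any literal IAC (0xFF) byte so the peer
   does not interpret it as the start of a command. 'out' must have room
   for up to 2 * len bytes. Returns the number of bytes written. */
static size_t telnet_escape_iac(const unsigned char *in, size_t len,
                                unsigned char *out)
{
  size_t w = 0;
  for(size_t i = 0; i < len; i++) {
    out[w++] = in[i];
    if(in[i] == TELNET_IAC)
      out[w++] = TELNET_IAC; /* escape by duplication, per RFC 854 */
  }
  return w;
}

int main(void)
{
  const unsigned char payload[] = { 0x01, 0xFF, 0x02 };
  unsigned char escaped[sizeof(payload) * 2];
  /* the single 0xFF gets doubled, so 3 input bytes become 4 output bytes */
  return telnet_escape_iac(payload, sizeof(payload), escaped) == 4 ? 0 : 1;
}
```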
Another bug discovered in the telnet code related to an error case in which an error message was logged, but … the program continued anyway (and did not return the error).
Dead Kerberos Code ####
There was a buffer overflow in Kerberos FTP handling. However, as it turned out, the whole Kerberos code was broken, and anybody who had actually tried to use curl with Kerberos FTP in the past year or so would have been completely unable to. In the end, the solution was .. to completely remove Kerberos FTP from curl. That’s a big win IMO. Having a tool like this, which can point out broken code that is unmaintained (which is a security issue in and of itself), gives security teams leverage to sunset code that will one day just, simply, break. Finally, a “business reason” to delete old code and reduce attack surface.
TFTP ####
The TFTP RFC states that the client must stick to the server’s first-chosen UDP port for the whole transfer, and that packets from any other port should be discarded. The scanner, somehow equipped with this knowledge (seriously, wtf), discovered that packets were not validated against the initially chosen server port, which allowed an on-path or same-network attacker to inject a valid-looking DATA or OACK packet and hijack the transfer.
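A minimal sketch of the check the RFC implies (hypothetical names and state, not curl’s implementation): lock onto the source port of the server’s first response and drop datagrams arriving from any other address or port.

```c
#include <stdbool.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical per-transfer state: TFTP servers answer from a transfer-
   specific port (not port 69), and the client must lock onto it. */
struct tftp_state {
  bool locked;
  in_port_t server_port;     /* network byte order */
  struct in_addr server_addr;
};

/* Returns true if a datagram received from 'from' belongs to this transfer. */
static bool tftp_accept_packet(struct tftp_state *s,
                               const struct sockaddr_in *from)
{
  if(!s->locked) {
    /* First response: remember this address/port for the rest of the transfer. */
    s->server_port = from->sin_port;
    s->server_addr = from->sin_addr;
    s->locked = true;
    return true;
  }
  /* Later packets must come from the same address and port; anything else
     could be an injected DATA/OACK from an attacker and is discarded. */
  return from->sin_port == s->server_port &&
         from->sin_addr.s_addr == s->server_addr.s_addr;
}

int main(void)
{
  struct tftp_state st = { 0 };
  struct sockaddr_in first = { 0 }, spoofed = { 0 };
  first.sin_port = htons(49152);
  spoofed.sin_port = htons(65000);

  tftp_accept_packet(&st, &first);                   /* locks onto port 49152 */
  return tftp_accept_packet(&st, &spoofed) ? 1 : 0;  /* spoofed port rejected */
}
```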
SMTP ####
The RFC for SMTP states that certain keywords in the exchange between server and client must be treated as case-insensitive. The scanner discovered that the parsing of these keywords was case-sensitive. This could lead to situations where encryption would not be used for communication, when an SMTP server responded with a lowercase keyword.
IMAP ####
Just like SMTP, the RFC for IMAP states that certain keywords in the exchange between server and client must be treated as case-insensitive. The scanner discovered that the parsing of these keywords was case-sensitive. This could lead to situations where encryption would not be used for communication, when an IMAP server responded with a lowercase keyword.
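As a generic illustration of the SMTP/IMAP keyword issue (a contrived example, not curl’s actual parser): a case-sensitive comparison silently misses a capability like STARTTLS when the server advertises it in lowercase, while a case-insensitive comparison handles both.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h> /* strncasecmp (POSIX) */

int main(void)
{
  const char *line = "250-starttls"; /* lowercase, but still valid per the RFC */

  /* Buggy: case-sensitive match misses the capability entirely. */
  int buggy_has_tls = (strncmp(&line[4], "STARTTLS", 8) == 0);

  /* Fixed: the keywords are case-insensitive, so compare accordingly. */
  int fixed_has_tls = (strncasecmp(&line[4], "STARTTLS", 8) == 0);

  printf("buggy: %d, fixed: %d\n", buggy_has_tls, fixed_has_tls); /* 0, 1 */
  return 0;
}
```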
Documentation vs. Reality ###
Like the mismatches between developer intent and the actual code, documentation was also found to be either outdated or just flat-out incorrect, resulting in broken code with broken contracts. In the following report, the scanner correctly identified that the documentation for the function `Curl_resolv` stated that the function parameter `entry` may be NULL. That actually isn’t allowed, though, and if anybody had passed NULL, the program may have crashed in some circumstances.
# `Curl_resolv`: NULL out-parameter dereference of `*entry`
* **Evidence:** `lib/hostip.c`. API promise: "returns a pointer to the entry in the `entry` argument (**if one is provided**)." However, code contains unconditional writes: `*entry = dns;` or `*entry = NULL;`.
* **Rationale:** The API allows `entry == NULL`, but the implementation dereferences it on every exit path, causing an immediate crash if a caller passes `NULL`.
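The usual way to honour such a contract is to guard every write to the out-parameter; here is a minimal generic sketch (hypothetical names, not curl’s actual fix):

```c
#include <stddef.h>

struct dns_entry { int dummy; };   /* stand-in type, for illustration only */

static struct dns_entry *lookup(const char *host)
{
  (void)host;
  return NULL;                     /* stub resolver for the sketch */
}

/* If the out-parameter is documented as optional, every write must be
   guarded; an unconditional '*entry = ...' crashes callers passing NULL. */
static int resolve(const char *host, struct dns_entry **entry)
{
  struct dns_entry *dns = lookup(host);

  if(entry)
    *entry = dns;   /* only dereference when the caller actually asked for it */

  return dns ? 0 : -1;
}

int main(void)
{
  return resolve("example.com", NULL) == -1 ? 0 : 1; /* NULL is now safe */
}
```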
Memory Leaks ###
A whole ton of memory-management issues were found: memory leaks and file descriptor leaks in nearly all functionality of curl. In a few cases, the scanner discovered that the incorrect memory-management functions were used (e.g. calling the wrong freeing function for how the memory was allocated).
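A classic example of that last category, using libcurl’s own public API (a generic illustration, not one of the reported findings): memory returned by `curl_easy_escape()` must be released with `curl_free()`, not the plain `free()` of whatever runtime the application happens to link against.

```c
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(!curl)
    return 1;

  char *escaped = curl_easy_escape(curl, "a b&c", 0); /* 0 = use strlen() */
  if(escaped) {
    printf("%s\n", escaped);
    /* Wrong: free(escaped); -- may not match libcurl's allocator. */
    curl_free(escaped);   /* correct: pairs with libcurl's allocator */
  }

  curl_easy_cleanup(curl);
  return 0;
}
```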
Other ###
In the case of an error in handling a certificate revocation, a variable containing an error message was set only after the error message had already been used. That’s kind of funny.
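In other words, the ordering bug looks roughly like this (a contrived sketch, not the actual curl code):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
  char errbuf[64];
  errbuf[0] = '\0';   /* message not filled in yet */

  /* Buggy ordering: the message is printed before it is actually set. */
  fprintf(stderr, "revocation check failed: %s\n", errbuf); /* prints nothing useful */
  strcpy(errbuf, "certificate has been revoked");            /* too late */

  return 1;
}
```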
In a test program, an unimplemented program flag was sitting in the output of `--help`, with no associated code.
Conclusion #
In my previous post, I concluded by saying “the biggest value I’ve seen so far is not just in finding vulnerabilities, but in surfacing inconsistencies: mismatches between the developer intent and actual implementation, mismatches between business logic and reality, broken assumptions hidden deep in the code, and logic that simply doesn’t make sense when you look at it twice.”
My conclusion has not changed since. Not only can these scanners discover the issues that traditional static analysis can find, they also find things that are simply not possible with normal static analysis. I see traditional static analyzers as spell-checkers, because they effectively have a pre-defined list of mistakes or errors that they look for, while these AI static analyzers are more like grammar checkers, because they use context to determine whether something is a mistake or not. Indeed, the irony of all of this is that static analysis is by definition just pattern detection, while these AI analyzers are detecting “logic” or “reasoning” – but at the same time, the foundation of LLMs and AI is .. pattern matching! The difference is that they’re matching patterns against basically .. the whole internet (the training data, documents, stackoverflow discussions, whatever), rather than just some small queries or rules like traditional SASTs.
This is all an exciting area for cybersecurity and I hope to continue doing research into the market, because this really seems like the future – and this is only the beginning. Indeed, as Daniel Stenberg noted, “in the curl project we continuously run compilers with maximum pickiness enabled and we throw scan-build, clang-tidy, CodeSonar, Coverity, CodeQL and OSS-Fuzz at it and we always address and fix every warning and complaint they report so it was a little surprising that this tool now suddenly could produce over two hundred new potential problems. But it sure did. And it was only the beginning.”