Anthropic's open-source framework for AI-powered vulnerability discovery

(github.com)

224 points | by binyu 4 hours ago

23 comments

tptacek 3 hours ago
The thing about things like this is that they're shop jigs. You can buy a crosscut sled if you really want to, but most woodworkers just make their own.
It was a different situation 2 years ago, when there was significant cost to building your own harness (but then: you probably weren't doing AI vuln research 2 years ago). Today, I think your best bet is to look at something like this for ideas, and then just ask for your own, to fit your own work style, with your own interface, your own notion of target and effort specification, and your own alerting.
- redfloatplane 2 hours ago
  "Shop jigs" is a great way to put it. I think a lot of software has gone from being made for general use to extremely individualised use. Before the Age of AI, it took so much human effort to write something that solved your problem that you might often go the extra mile so that others could re-use it. Now, it takes almost no effort, so the software stays ungeneralised. Some of the incentive has changed, I think. Most of the time I no longer share the things I've been building[0] because, for one thing they simply couldn't possibly have any benefit for others, and if they need something like it, they can build exactly the thing they want instead of having to extend or modify my thing. Like a jig!
  0: https://redfloatplane.lol/blog/17-why-share/ (and related posts, I guess)
  - colmmacc 1 hour ago
    Unless it is very specific to a proprietary product, craftspeople take their jigs with them from job to job, building up a personal library over a career. As a software developer I've always had a well-tuned IDE and shell config in a safe place.
    Something I think about a lot is: what is the equivalent for today's software builders using AI tools? How do we make these harnesses exportable and portable? You might think employers would be against this, preferring to make it more costly to leave. But I actually think most will favor this because it makes people more productive more quickly. But we have to find ways to normalize it and show that there are no security leaks in the process (such as might make it into a set of personal steering prompts).
    - aquajet 51 minutes ago
      Using something like pi helps. I've made my own dotfiles for skills/extensions I like and can install them just like my normal dotfiles
      https://github.com/anishthite/agent-dotfiles
    - worldsayshi 59 minutes ago
      > craftspeople take their jigs with them from job to job
      Except for software gigs the software typically belongs to the customer so you'd need to rewrite it every time...
      - borski 45 minutes ago
        Depends. If you are a contractor, like most craftspeople, your tools are your own.
    - jaxn 1 hour ago
      i have been thinking about this from a different direction: how do we make these shared within a company in a way that increases the productivity floor of the team/department/company. Sure, they can still be extended/enhanced by individuals, but we don’t need everyone configuring mcps, building institutional memory, etc.
      for me, it’s not about the cost to leave, it’s about lowering the cost of onboarding and change.
  - andhug 2 hours ago
    That’s an interesting way to say “code quality in the age of ai has gone out the window”
    - drtz 1 hour ago
      Are you suggesting that performing a specific task without unnecessary abstractions is indicative of poor quality?
- jorl17 1 hour ago
  This is exactly it.
  I've said many times that I believe "using the computer will transparently involve having it write and run code for you" (and if you're not technical you won't even know it!). What you're saying goes in that direction as well.
  I feel that it's often better for us to create purpose-built tools for our lives, and with every model release, the complexity of those tools grows.
  These are really personal tools: they solve a problem that other people might have, but are very tied to your own specific way of working, and would be hard to explain or adapt to someone else. So: shop jigs.
  I have about 10 custom scripts and programs that are like this -- I haven't felt like this since college! Back then I had all the time in the world to customize my setup...now I have agents!
  In a way, I want to show this to all my friends, but whenever I mentally trace how that would go, I realize they wouldn't really understand a bunch of the quirks they have, because they are _my_ quirks. They're reasonably complex pieces of tech that solve my problems very well, which are themselves particular versions of broader problems, and which I (at least for now) have no interest in supporting.
  It's so clear we're heading in this direction, and yet so many people still believe code will be for the elites. Maybe production-code...As for the rest, I think soon your mom and dad are going to have their computer running code it wrote to serve them. Security-wise it's scary, but it's exciting to think about!
- borski 48 minutes ago
  I agree with this wholeheartedly.
- sieabahlpark 3 hours ago
  [dead]
- zuzululu 2 hours ago
  [flagged]
  - ryancw 2 hours ago
    As a woodworker, it’s a really nice analogy and beyond anything I’ve seen AI do.
    - zuzululu 2 hours ago
      No idea why people are so upset; I genuinely thought his use of analogy was typical of the AI slop comments I'm used to seeing from ChatGPT.
      - Retr0id 1 hour ago
        Believe it or not, people have been making analogies since before AI
          - zuzululu 1 hour ago
            I believe you and you can also believe AI is pretty good at it too.
          - timacles 1 hour ago
            They used to, they still do, but they used to too.
  - ghhhibhc 2 hours ago
    It really doesn’t
    - zuzululu 2 hours ago
      [flagged]
      - sermah 2 hours ago
        user: zuzululu created: 47 days ago karma: 228
        ok
simonw 3 hours ago
I wonder how much this thing costs to run.
https://github.com/anthropics/defending-code-reference-harne... says:
> As a rough guideline, expect ~10K uncached input tokens/min and ~2K output tokens/min per agent. You can scale parallelism up to your account's ITPM limit (roughly 10 agents per 100K ITPM).
My guess would be hundreds of dollars with Opus and thousands of dollars with Mythos.
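For anyone who wants to plug in their own numbers, here is a minimal back-of-envelope sketch built on the quoted guideline. The token rates come from the README excerpt above; the per-million-token prices, the agent count, and the run length are placeholders I made up for illustration, not published pricing, and cached-token charges are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Per-agent token rates quoted in the README excerpt."""
    input_tokens_per_min: int = 10_000   # ~10K uncached input tokens/min
    output_tokens_per_min: int = 2_000   # ~2K output tokens/min


def max_agents(itpm_limit: int, rates: Rates = Rates()) -> int:
    """How many agents fit under an input-tokens-per-minute (ITPM) limit."""
    return itpm_limit // rates.input_tokens_per_min


def estimate_cost(
    agents: int,
    minutes: int,
    usd_per_m_input: float,   # assumed price per 1M input tokens
    usd_per_m_output: float,  # assumed price per 1M output tokens
    rates: Rates = Rates(),
) -> float:
    """Rough USD cost of a run; ignores cached tokens and retries."""
    input_tokens = agents * minutes * rates.input_tokens_per_min
    output_tokens = agents * minutes * rates.output_tokens_per_min
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000


if __name__ == "__main__":
    agents = max_agents(100_000)  # 10 agents at a 100K ITPM limit
    # Hypothetical prices: $5 per 1M input tokens, $25 per 1M output tokens.
    print(estimate_cost(agents, minutes=8 * 60, usd_per_m_input=5.0,
                        usd_per_m_output=25.0))
```

With those made-up prices, 10 agents running for 8 hours comes out to 480 dollars, which is at least in the same ballpark as the "hundreds of dollars" guess. Swap in real prices and your own run length before trusting it.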
- nikcub 3 hours ago
  It's becoming apparent that it requires more tokens to secure code than it does to write it
  May even be an order of magnitude more
  - Mtinie 3 hours ago
    In all seriousness, wasn’t that always the case? Writing bad code is relatively cheap.
    Ensuring code isn’t bad is the expensive part.
    - chrisweekly 1 hour ago
      Sort of?
      The definition of "bad" from a security PoV is rapidly expanding, in light of relatively new capabilities and increasingly cheap access to exploitable vulnerabilities.
      - fny 22 minutes ago
        I don't think the definition of "bad" is expanding. Rather the ability to detect and exploit "bad" is.
  - tptacek 3 hours ago
    For now, maybe, yes? But the most important targets of this kind of work aren't AI outputs; it's legacy code, particularly (but not exclusively) old memory-unsafe code. In those situations the figure of merit isn't the token cost of recreating the target code; it's the cost of finding the same bugs with humans or preexisting tools.
    Those costs can be extremely high.
    - thisogood 2 hours ago
      [dead]
    - ath3nd 3 hours ago
      Any newly produced AI code is immediately legacy and trash at the same time.
  - windexh8er 2 hours ago
    Given the slop that's made its way to Github we can see that this is a great profit model. Ship slop and then "fix" slop. What an efficient use of our planet!
  - bflesch 3 hours ago
    It's weird because why can't they train the AI to simply output secure code?
    The basic security flaws with regards to input validation and overflows should never ever be output by an AI. For "security flaws due to bad design" I'll cut them slack until AGI is achieved.
    - simonw 2 hours ago
      > It's weird because why can't they train the AI to simply output secure code?
      The most interesting security bugs have causes that are spread across large codebases, or networks of dependencies.
      Training the AI to "output secure code" won't work if it doesn't also have access to the source code of every dependency that it's using... and even then, given current model speeds and prices most developers won't want to wait for an hour on every edit they make while the LLM reasons through all of the dependencies.
    - tptacek 2 hours ago
      What's destabilizing the industry right now isn't vulnerabilities AI introduces into new code; it's a flood of sev:hi vulnerabilities in existing code, not introduced by AI but discovered by it.
      - chrisweekly 1 hour ago
        Agreed -- and, compounding the challenge, the flood of _reported_ high-sev CVEs is itself a kind of DDoS attack on maintainers.
    - bobkb 2 hours ago
      I think these audit tools can look beyond just security and can look for compliance audits as well. The ability to audit real targets in staging environments makes it easy to identify issues.
- binyu 2 hours ago
  Claude workflows in ultra code mode work in a very similar fashion and consume a moderate amount of the session usage limit, depending on the complexity of the task. With the API it would probably get expensive quickly, though.
- Terretta 2 hours ago
  If you compare to their managed service, that estimate is likely 1/10th expectation, depending on codebase.
  But even this larger number, in turn, can be about 1/10th the cost of a formal engagement to discover the type of findings it seems to be going for: things that do not show up from PR reviews or even /security-review without the pre-work steps in the open-source framework guided by an expert. That's not counting the time and delay to figure out how to do that engagement.
  Bluntly: if it matters, while this is a month's vibing budget for a single scan, it is also "pennies on the dollar" dirt cheap.
  At the same time, its findings still need an expert. Its suggestions may be helpful, they may be actively harmful, depends on the prework quality.
  Recommendation to IT department heads: spend a couple grand on this, use the scare page to rustle up the budget to build a relationship with a red team that can find, triage, help remediate if needed, and train your in-house team to be "security minded".
- Analemma_ 3 hours ago
  I mean, you don't need to run it all the time, right? You do it once over your entire existing codebase to start and then once over the diff in your CI/CD pipeline when you make a new change. I'm sure it's not literally that simple but I doubt these need to churn 24/7/365 either.
  - xerxes249 3 hours ago
    In the Mythos blog post they revealed that they ran the model something like 1000 times on the same codebase, maybe with a slightly different prompt or temperature. That suggests it will just be pay to win. If the 'attacker' spends more money/tokens than the 'defender', you will eventually be outclassed.
    - sofixa 1 hour ago
      It's even worse, it's loot box style. Not pay to win, but pay to have the chance to win. The result will always be non-deterministic, so for some cases it can give you what you're looking for from the first time, or it can take 1000 tries.
      - beering 1 hour ago
        It’s never not been “loot box style”. None of your past hired security audits were guaranteed to catch all issues?
  - vb-8448 3 hours ago
    You are supposed to run it on the full codebase before any single PR gets merged.
  - jazz9k 3 hours ago
    Companies don't make production pushes yearly. For many, it's two-week sprints... and that's one project.
    This doesn't make any sense cost-wise. It would be cheaper to just hire a security engineer.
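On the "run it 1000 times" and "loot box" points earlier in this thread, the arithmetic can be sketched under one strong assumption: every run is independent and has the same chance of surfacing a given bug. Real runs are unlikely to be that tidy, and the probabilities below are invented for illustration, but the shape of the curve is the interesting part.

```python
import math


def p_found(p_single: float, runs: int) -> float:
    """Chance that at least one of `runs` independent runs finds the bug."""
    return 1.0 - (1.0 - p_single) ** runs


def runs_needed(p_single: float, target: float) -> int:
    """Smallest number of independent runs with p_found >= target."""
    if not (0.0 < p_single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_single and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


if __name__ == "__main__":
    # Assumed, illustrative per-run hit rates; not measured values.
    for p in (0.001, 0.01, 0.1):
        print(p, round(p_found(p, 1000), 3), runs_needed(p, 0.95))
```

Under that assumption a bug with a 1% per-run hit rate needs roughly 300 runs for a 95% chance of being found at least once, so whoever can afford more runs does get more chances, and no finite budget ever reaches certainty.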
baby 17 minutes ago
Our experience has been that without a good harness you don't really get much out of codex/claude. And you really need to spend time and energy figuring out why coding agents can't find bugs like you can.
Every week I see bugs (as an auditor) that our own harness (https://zkao.io/) can't find, and we have to figure out pretty interesting techniques in order to make the tool find them. Mind you I'm talking mostly about cryptographic vulnerabilities, not just webapp bugs. So IMO it's going to make a lot of sense for companies to have both their own harness (as tptacek is talking about) and pay for services that focus on making a good harness from experience (and audit firms are going to be the best at doing this, as they see a lot of bugs and can spend time "teaching" their harness about these bugs)
On the other hand, you have to find equally good techniques to triage, because otherwise you just have some machinery that I call "vibe auditing" that just produces enough false positives to tire all the developers (who are already overwhelmed with crappy AI submissions in bug bounties and other AI tools that review all of their PRs).
At the end of the day, when your harness doesn't return any bug, you're left wondering "does it mean there's no bugs?" We're basically back in this reputation game, where you want to use the best tool, or the best team (that knows what the best tools are), and need to figure out which one that is.
dclavijo 2 hours ago
Slightly off topic: it seems that someone is on a dead/flag rampage killing all good links to GitHub in this post, why?
lanyard-textile 3 hours ago
>This repo is not maintained and is not accepting contributions.
Hm :)
- Hamuko 2 hours ago
  Why isn't Claude maintaining it?
  - skeledrew 2 hours ago
    They're pretty much saying the efficacy of the tool can be tested by anyone to determine if it's worth purchasing the more polished and up-to-date commercial offering.
- spacebacon 3 hours ago
  [flagged]
richardbarosky 3 hours ago
To be sure, security is an amazing AI/LLM use case. A huge swath of the work is pattern matching known security issues against stuff that's very precise to analyze -- programming language text.
Something that stands out is that for the strongest use cases, AI companies will prefer to sell the technique as a service rather than its raw output. For use cases where the output is less valuable, tokens are sold. If AI tokens were so magical in creating new value in developing software applications generally, they wouldn't be selling tokens directly. They'd hoard the tokens and use them to dominate SaaS software in any industry they want.
The same way as someone selling an expensive course in the stock market is signaling that they have more to gain by selling the course rather than taking their knowledge and making money in the stock market directly.
- dgellow 3 hours ago
  > The same way as someone selling an expensive course in the stock market is signaling that they have more to gain by selling the course rather than
  Or they want to diversify
  > If AI tokens were so magical in creating new value in developing software applications generally, they wouldn't be selling tokens directly.
  That requires building and selling a whole product they have little experience with, competing with their own customers. Not a great place for an AI vendor still trying to establish itself. It’s a lot of distraction, when you already have a lot to deal with in the existing business. And strategically not too valuable.
- Kiro 2 hours ago
  > They'd hoard the tokens and use them to dominate SaaS software in any industry they want.
  I don't understand this argument. I've run and sold a semi-successful SaaS. The exhausting and frustrating parts are all the things an LLM cannot help you with. Coding the product is not the bottleneck or what grants you success.
  - zuzululu 2 hours ago
    Good point but I do think LLM helps with those frustrating parts while not being able to outright solve them.
  - richardbarosky 2 hours ago
    > Coding the product is not the bottleneck or what grants you success.
    Agree, and I think that's the core of my point.
    Not that it's irrational or doesn't make sense to sell tokens for purposes of software dev, but that if tokens were a true game changer for success in software dev, they wouldn't be leading with token sales, the same way they're not leading with token sales for security stuff -- it's more like "Contact Sales".
- hyperpape 2 hours ago
  > If AI tokens were so magical in creating new value in developing software applications generally, they wouldn't be selling tokens directly. They'd hoard the tokens and use them to dominate SaaS software in any industry they want.
  This doesn't follow at all. Anthropic's revenue is growing 10x year over year selling tokens. Their tokens can be super magical, let them enter established industries and displace incumbents, and get 100% annual growth in those industries, and they would still be better off prioritizing selling tokens, because it's a great business.
  What your argument shows is that there are limits. Their tokens are not quite powerful enough to make infinite money instantly in every area of software. Admittedly, that does seem true.
  - morpheos137 1 hour ago
    Kind of funny: tokens don't prompt and steer themselves. It's almost as if the value still lies with the human holding the tool.
- skybrian 3 hours ago
  Maybe, but an alternative argument is that building an ecosystem is more valuable in the long run.
  We started out with many companies forbidding their employees to use remote LLMs on their source code because of security concerns. Now many companies are starting to believe that they must analyze all their source code with remote LLMs because of security concerns. When trusting Anthropic becomes normalized, that means they can sell more services that require access to the source code.
- Melatonic 2 hours ago
  Surprised we haven't gotten an integrated "MetaSploit" AI update where it calls and messages a ton of people in a company and, once it starts to find someone possibly vulnerable, lets a human red teamer take over or guide it more by hand.
- derf_ 1 hour ago
  > If AI tokens were so magical in creating new value in developing software applications generally, they wouldn't be selling tokens directly.
  If hardware were so magical in creating new value generally, TSMC would be designing the chips instead of selling fabrication as a service.
  That is what US chip companies used to do, by the way (back when there was silicon in Silicon Valley, before they got their lunch eaten by Taiwan). If TSMC had to design all of the chips they fabricate now, they would be doing a lot less business. Conversely, if any other company that wanted to design a chip had to build their own cutting-edge fab first, NVIDIA would not exist.
- energy123 3 hours ago
  They can only do that if they're a monopoly, which they're not
  - DrewADesign 3 hours ago
    > They can only do that if they're a monopoly, which they're not
    Why do you say that? I reckon lots and lots of companies sell software that aren’t monopolies. Having competition, even stiff competition, isn’t anathema to running a business.
    - energy123 3 hours ago
      You said "They wouldn't be selling tokens directly ... They'd hoard them"
      But they can't do that because they aren't monopolies.
      - DrewADesign 1 hour ago
        > You said
        Just to clarify, I’m not the person you initially replied to.
        > "They wouldn't be selling tokens directly ... They'd hoard them" But they can't do that because they aren't monopolies.
        Hoarding them— not selling any of them, but instead using them internally and selling the products created by them — doesn’t at all seem like it would require a monopoly.
majicDave 1 hour ago
It will always be easier to find a single hole than it will be to seal every one. The hackers have all the same tools, so this is an arms race that cannot be won.
- napoleond 58 minutes ago
  It seems clear that LLMs significantly change threat model math, but this observation alone does not explain how or why; the asymmetry that you’re describing is a property of pre-LLM software as well.
- lateral_cloud 1 hour ago
  Defenders have context that attackers don't though.
newaccount12344 1 hour ago
Let's see how much better it is in comparison to ZAP and Burp. I will test on https://github.com/SasanLabs/VulnerableApp which I built under SasanLabs
bobkb 2 hours ago
Very interesting.
I have been working on and using a similar tool for a while now:
https://github.com/bobinson/vulture
I have been struggling with false positives and using Claude + MCP as a poor man’s audit tool. Over the last few days I've found better results with NVIDIA-hosted models.
euroderf 1 hour ago
Is Anthropic still majority French-owned? It would explain a lot about their entire approach to the wider ecosystem.
bigmattystyles 3 hours ago
I wonder how this sort of product is going over at Coverity and others like it. Proper SAST vendors I mean. Is it an existential threat?
- rms2ds 2 hours ago
  If I had to guess, they'll eventually just add it into their own product and hike the prices up to cover tokens lol.
crooked-v 2 hours ago
I still find it so weird that they haven't bought out whoever controls the `anthropic` github username.
trilogic 3 hours ago
https://github.com/Mainframework/Anthropic-Cybersecurity-Ski...
Be aware: the .py/s will not pass the antivirus but basically they do the job.
extr 2 hours ago
Interesting it's in python!
wslh 2 hours ago
Looking forward to trying this tomorrow (it's late here). Has anyone run it on a real codebase yet? Curious about setup friction, cost, and signal/noise.
bartoszcki 2 hours ago
> Anthropic engineers on average ship 8x as much code per quarter
Are they making 8x more features or the same amount just with more code?
- crooked-v 2 hours ago
  Going by the issues on their repos, it's 2x features and 6x regressions of bugs that were "already fixed".
zoobab 2 hours ago
Open source crap to connect to an LLM blob.
edgardurand 53 minutes ago
[flagged]
jungfty 2 hours ago
[dead]
dclavijo 2 hours ago
[dead]
zoobab 2 hours ago
'open source' crap to connect to their LLM blob.