Better Models: Worse Tools

(lucumr.pocoo.org)

70 points | by leemoore 3 hours ago

16 comments

  • simonreiff 11 minutes ago
    Hey, an article right up my alley! AI infrastructure/tools engineer here (hic-ai.com); my flagship product, HIC Mouse, is a precision-editing system for coding agents designed to work across a wide array of models and harnesses. Mouse provides 11 tools exposed via MCP for read-, find-, and edit-operations, using a coordinate-based schema (as well as exact and multiple string replacement), a Dialog Box inspect/refine/save/cancel changes functionality controlled by the agent to force staging and review of multi-operation or large edits before changes are written to disk, and extensive agent guidance mechanisms or guardrails to help the agent realize if it's about to do something potentially destructive or overly verbose.

    I definitely think models may be trained to use particular popular harnesses or to expect certain fields in the editing-tool or other tool schemas. Rather than trying to conform to (or force) one particular format, my approach is to design flexibly enough to handle a wide array of possible inputs and tool calls, to auto-normalize results whenever reasonable, and to help the agent recover whenever its tool calls truly can't be salvaged and have to return errors. It really does make a very dramatic difference (I wouldn't have bothered to launch if I thought it wasn't a meaningful advance) but anyway, just wanted to share my perspective given that I live and breathe this problem all day, every day.
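
A minimal sketch of the "accept a wide array of inputs, normalize where reasonable, return a recoverable error otherwise" idea from the comment above. This is not Mouse's actual schema or code: the canonical field names (`path`, `old_text`, `new_text`), the alias table, and the function name `normalize_edit_call` are all hypothetical, chosen only for illustration.

```python
from typing import Any

# Hypothetical canonical schema and alias table (assumptions, not a real tool's schema).
REQUIRED = ("path", "old_text", "new_text")
ALIASES = {
    "file_path": "path",
    "filename": "path",
    "old_string": "old_text",
    "new_string": "new_text",
}


def normalize_edit_call(args: dict[str, Any]) -> dict[str, Any]:
    """Map known aliases to canonical names, drop unknown fields,
    and return a descriptive error if a required field is still missing."""
    normalized: dict[str, Any] = {}
    dropped: list[str] = []
    for key, value in args.items():
        canonical = ALIASES.get(key, key)
        if canonical in REQUIRED:
            # If both a canonical name and an alias are supplied, the first one wins.
            normalized.setdefault(canonical, value)
        else:
            # Invented or extra fields are ignored but reported back.
            dropped.append(key)
    missing = [name for name in REQUIRED if name not in normalized]
    if missing:
        # The error names the expected fields so the agent can retry.
        return {
            "ok": False,
            "error": (
                f"Missing required field(s): {', '.join(missing)}. "
                f"Expected fields: {', '.join(REQUIRED)}."
            ),
        }
    return {"ok": True, "args": normalized, "dropped": sorted(dropped)}


result = normalize_edit_call(
    {"file_path": "a.py", "old_string": "x", "new_string": "y", "reason": "tidy"}
)
print(result)
```

Reporting the dropped fields, rather than discarding them silently, is one possible design choice here; it keeps the repair visible to whoever reads the logs.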

  • socketcluster 1 hour ago
    When building agent integration for my serverless backend https://saasufy.com/, I decided to not use MCP but to put curl commands inside skill markdown files instead: https://github.com/Saasufy/skills

    The curl command is extremely popular so models seem to be really good at using it.

    Also I like that curl uses a bash syntax and my platform requires JSON payloads; it makes the separation clear to the agent. I find it to be very reliable.

    • gchamonlive 31 minutes ago
      The skills are very readable too, so you get nice documentation for free. At the very least they're human-readable machine instructions.
  • aetherspawn 27 minutes ago
    Surprised models still output tools as text when for ages we’ve been able to constrain the output at the inference-engine level and restrict which tools, parameters, etc. are available to the model

    Edit: found it, it’s called Grammar-Constrained Decoding (GCD)
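
For readers unfamiliar with the term, here is a toy sketch of the idea behind constrained decoding: at each step, only tokens that keep the output a prefix of some allowed string are considered. Everything here is an assumption for illustration (the tiny vocabulary, the allowed tool names, the fake scoring function); real inference engines apply the same kind of mask to logits using a grammar or JSON schema rather than a list of strings.

```python
from typing import Callable


def constrained_greedy_decode(
    score_fn: Callable[[str, str], float],
    allowed_outputs: set[str],
    vocab: list[str],
) -> str:
    """Greedy decoding restricted to tokens that keep the output a prefix
    of at least one allowed string. Stops at the first complete allowed string,
    so this sketch assumes no allowed string is a prefix of another."""
    output = ""
    while output not in allowed_outputs:
        candidates = [
            tok
            for tok in vocab
            if tok and any(s.startswith(output + tok) for s in allowed_outputs)
        ]
        if not candidates:
            # A fuller implementation would avoid dead ends by checking that
            # each candidate still leaves a reachable completion.
            raise ValueError(f"no valid continuation after {output!r}")
        output += max(candidates, key=lambda tok: score_fn(output, tok))
    return output


# Toy "model": it strongly prefers a token that is not part of any allowed tool name.
vocab = ["re", "ad", "ed", "it", "gr", "ep", "rm"]
scores = {"rm": 9.0, "gr": 2.0, "re": 1.0, "ed": 0.5}


def toy_score(prefix: str, token: str) -> float:
    return scores.get(token, 0.0)


allowed_tools = {"read", "edit", "grep"}
unconstrained_first = max(vocab, key=lambda tok: toy_score("", tok))
constrained = constrained_greedy_decode(toy_score, allowed_tools, vocab)
print(unconstrained_first, constrained)
```

Without the mask the toy model's first choice is `rm`; with it, the only reachable outputs are the three allowed names.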

  • xyzsparetimexyz 11 minutes ago
    Does Pi even need read/write/edit tools? Couldn't it just have bash commands and get the model to use e.g. sed for everything?
  • lukasco 2 hours ago
    It sounds like harnesses might have to start to have model by model system prompts, though retrying works, I guess. It reminds me of the ancient times when browsers all read HTML and CSS differently, and differently on different devices. In that sense, this is nothing new. I was going to say, at least we don't have different device types, but then, the model still has to output the right variant of `grep` as well.
    • the_mitsuhiko 2 hours ago
      The problem with hyper targeting harnesses to models is that you end up locking yourself quite quickly into special behaviors of models, and you make your sessions non transferrable. That can be an acceptable trade-off and I know people who do that.
    • dofm 2 hours ago
      The flip side of this is training models to better understand harness interaction, I suppose, which (if I understand it properly and I am in no way sure I do) appears to be what the Qwen AgentWorld model is doing?
  • dofm 2 hours ago
    As critical as I am about articles endlessly concerned with the weaknesses of closed-source cloud LLMs, this one is pretty great, and not just because it concerns interactions with Pi, which looks to me like it's going to end up a sort of quasi-reference implementation of an open source harness, and because it has so much useful technical detail.

    But:

    "Now I’m somewhat worried about the track we’re on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology."

    Only implicitly?

    --

    Many decades ago when I was working on research related to using MOOs as a learning environment, you would add "tool calls" into the stream of text that a MOO object might generate, so your rich client would e.g. show a picture, load a web page in a frame, move you on a map, trigger a change in an on-screen representation of an object.

    Everyone who tried this in MUD/MUSH/MOO clients ran into more or less the same problems that LLM clients do: any attempt to shoehorn control sequences into in-band content was riddled with security risks, objects accidentally triggering the wrong interface etc.; you could never truly communicate out-of-band.

    The more I read about how agentic harnesses work, the less embarrassed I feel about the code twenty-something-year-old me wrote in a MOO client.

  • sestep 1 hour ago
    > In case you are curious about Fable: I intentionally did not test it because I was not sure if the classifiers they are running might downgrade me to Opus silently.

    Is this still a thing? I thought Anthropic walked back the silent downgrades so now all the different domains downgrade non-silently.

    • resonious 1 hour ago
      Claude Code downgrades loudly but I'm not sure what happens over API or with other harnesses, OpenRouter, etc.
  • mappu 2 hours ago
    In my harness I implemented apply_patch to just take unified diffs for patch -p1. I was shocked to see how bad models are at generating them. I started logging diff failures to analyse:

    - All models are terrible at generating line numbers for a proper diff, give up on them

    - Some models (Owl-alpha) must have been post-trained on Codex transcripts, because they occasionally push its V4A patch format into any diff tool available

    - Codex puts a lot of info in its system prompt about the desired patch style, making larger hunks instead of granular ones, etc
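
One way to act on the "give up on line numbers" observation above is to ignore the `@@` header entirely and locate each hunk by its context and removed lines. The sketch below is an assumption about how that could look, not the commenter's implementation; the function name `apply_hunk` is hypothetical, and it expects only a hunk body (no `---`/`+++` file headers).

```python
def apply_hunk(file_lines: list[str], hunk_lines: list[str]) -> list[str]:
    """Apply one unified-diff hunk by matching its context, ignoring line numbers.

    Lines starting with ' ' are context, '-' are removals, '+' are additions.
    The '@@ ... @@' header is skipped because it matches none of those prefixes.
    """
    # Some generators strip the single space from empty context lines; restore it.
    lines = [line if line else " " for line in hunk_lines]
    old_block = [line[1:] for line in lines if line[0] in (" ", "-")]
    new_block = [line[1:] for line in lines if line[0] in (" ", "+")]
    if not old_block:
        raise ValueError("hunk has no context or removed lines to anchor on")

    n = len(old_block)
    matches = [
        i
        for i in range(len(file_lines) - n + 1)
        if file_lines[i : i + n] == old_block
    ]
    # Refuse ambiguous or missing anchors instead of guessing a location.
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for hunk context, found {len(matches)}"
        )
    start = matches[0]
    return file_lines[:start] + new_block + file_lines[start + n :]


source = ["def f():", "    return 1", "", "def g():", "    return 2"]
# The line numbers in the header are deliberately wrong; they are never read.
hunk = ["@@ -40,2 +40,2 @@", " def g():", "-    return 2", "+    return 3"]
patched = apply_hunk(source, hunk)
print("\n".join(patched))
```

Requiring exactly one match trades some convenience for safety: a hunk with too little context fails loudly, and the error message gives the model something concrete to fix on retry.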

    • fractorial 1 hour ago
      In my harness, I implemented tool_edit as a subset of Rob Pike’s Sam editor syntax [0].

      Only need ~650 tokens of system prompt for it to work. It’s pretty stellar.

      [0] https://9p.io/sys/doc/sam/sam.html

  • wseqyrku 2 hours ago
    > You can ask the model to produce valid JSON

    Doesn't always work, for better performance you can kneel and start begging

  • _doctor_love 1 hour ago
    This makes sense to me, much as I don't like it. IMHO the strategy taken by StrongDM's attractor coding agent seems like a path of least resistance. Directly target the LLM providers' APIs and directly target their default tools.
  • wxw 48 minutes ago
    > [...] newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nested edits[] array

    > My strongest hypothesis is that this is not random deterioration but a training artifact. [...] Anthropic’s own client appears to expect and accept a fair amount of slop and repairs it, mostly silently

    > If reinforcement learning happens in a harness like that, or a simulation of one, then slightly malformed tool calls can still complete the task and receive reward.

    > Worse, the model may become very strongly adapted to the canonical Claude Code edit tool shape.

    > Tool schemas are somewhere in the distribution and some shapes are close to what the model saw during post-training and some are far away.

    Great article.

    Interesting root cause hypothesis. Couldn't one simply strip the slop-handling from the RL env's harness to avoid this though?

    I do agree on the walled garden being built here. Proprietary frontier models performing best in proprietary harnesses makes sense for Anthropic's interests.

  • ares623 2 hours ago
    Open source developer surprised and concerned by the trajectory their favorite proprietary software is taking.
  • cyanydeez 2 hours ago
    building deterministic tools on non-determinism is hard enough; try adding another layer where your cloud provider decides to massage the context, realigns its permitted output, arbitrarily downgrades context to cheaper models, or they hire an MBA who determines your plan value can be tied to a degraded model under a new shrinkfied.

    It's amazing anyone watched the last 2 decades of tech's enshittification and wants to hook their wagon to this shitshow.
