7 comments

  • foundry27 58 minutes ago
    I like this idea. This might be one of the more effective social pressures available for getting inference providers to fix long-standing issues. AWS Bedrock, for example, has crippling defects in its serving stack for Kimi’s K2 and K2.5 models that cause 20%-30% of attempts to emit tool calls to instead silently end the conversation (with no token output). That makes AWS effectively irrelevant as a serious inference provider for Kimi, and conveniently pushes users onto Bedrock’s significantly more expensive Anthropic models for comparable performance on agentic tasks.
  • bobbiechen 3 hours ago
    If I understand correctly, the threat model here is to protect against accidental issues that would impact performance, but it doesn't cover a malicious actor.

    For example, Sketchy Provider tells you they are running the latest and greatest, but actually is knowingly running some cheaper (and worse) model and pocketing the difference. These tests wouldn't help since Sketchy Provider could detect when they're being tested and do the right thing (like the Volkswagen emissions scandal). Right?

    • nulltrace 45 minutes ago
      Catching accidental drift is still worth a lot. It's basically the same idea as performance regression tests in CI: nobody writes those because they expect sabotage. They're for the boring stuff, like "oops, we bumped a dep and throughput dropped 15%".

      If someone actually goes out of their way to bypass the check, that's a pretty different situation legally compared to just quietly shipping a cheaper quant anyway.
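
      A minimal sketch of that kind of regression gate, assuming a stored baseline metric and a relative tolerance. The function name and the 5% tolerance are made up for illustration, not taken from any particular CI tool:

```python
def regressed(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Return True when `current` is more than `tolerance` (relative) below `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline > tolerance


# The 15% throughput drop from the example trips a 5% tolerance.
print(regressed(baseline=100.0, current=85.0))  # True
# A 3% dip stays inside the tolerance.
print(regressed(baseline=100.0, current=97.0))  # False
```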

    • gpm 2 hours ago
      Yes and no.

      For a truly malicious actor, you're right. But it shifts it from "well we aren't obviously committing fraud by quantizing this model and not telling people" to "we're deliberately committing fraud by verifying our deployment with one model and then serving customer requests with another".

      I suspect there are a lot of semi-malicious actors who are only happy to do the former.

    • j-bos 2 hours ago
      Seems like a great challenge for all these systems; see frontier labs serving quants when under heavy load.
  • comboy 13 minutes ago
    Message I sent 2 days ago to a friend:

        FYI, be careful with OpenRouter for your LLM-generated content. I was playing with Kimi and spent hours tuning knobs (OpenRouter also has different providers; I tested them all and used Exacto). Long story short:
        Direct API with thinking disabled: 0 errors across all 16 runs. OpenRouter: 4/10 had errors. 
    
    This is on a complex task. Kimi is really good, and I wouldn't have known if I hadn't used their API directly.
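
    For a rough sense of whether 0/16 versus 4/10 could be noise, here is a sketch of a one-sided Fisher exact test using only the standard library. It assumes the runs are independent and comparable, which the message doesn't establish, and the function name is just illustrative:

```python
from math import comb


def fisher_one_sided(err_a: int, n_a: int, err_b: int, n_b: int) -> float:
    """P(group B gets at least err_b of the errors) if both groups share one error rate."""
    total_err = err_a + err_b
    denom = comb(n_a + n_b, total_err)
    p = 0.0
    # Sum the hypergeometric tail: B ends up with k >= err_b of the errors.
    for k in range(err_b, min(total_err, n_b) + 1):
        p += comb(n_b, k) * comb(n_a, total_err - k) / denom
    return p


# Group A: direct API (0 errors in 16 runs). Group B: OpenRouter (4 errors in 10 runs).
p = fisher_one_sided(err_a=0, n_a=16, err_b=4, n_b=10)
print(round(p, 4))  # 0.014
```

    Under those assumptions, a split this lopsided would show up by chance roughly 1.4% of the time.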
  • gertlabs 20 minutes ago
    This is a real issue in our benchmarks. Beware of OpenRouter providers that don't specify quantizations or use lower ones than you might be expecting. OpenRouter does provide configuration options for this, but using them often limits your options significantly. That being said, even with the best providers, Kimi-K2-thinking was underwhelming and slow on our benchmarks, albeit interesting and useful for temperature/variation.

    Kimi K2.6, however, is the new open source leader, so far. Agentic evaluations still in progress, but one-shot coding reasoning benchmarks are ready at https://gertlabs.com/?mode=oneshot_coding

  • OsamaJaber 3 hours ago
    Good to see this exist. Inference providers quietly swap quant levels, and most users never check. A standard verifier from the model maker is the right move; I would love to see other labs ship the same.
  • seism 3 hours ago
    A test that runs for 15 hours on a high-powered rig is going to be hard to reproduce or scale. But I think this addresses a widespread concern, which affects all kinds of cloud services. What you ping is not necessarily what you get.
    • Lalabadie 33 minutes ago
      You can run the whole suite once at the start for each vendor, then roll through each part of it over a two- or four-week cycle, mimicking regular use. That keeps the evaluation up to date over time.
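
      A rough sketch of the rolling part, assuming the suite can be split into independent test cases. The case IDs and the round-robin split are placeholders, not how the actual verifier is organized:

```python
def rolling_schedule(test_ids: list[str], cycle_days: int) -> list[list[str]]:
    """Spread test cases across `cycle_days` daily batches, round-robin."""
    if cycle_days <= 0:
        raise ValueError("cycle_days must be positive")
    batches: list[list[str]] = [[] for _ in range(cycle_days)]
    for i, test_id in enumerate(test_ids):
        batches[i % cycle_days].append(test_id)
    return batches


# Hypothetical suite of 30 cases rolled over a two-week cycle.
tests = [f"case-{i}" for i in range(30)]
schedule = rolling_schedule(tests, cycle_days=14)
print(len(schedule), sum(len(batch) for batch in schedule))  # 14 30
```

      Each day's batch is small enough to look like ordinary traffic, and every case still gets rerun once per cycle.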
  • curioussquirrel 2 hours ago
    After Anthropic, Moonshot is another model provider that restricts tweaking of sampling parameters. I do like the idea of the vendor verifier, though.
    • charcircuit 45 minutes ago
      If the post-training is done with specific sampling parameters, it would make sense to only use the parameters it was trained with.