9 comments

  • CharlesW 4 hours ago
    Previously: https://news.ycombinator.com/item?id=48709744

    https://swelljoe.com/post/will-it-mythos/: "Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size. […] It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive."

    • NitpickLawyer 2 hours ago
      > It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive.

      How is that a serious phrase in '26? I mean I have no idea if this fine-tune is good, haven't tried it, but testing a (clearly) agentic model without tool access and expecting it to work is crazy, no? What was he even testing?!

      • nodja 1 hour ago
        The last thing you want a model to do is hallucinate a tool call and its outputs...
      • reactordev 1 hour ago
        Visual Inspection Before Execution… it’s all vibe…
      • vikingcat 2 hours ago
        Maybe expecting it to recognize its limitations without tools instead of hallucinating. But yeah, not wholly useful. Its performance (and proclivity to hallucinate) with tools is what really matters.
  • ricardobayes 2 hours ago
    This is the first Qwen fine-tune that is not immediately rejected by the local LLM community, and in some cases it is even being recommended. Based on my limited usage, it is good and gives creative solutions to coding problems. I don't expect 9-35B models to one-click create full apps. Most people who were complaining did so .
    • v3ss0n 1 hour ago
      It's not any better. Most of us in the LocalLLama community don't like it, except a few new people popping up and making posts.
      • gslepak 56 minutes ago
        Indeed, it performed worse than Qwen3.6-27b in my basic test.

        It gave a fancier looking answer, but did a worse job following the prompt.

        • dofm 52 minutes ago
          Roughly my experience so far; it trips up on itself a bit.

          However, it's much more inclined to do web search unprompted, which is fascinating in its own way.

      • NitpickLawyer 45 minutes ago
        > LocalLLama community

        Ah, the place that shit on gpt-oss because it wasn't good at porn. That place is not what it used to be, hasn't been since that karpathy tweet, tbh. It's mostly slop and vibes nowadays.

        • v3ss0n 26 minutes ago
          And a lot of bots advertising renamed models like this one.
    • monkmartinez 1 hour ago
      > Most people who were complaining did so .

      It has been this way since the beginning, unfortunately. There is certainly no harm in trying out local models on local workloads with modest guardrails.

      As with most of these models (Qwen, Gemma, Llama, gpt-oss), finding all the little gotchas, like special tokens, prompt structure, and model preferences, is a PITA right now. The reward is really nice models that run exceptionally well in agentic harnesses tuned with the prompts and parameters you fought so hard to learn.

    • arcanemachiner 1 hour ago
      We must be in different communities... Qwen models are the most recommended ones that will actually run on local hardware that is accessible to the masses!
      • montroser 1 hour ago
        Yeah, but they're talking about fine-tunes.
  • kennywinker 4 hours ago
    Can anyone explain what’s the story here? Is this just a re-skinned qwen? Who is deepreinforce-ai and why isn’t this model listed on their website?

    How does it self-improve? Does the model change on disk, or does it just get better during a single context run?

    • simonw 4 hours ago
      It doesn't self-improve, that's a misleading headline.

      As far as I can tell they trained it by running their own reinforcement learning on top of Qwen and Gemma 4 (not sure how they combined weights from both, or if they used Qwen as the basis and Gemma 4 to help train?) - so the "self-improving" is about their training process, not how you use the weights.

      • kamranjon 3 hours ago
        I think the 9b and 31b dense are Gemma models and the 35B-MoE and 397B-MoE are Qwen models, since these are the model sizes covered by each of them respectively.
      • sisve 1 hour ago
        Do you think we will get a self-improving model in '26 or '27? Maybe not a native one, but some kind of hack so a model will learn something without losing part of the context window?
      • kennywinker 3 hours ago
        Gotcha. That makes more sense. We ran the model to train the model -> “self-improving”.
    • v3ss0n 1 hour ago
      Clickbait title.
  • S0y 2 hours ago
    These are simply benchmaxxed versions of either Qwen or Gemma 4.
    • 2001zhaozhao 1 hour ago
      If so, it's impressive they managed to benchmaxx Qwen even further than it's already benchmaxxed.
      • v3ss0n 1 hour ago
        Nah, they just put up graphs with different colors prioritizing themselves.
    • jorisw 2 hours ago
      Citation needed
      • S0y 46 minutes ago
        Sure. https://deep-reinforce.com/ornith_1_0.html

        >Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.

        >Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts.

  • anana_ 2 hours ago
    They keep mentioning a 31B dense model, but there are no benchmarks or weights for it anywhere?
  • v3ss0n 1 hour ago
    Self-improving bullshit. It is just a benchmaxxed Qwen 3.5 finetune. Nothing spectacular. It even fails at benchmarks. Long-session tool calls suck, and it hallucinates a lot with those too. Just use Qwen 3.6 and 3.5 122b.