I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
Last night I had it write me a complete plugin for my LLM tool like this:
llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm -m mlx-community/gemma-3-27b-it-qat-4bit \
-f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
-f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
-s 'Write a new fragments plugin in Python that registers
issue:org/repo/123 which fetches that issue
number from the specified github repo and uses the same
markdown logic as the HTML page to turn that into a
fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
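For anyone curious about the general shape of such a plugin, here is a hand-written sketch (not the model's actual output) - it assumes llm's register_fragment_loaders hook and Fragment class behave as I remember them, and the markdown rendering is deliberately simplified compared to the HTML page passed in via -f above:
import json
import urllib.request
import llm

def github_issue_loader(argument: str) -> llm.Fragment:
    # argument is the part after "issue:", e.g. "org/repo/123"
    org, repo, number = argument.split("/")
    url = f"https://api.github.com/repos/{org}/{repo}/issues/{number}"
    with urllib.request.urlopen(url) as response:
        issue = json.load(response)
    # Simplified markdown rendering - the real plugin reuses the logic from
    # the github-issue-to-markdown page.
    markdown = f"# {issue['title']}\n\n{issue.get('body') or ''}"
    return llm.Fragment(markdown, source=url)

@llm.hookimpl
def register_fragment_loaders(register):
    # Makes -f issue:org/repo/123 resolve through github_issue_loader.
    register("issue", github_issue_loader)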
More and more I realize that cost saving is only a minor benefit of local LLMs. If it's too slow it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.
With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions that means I just need a glimpse of the response, copy & paste, and get things done. Whereas with a local LLM, I watch it painstakingly print preambles I don't care about and only get what I actually need after 20 seconds (on a fast GPU).
And I'm not even talking about the context window, etc.
I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it, and most are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than buying a beefed-up Mac Studio or building a machine with a 4090.
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.
I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.
I enjoy local models for research and for the occasional offline scenario.
I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
>I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
I think it is NOT just you. Most companies with decent management would also not want their data going anywhere outside the physical servers they have control of. But yeah, for most people, just use an app and a hosted server. This is HN, though - there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.
I don't think that's been true for over a decade: AWS wouldn't be a trillion-dollar business if most companies still wanted to stay on-premise.
Or GitHub. I'm always amused when people don't want to send fractions of their code to an LLM but happily host it on GitHub. All the big LLM providers offer no-training-on-your-data business plans.
What amuses me even more is people thinking their code is too unique and precious, and that GitHub/Microsoft wants to steal it.
It's unlikely they think Microsoft or GitHub wants to steal it.
With LLMs, they're thinking of examples that regurgitated proprietary code, and contrary to everyday general observation, valuable proprietary code does exist.
But with GitHub, the thinking is generally the opposite: the worry is that the code is terrible, and seeing it would be like giant blinkenlights* indicating the way in.
* https://en.wikipedia.org/wiki/Blinkenlights
Using AWS is basically a "physical server they have control of".
That's why AWS Bedrock and Google Vertex AI and Azure AI model inference exist - they're all hosted LLM services that offer the same compliance guarantees that you get from regular AWS-style hosting agreements.
> Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds.
You may need to "right-size" the models you use to match your hardware and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.
Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large models locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.
That's not the stuff you want to send to a public API; this is something you want as a 24/7 locally running batch job.
("AI assistant" is an evolutionary dead end, and Star Trek be damned.)
Are there non-mac options with similar capabilities?
Yes, but I don't really know anything about those. https://www.reddit.com/r/LocalLLaMA/ is full of people running models on PCs with NVIDIA cards.
The unique benefit of an Apple Silicon Mac at the moment is that the 64GB of RAM is available to both the GPU and the CPU at once. With other hardware you usually need dedicated separate VRAM for the GPU.
Been super impressed with local models on Mac. Love that the Gemma models have a 128k-token input context size. However, outputs are usually pretty short.
Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?
Been playing with that, but it doesn't seem to have much effect. It works very well for limiting output to smaller sizes, like setting it to 100-200, but above 2-4k the output never seems to get longer than about one page.
Might try using the models with mlx instead of ollama to see if that makes a difference
Any tips on prompting to get longer outputs?
Also, does the model context size determine max output size? Are the two related or are they independent characteristics of the model?
> Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size per request, subtracting the request input tokens
I don't know how to get it to output anything that length though.
Will keep experimenting, will also try mistral3.1.
edit: just tried mistral3.1 and the quality of the output is very good, at least compared to the other models I tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and deepseek-r1:14b)
Doing some research, it seems that because of their training sets most models are not trained on producing long outputs, so even if they technically could, they won't. It might require developing my own training dataset and then doing some fine-tuning. Apparently the models and Ollama have some safeguards against rambling and repetition.
By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.
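One thing worth ruling out is simply hitting the output-token cap. A minimal sketch of raising it through Ollama's HTTP API, assuming a local server on the default port (num_predict and num_ctx are Ollama generation options; the model tag and prompt are just examples):
import requests

# num_predict caps generated tokens (-1 removes the cap); num_ctx is the context window.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:27b-it-qat",
        "prompt": "Write a detailed, multi-chapter outline for a novel, then write chapter one in full.",
        "stream": False,
        "options": {"num_ctx": 32768, "num_predict": 8192},
    },
    timeout=600,
)
print(resp.json()["response"])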
The first graph compares the "Elo score" of various models at "native" BF16 precision, and the second graph compares VRAM usage between native BF16 precision and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph - a quality comparison between BF16 and QAT - missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.
In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a ~4x reduction in memory usage.
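The back-of-the-envelope arithmetic, for what it's worth (weights only - real checkpoints keep some tensors at higher precision and store quantization scales, so actual files come out a bit larger):
params = 27e9                 # 27B parameters
bf16_gb = params * 2 / 1e9    # BF16 = 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # int4 = 0.5 bytes per parameter
print(f"BF16: ~{bf16_gb:.0f} GB, int4: ~{int4_gb:.1f} GB ({bf16_gb / int4_gb:.0f}x smaller)")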
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very good at translation, and the image description is impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using).
It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
It is funny that Microsoft has been peddling "AI PCs" and Apple has been peddling "made for Apple Intelligence" for a while now, when in fact usable models for consumer hardware are only barely starting to be a thing, and even then on extremely high-end GPUs like the 3090.
So what is the real comparison against DeepSeek R1? It would be good to know which is actually more cost-efficient and open (reproducible build) to run locally.
How many times do I have to say this? Ollama, llamacpp, and many other projects are slower than vLLM/sglang. vLLM is a much superior inference engine and is fully supported by the only LLM frontends that matter (sillytavern).
The community getting obsessed with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people could get far more tok/s than they think if only they knew the right tools.
For my single-machine homelab GPU server, the significant convenience benefits outweigh the higher TPS that vLLM offers. If I were hosting it for something more critical than just myself and a few friends chatting with it, sure. Being able to just paste a model name into Open WebUI and run it is important to me, though.
It is important to know about both to decide between the two for your use case though.
AFAIK vLLM is for concurrent serving with batched inference for higher throughput, not single-user inference. I doubt its throughput is higher than Ollama's when running a single prompt at a time.
Update: this is a good intro to continuous batching in LLM inference: https://www.anyscale.com/blog/continuous-batching-llm-infere...
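To make the batching point concrete, vLLM's offline API takes a whole list of prompts and schedules them together, which is where the throughput win comes from (a minimal sketch; the model id is a placeholder and you need enough VRAM for whatever you load):
from vllm import LLM, SamplingParams

# All prompts are batched by the engine rather than processed one at a time.
prompts = [
    "Summarize the plot of Hamlet in two sentences.",
    "Explain continuous batching in one paragraph.",
    "Write a haiku about GPUs.",
]
sampling = SamplingParams(temperature=0.7, max_tokens=256)
engine = LLM(model="google/gemma-3-4b-it")  # placeholder model id
for output in engine.generate(prompts, sampling):
    print(output.outputs[0].text)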
I tried sillytavern a few weeks ago... wow, that is an "interesting" UI! I blundered around for a while, couldn't figure out how to do anything useful... and then installed LM Studio instead.
Whatever those keyword things are, they certainly don't seem to be doing any form of RAG.
I did not know this, so thank you. I read a blog post a while back that encouraged using Ollama and never mentioned vLLM. Do you recommend reading any particular resource?
FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the 27b qat. You can't really compare inference engine to inference engine without keeping the hardware and model fixed.
Unfortunately Ollama and vLLM are therefore incomparable at the moment, because vLLM does not support these models yet.
https://github.com/vllm-project/vllm/issues/16856
Gemma 3 is way, way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is that its model size is too large (even though it can run fast with MoE), which limits it to the small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
Assuming this can match Claude's latest, and full-time usage (as in, you have a system that's constantly running code without any user input), you'd probably save $600 to $700 a month. A 4090 is only $2K, so you'd see an ROI within about 90 days.
I can imagine this will serve to drive prices for hosted llms lower.
At this level, any company that produces even a nominal amount of code should be running LLMs on-prem (or on AWS if you're in the cloud).
What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.
We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.
I would normally say vLLM, but the blog post notably does not mention vLLM support.
I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in the image, but I didn't expect it to be able to recognize text within the image.)
Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).
(I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)
With Ollama you can offload a few layers to the CPU if they don't fit in VRAM. This costs some performance of course, but it's much better than the alternative (everything on the CPU).
I'm doing that with a 12GB card, ollama supports it out of the box.
For some reason it only uses around 7GB of VRAM, probably due to how the layers are scheduled. Maybe I could tweak something there, but I didn't bother just for testing.
Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.
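If you want to experiment with that split yourself, here's a sketch of pinning the layer count via the ollama Python client (num_gpu is Ollama's option for how many layers go to the GPU; the right number depends on your card and context size):
import ollama

# Keep ~30 layers on the GPU and run the rest on the CPU; lower num_gpu if you
# still run out of VRAM, raise it if you have headroom.
resp = ollama.generate(
    model="gemma3:27b-it-qat",
    prompt="One-line summary of quantization-aware training, please.",
    options={"num_gpu": 30},
)
print(resp["response"])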
To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.
I didn't realize that the context would require so much memory. Is this the KV cache? It would seem like a big advantage if this memory requirement could be reduced.
Has anyone packaged one of these in an iPhone app? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI apps if we can get reasonable tokens/sec.
There are many such apps, e.g. Mollama, Enclave AI or PrivateLLM or dozens of others, but you could tell me it runs at 1,000,000 tokens/second on an iPhone and I wouldn't care, because the largest model version you're going to be able to load is Gemma 3 4B q4 (12B won't fit in 8 GB with the OS + you still need context) and it's just not worth the time to use.
Prompt Tokens: 10
Time: 229.089 ms
Speed: 43.7 t/s
Generation Tokens: 41
Time: 959.412 ms
Speed: 42.7 t/s
That said, if you really care, it generates faster than reading speed (on an A18-based model at least).
For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a Raspberry Pi at around 5-20 tokens per second.
The use case is less about the specific application and more about being willing to accept really bad answers with extremely high rates of making things up. The same goes for summarization - it's not as if a tiny model does it well the way a large model would.
Am I missing something? These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.
Is there something more to this, or is it just a follow-up blog post?
(is it just that ollama finally has partial (no images right?) support? Or something else?)
How is this more significant now than when they were uploaded 2 weeks ago?
Are we expecting new models? I don't understand the timing. This post feels like it's two weeks late.
[1] - https://huggingface.co/collections/google/gemma-3-qat-67ee61...
QAT ("quantization-aware training") means they had it quantized to 4 bits during training, rather than quantizing after training in full or half precision. It supposedly gives higher quality, but unfortunately they don't show any comparisons between QAT and post-training quantization.
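For intuition, QAT is typically implemented with "fake quantization" in the forward pass plus a straight-through estimator for the gradients - roughly along these lines (a toy PyTorch sketch of the general idea, not Google's actual recipe; the grouping and scaling are simplified):
import torch

def fake_quantize_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Simulate symmetric 4-bit weight quantization during training."""
    g = w.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0  # int4 range [-8, 7]
    deq = (torch.clamp(torch.round(g / scale), -8, 7) * scale).reshape(w.shape)
    # Straight-through estimator: forward sees quantized weights,
    # backward pretends the rounding wasn't there.
    return w + (deq - w).detach()

# During QAT every forward pass uses the fake-quantized weights, so the model
# learns weights that still work after 4-bit rounding.
w = torch.nn.Parameter(torch.randn(64, 64))
x = torch.randn(8, 64)
loss = (x @ fake_quantize_int4(w).T).pow(2).mean()
loss.backward()
print(w.grad.shape)  # gradients flow to the full-precision weights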
The partnership with Ollama and MLX and LM Studio and llama.cpp was revealed in that announcement, which made the models a lot easier for people to use.
> 17 days ago
Anywaaay...
I'm literally asking, quite honestly, if this is just an 'after the fact' update literally weeks later, that they uploaded a bunch of models, or if there is something more significant about this that I'm missing.
Last time we only released the quantized GGUFs. Only llama.cpp users could use it (+ Ollama, but without vision).
Now we have released the unquantized checkpoints, so anyone can quantize them themselves and use them in their favorite tools, including Ollama with vision, MLX, LM Studio, etc. The MLX folks also found that the QAT model quantized to 3 bits held up decently compared to naive 3-bit quantization, so by releasing the unquantized checkpoints we enable further experimentation and research.
TL;DR: one was a release in a specific format/tool; we followed up with a full release of artifacts that enables the community to do much more.
Other than being lighter than the other models, is there anything the Gemma models are specifically good at, or better than other models at doing?
I have found that Gemma models are able to produce useful information about more niche subjects that other models like Mistral Small cannot, at the expense of never really saying "I don't know" where other models will - instead they produce false information.
For example, if I ask Mistral Small who I am by name, it will say there is no known notable figure by that name before the knowledge cutoff. Gemma 3 will say I am a well-known <random profession> and make up facts. On the other hand, I have asked both about a local organization in my area that I am involved with, and Gemma 3 could produce useful and factual information, whereas Mistral Small said it did not know.
They are multimodal. I haven't tried the QAT one yet, but the Gemma 3s released a few weeks ago are pretty good at processing images and telling you details about what's in them.
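A quick way to poke at the image input locally, assuming the ollama Python client and one of the QAT tags listed elsewhere in this thread (the image path is a placeholder):
import ollama

# Gemma 3 is multimodal: pass an image path alongside the prompt and it will
# describe the picture and read text inside it.
reply = ollama.chat(
    model="gemma3:27b-it-qat",
    messages=[{
        "role": "user",
        "content": "Describe this image and transcribe any text you can see.",
        "images": ["./photo.jpg"],  # placeholder path
    }],
)
print(reply["message"]["content"])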
Very excited to see these kinds of techniques, I think getting a 30B level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.
This is my first time trying to locally host a model - gave both the 12B and 27B QAT models a shot.
I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran very nice and swift.
However, they're seemingly terrible at actually assisting with stuff. I tried something very basic: asked for a PowerShell one-liner to get the native block size of my disks. It ended up hallucinating fields, then sending me off into the deep end: first elevating to admin, then using WMI, then bringing up IOCTL. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.
Local models, being much smaller than the big cloud models, favor popular languages over more niche ones. They work fantastically for JavaScript, Python, and Bash, but much worse for less popular things like Clojure, Nim, or Haskell. PowerShell is probably on the less popular side compared to JS or Bash.
If this is your main use case you can always try to fine tune a model.
I maintain a small LLM benchmark across different programming languages, and the performance difference between, say, Python and Rust on some smaller models is up to 70%.
I think Powershell is a bad test. I've noticed all local models have trouble providing accurate responses to Powershell-related prompts. Strangely, even Microsoft's model, Phi 4, is bad at answering these questions without careful prompting. Though, MS can't even provide accurate PS docs.
My best guess is that there's not enough discussion/development related to Powershell in training data.
Shouldn't it fit a 5060 Ti 16GB, for instance?
With a 128K context length and an 8-bit KV cache, the 27b model occupies 22 GiB on my system. With a smaller context length you should be able to fit it on a 16 GiB GPU.
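The usual back-of-the-envelope KV-cache formula shows why the context dominates (the layer/head/dim numbers below are illustrative placeholders, not the actual Gemma 3 27B configuration, and Gemma 3's interleaved sliding-window attention layers shrink this considerably in practice):
def kv_cache_gib(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2x for keys and values; one entry per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

# Naive full-attention estimate at 128K context with an 8-bit cache.
print(f"{kv_cache_gib(128_000, n_layers=60, n_kv_heads=16, head_dim=128, bytes_per_elem=1):.1f} GiB")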
When I see 32B or 70B models performing similarly to 200+B models, I don’t know what to make of this. Either the latter contains more breadth of information but we have managed to distill latent capabilities to be similar, the larger models are just less efficient, or the tests are not very good.
It makes intuitive sense to me that this would be possible, because LLMs are still mostly opaque black boxes. I expect you could drop a whole bunch of the weights without having a huge impact on quality - maybe you end up mostly ditching the parts derived from shitposts on Reddit but keep the bits from arXiv, for example.
(That's a massive simplification of how any of this works, but it's how I think about it at a high level.)
It's just BS benchmarks. They are all cheating at this point, feeding the benchmark data into the training set. That doesn't mean the LLMs aren't becoming better, but when they all lie...
./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png
Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.
That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.
https://ollama.com/library/gemma3:27b-it-qat
https://ollama.com/library/gemma3:12b-it-qat
https://ollama.com/library/gemma3:4b-it-qat
https://ollama.com/library/gemma3:1b-it-qat