“GPU acceleration capabilities in llamafiles are limited, making them primarily optimized for CPU inference. If your workflow demands GPU-intensive operations or extremely high inference throughput, you might find llamafiles less efficient compared to GPU-optimized cloud solutions.”
Definitely going to be a dealbreaker for a lot of people.
Last release was May 14, and I only see a handful of commits (looks like minor refactoring). Is this actively maintained / worked on?
Particularly asking because I've been using it and it is great for what it does, but if it doesn't work on new models I'm going to feel more pressure to look at alternatives (which is a pain, because I am quite sure none of them can compete with "download this one file executable and point it at a .gguf file" for ease of use).
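For anyone who hasn't tried it, the whole workflow is roughly this (filenames are just examples):

    chmod +x ./llamafile         # make the downloaded executable runnable
    ./llamafile -m model.gguf    # point it at any .gguf weights file

No install step, no runtime, no dependency resolution. That's the bar any alternative has to clear.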
I built a file-sharing CLI called ffl, which is also an APE built on Cosmopolitan Libc, just like llamafile.
Since llamafiles can be quite large (multi-GB), I built ffl to help 'ship' these single-file binaries easily across different OSs using WebRTC. It feels natural to pair an APE transfer tool with APE AI models.
https://github.com/nuwainfo/ffl
https://github.com/mozilla-ai/llamafile/discussions/809
Wonderful project, please check it out.