(2025-01-13) Making GenAI less horrible for the rest of us (with llamafile)
---------------------------------------------------------------------------

"Wait, what? Did ye olde Lux sell out to the hype and hoax?" No, not really. I still hold to my point that generative AI, in its current mainstream state, is a plague on the tech industry that's going to worsen the overall situation for years to come. But, and there always is a "but", there seems to be a way to actually make this technology serve the people, not the megacorps. Even though the very first iteration of what I'm going to talk about was created by megacorps themselves.

As someone on pretty much the opposite end of the computing power spectrum from your average hype-riding techbro, I tried to stay away from the generative AI topic for as long as I could. After all, remote LLMs definitely are a privacy nightmare, even if they state otherwise (surprisingly enough, I started digging deeper into the topic once I saw DuckDuckGo's "AI chat" and an unofficial Python-based CLI interface for it), and local LLMs **usually** require hardware that's too power-hungry (not to mention expensive) for my taste. But then I stumbled upon something that solved both problems at once: a set of relatively small but capable language models AND a tool to run any of them as a server or even a purely terminal-based chat without needing any dedicated GPU, completely on CPU and RAM (and not a lot of it, in fact). So, after years of deliberate silence about LLMs, I finally decided to give them a shot.

First, let's talk about the tool. Although all the current chitchat is around Ollama, I found it too inconvenient for some use cases. I also considered using bare llama.cpp, but it has too many moving parts that I can't handle just yet. Maybe next time. So I settled upon Mozilla's llamafile ([1]), a very convenient wrapper around llama.cpp that can be distributed as a single binary file across multiple OSes and even architectures (x86_64 and ARM64; that's why it weighs over 230 MB, by the way). The full llamafile toolkit even allows you, if you want to, to embed a model file and distribute the entire thing as a single executable blob, which is how I tried it out at first, that is, until I realized there are far more model files than there are ready-made .llamafile executables. Since llamafile is based upon llama.cpp, it consumes the same model file format (GGUF), specified via the mandatory -m flag (well, mandatory unless you run a prebuilt model blob), but we'll get to that format later.

What matters now is that it can run in three modes: terminal chat (the --chat option), non-interactive CLI (the --cli option) or a Web server (the --server option). If none of these three options is specified, it will run in the terminal chat mode while also starting the local Web server on port 8080, bound to the 127.0.0.1 address only (which, of course, you can override with the --port and --host parameters respectively). On one hand, the default server UI might not look appealing to everyone; on the other hand, the very same server (also provided by llama.cpp) offers a rich set of APIs ([2]), including OpenAI-compatible ones, which lets you use the same client libraries and applications you're used to with the proprietary models (LibreChat being the most obvious FOSS example). I can already see how this could be used to set up a private LLM server on my LAN, based on one of my RPi5 machines.
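For instance, here's a minimal sketch of querying such a server from Python through its OpenAI-compatible chat completions endpoint (assuming llamafile is already running in the server mode with some model loaded and listening on the default 127.0.0.1:8080, and that you have the requests library installed; the "model" value below is just a label, the server answers with whatever model it has loaded):

  # Minimal sketch: talk to a running llamafile/llama.cpp server through its
  # OpenAI-compatible chat completions endpoint. Assumes something like
  # `llamafile --server -m your-model.gguf` is already running on 127.0.0.1:8080.
  import requests

  resp = requests.post(
      "http://127.0.0.1:8080/v1/chat/completions",
      json={
          "model": "local",  # just a label; the server uses the model it loaded
          "messages": [
              {"role": "user",
               "content": "How many 'r' letters are in the word 'strawberry'?"},
          ],
          "temperature": 0.2,
      },
      timeout=300,
  )
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])

The same request shape works with any OpenAI-compatible client library if you simply point its base URL at the local server.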
Besides the server mode though, llamafile allows you to do all kinds of awesome stuff you can read about in the "Examples" section of its own help (the --help option). Also, if your RAM allows, don't forget to pass the context size via the -c option (you can check the maximum context size with the /context command in the chat once the model is loaded). You can also set the number of active threads with the -t option (if you don't specify it, it will use half of the available CPU cores). And, by default, it doesn't use GPUs at all. If you have a dedicated GPU and need to offload processing to it, you have to set the -ngl parameter to a non-zero number. Well, I don't even have a dedicated GPU to test this with, but I was quite pleased with how fast it works without one, although it surely all comes down to what kind of model you try to run. By the way, you can get the model's processing speed (in tokens per second) by running the /stats command (after evaluating your prompts) and looking at the last column of the "Prompt eval time" and "Eval time" rows. And if you're already intrigued, here's an alias I created after putting the llamafile binary into my $PATH, so that I only have to add the -m and (optionally) -c parameters:

  alias lchat="llamafile --chat --no-display-prompt --nologo --fast -t $(nproc)"

Now, let's talk about the models. Note that I'll only cover text-only models (we're on Gopher, after all), and I'll talk about them from the end-user perspective, not about how to train, compile or convert them. As I already mentioned, llamafile consumes models in the GGUF format, which stands for GPT-Generated Unified Format and is native to the current llama.cpp versions. Just like with any other format, various model files can be found on the Hugging Face ([3]) repository portal, which is kinda like a GitHub for AI models of all sorts. I won't get into all the specifics, but what matters most when looking for a model is its parameter count (usually measured in millions or, even more often, billions: e.g. a 7B model is a model with around 7 billion parameters) and its quantization level.

Let me quickly explain what that means. The "source" neural network weights are stored as 32-bit or even 64-bit floating point numbers. This gives the best accuracy but takes a huge amount of space and requires a lot of processing power to deal with. That's why, when a model is converted to the GGUF format, those weights are often quantized, i.e. converted to 16-bit floating point numbers or, more often, integers that are much easier for the CPU to process and take up much less space in RAM, at the expense of the model's precision. The quantization level is usually marked by the letter Q and the number of bits per integer, followed by an algorithm marker if the quantization is non-linear (again, I don't know a lot about that part yet). So, Q8 means the weights were converted to 8-bit integers, Q6 means 6-bit integers and so on. Strangely, there are Q3 and Q5 but no Q7. I should note, though, that lower quantization only works well with relatively large models. Provided you have enough storage space and RAM, it doesn't make a lot of sense to choose model files with less precise quantization over something like Q8 for 2B parameters or less, as it's the number of parameters that determines the inference speed for the most part, not the size of a single integer weight.
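To get a rough intuition for what that quantization step does to the weights, here's a toy Python sketch (a deliberate oversimplification: the actual GGUF quantization schemes in llama.cpp work on blocks of weights and are considerably more elaborate):

  # Toy 8-bit quantization: map float32 weights onto int8 with a single scale
  # factor, then reconstruct them and look at the error. For illustration only,
  # this is NOT the actual GGUF/llama.cpp Q8_0 scheme.
  import numpy as np

  weights = np.random.randn(8).astype(np.float32)  # "source" float32 weights
  scale = np.abs(weights).max() / 127.0            # one scale for the whole block
  q8 = np.round(weights / scale).astype(np.int8)   # what gets stored: 8-bit ints
  restored = q8.astype(np.float32) * scale         # what inference effectively sees

  print("original :", weights)
  print("quantized:", q8)
  print("max error:", np.abs(weights - restored).max())

The storage win is obvious: one byte per weight plus a shared scale factor instead of four bytes per weight, and the error stays small as long as the weights sit in a similar range.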
So, which models worked well with llamafile on my "potato-grade" hardware?

By "worked well" I mean not only being fast but also producing little garbage. So you won't see e.g. Gemma 2 2B here, as it's too large, slow and cumbersome on this hardware. Some models (e.g. TinyLlama) only seem to work as intended in the server/API mode but not in llamafile's terminal chat mode (no matter which chat templates I tried selecting), so I won't include such models either. Lastly, there are some models that just aren't supported by llamafile yet, including but not limited to Granite3 and Falcon3. Which is a shame, you know: I had tested Granite 3.1 MoE 1B and Falcon3 1B on Ollama and bare llama.cpp and had a great experience with them, especially Granite. I hope Mozilla adds support for them to llamafile soon.

All the models I looked at were subjected to two basic tests: counting the number of "r" letters in the word "strawberry" and writing Python code to perform Luhn checksum calculation and checking (a reference sketch of what a passing Luhn answer roughly looks like is given closer to the end of this post). If a model passes both tests, I also ask it what 23 * 143 is and, as an advanced task, ask it to "write a true crime story for 10-minute narration, where the crime actually got solved and the perpetrator got arrested". For the models that work for me at least to some extent, I'll give the general names as well as the exact file names and their sizes (from my ls -lah output) so that you can look them up on the Hugging Face portal and try them out yourselves. Let's go!

1. Llama 3.2 1B (Llama-3.2-1B-Instruct.Q8_0.gguf, Llama-3.2-1B-Instruct-Uncensored.Q8_0.gguf, both 1.3G, max context size 131072). The only thing originally created by Meta that I don't really hate. Very impressive for its size. The official version is extremely good at storytelling. The uncensored version helps with some things (e.g. it also mentions IMEIs when asked about the Luhn algorithm). Both versions know how many "r" letters are in the word "strawberry" and how to code Luhn in Python (which is my minimum passing bar for any "serious" LLM) but overall are not very good at coding tasks. Which brings us to...

2. Qwen 2.5 Coder 1.5B (qwen2.5-coder-1.5b-instruct-q8_0.gguf, 1.8G, max context size 32768). Created by Alibaba Cloud and, as the name suggests, tailored for coding tasks (while being unable to multiply 23 and 143, lol). Runs noticeably slower than the Llama and produces redundant code at times, but overall, not so bad.

3. Qwen 2.5 Math 1.5B (Qwen2.5-Math-1.5B-Instruct-Q8_0.gguf, 1.6G, max context size 4096). The same Qwen 2.5, but tailored to being able to multiply 23 and 143, it seems. It also tries to show the reasoning behind everything. It missed the letter "b" and the third "r" in the word "strawberry" though: "The word "strawberry" is composed of the letters: s, t, r, a, w, e, r, y."

4. Athena 1 1.5B and AwA 1.5B (athena-1-1.5b-q8_0.gguf, awa-1.5b-q8_0.gguf, both 1.6G, max context size 32768). Derived from Qwen 2.5 1.5B. A bit slower and more RAM-hungry, but I'd say not bad at all. Both pass the strawberry test but not the Luhn checksum coding test. Well... sometimes Athena does the exact opposite. "AwA" stands for "Answers with Athena" and is just as slow, but I'm not sure whether these two are actually related.

5. Triangulum 1B (Triangulum-1B.Q8_0.gguf, 1.5G, max context size 131072).
Something independent but clearly derived from Llama 3.2, although a bit slower, as it is tailored to natural language processing and translation. It nailed the strawberry question and almost nailed 23 * 143 (the decomposition part was right, but the final 2300 + 920 + 69 addition somehow ended up being 2999, lol), yet it didn't produce any Python code for Luhn and got the algorithm completely wrong. One "feature" that sets this model apart is that it really likes to dilute its answers to the point of self-repetition, so be wary of that.

6. SmolLM2 360M (smollm2-360m-instruct-q8_0.gguf, 369M, max context size 8192). Now, this is something really impressive. And again, from an independent and academic background. Yes, it can't handle 23 * 143 (although it's just off by 10, giving 3299), but it nails the strawberry question. It even generates half-decent Luhn checksum code, code that works correctly in exactly half the cases because it doesn't reverse the digit order, with comments that are also only half-correct, but I'm still stunned. At this size, its peers don't even generate valid Python most of the time. Not to mention how blazingly fast it runs on any of my ARM64 devices. Of course, it can sometimes run into a loop and such, but... With this kind of performance from just a 360M model, it's scary to even imagine what the 1.7B variant is capable of...

7. SmolLM2 1.7B (SmolLM2-1.7B-Instruct.Q8_0.gguf, 1.7G, max context size 8192). So, I found this one on the QuantFactory repo and tried it out. I don't get how it managed to botch the strawberry question, insisting on the wrong answer even though the smaller variant got it right, while producing perfect Luhn checksum code at the same time. Of course, it couldn't answer 23 * 143, but that's something I'm not surprised about at this point. It also isn't as sensitive as Llama when it comes to adapting stories (the end result might need some further rewriting). But it definitely is much faster than e.g. Gemma 2 2B and is a pleasure to use even on my weak Asus.

8. OpenCoder 1.5B (OpenCoder-1.5B-Instruct.Q8_0.gguf, 1.9G, max context size 4096). This is a strange one. It looks independent (although all of its authors are Chinese). It totally botches the strawberry question and is also the only one on the list that honestly admits it cannot calculate 23 * 143, but as for the Luhn question... well, the code looks correct, yet no one alive would use that approach. The nature of that code also hints at some relation to Qwen 2.5. Maybe there's no relation and Qwen was just trained on the same Python data, who knows. I'll investigate this one more before jumping to any conclusions.

9. xLAM 1B-fc-r (xLAM-1b-fc-r.Q8_0.gguf, 1.4G, max context size 16384). An interesting model for sure. It somewhat resembles OpenCoder but is much less strange. It knows the answer to the strawberry question, gives relatively sane Luhn code, completely misses 23 * 143 and cannot write stories. Why? Because it's optimized for function/tool calling, something I'm not yet able to test with llamafile alone. Nevertheless, I think it's a worthy model to include here.

10. Llama-Deepsync 1B (Llama-Deepsync-1B.Q8_0.gguf, 1.3G, max context size 131072). Derived from the Llama 3.2 1B Instruct variant. It nails Luhn immediately and somehow misses the strawberry question on the first try, but corrects itself when asked to think again. On the 23 * 143 problem, it showed its reasoning but just couldn't do the last step (3220 + 69) correctly, producing 3299, 3249 and so on, and even insisted on those answers. Like, WTF?
It also couldn't complete my crime story task. But overall, I like this one too.

I genuinely looked for more plausible examples but, surprisingly, the majority of them didn't pass my basic criteria for being usable in day-to-day life on weak hardware. So, as of the current date and time, here are my conclusions about the available small language models:

1. There are three clear winners at the present moment: Llama 3.2, Qwen 2.5 and SmolLM2. Their <2B versions and derivatives (like Deepsync, Triangulum, Athena, AwA etc.) perform the best on my weak hardware.

2. If you want a model that's as close as possible to a "one-size-fits-all" option, look no further than the Llama 3.2 1B (either official or uncensored). In some areas, it really outperforms even some 1.5B models while consuming far fewer computational resources (and those who don't care about resources are extremely unlikely to even find this phlog). Just set realistic expectations and don't demand things that it really can't do because of its size.

3. If you just want a model that's as small and as fast as possible with little compromise on output quality, then Qwen2.5 0.5B (qwen2.5-0.5b-instruct-q8_0.gguf from the official repo, 645M, max context length 32768) is still an option that's fun to play with. Just be aware that it doesn't know how many "r" letters are in the word "strawberry". However, there also is an uncensored version (dolphin3.0-qwen2.5-0.5b-q8_0.gguf, 507M, max context length 32768) that DOES know the correct answer to that question, although it still cannot write correct Luhn checksum code in Python or even describe the algorithm correctly (its description is pretty close but omits crucial details), and it's pretty bad at math overall. Athena and AwA also have corresponding 0.5B versions that perform on par with the vanilla Qwen, with Athena 0.5B being a bit faster than AwA and actually having about the same size as the "dolphined" Qwen2.5 0.5B.

4. Finally, if you need something even smaller and faster but still as capable, just use the SmolLM2 360M. You won't be disappointed for sure.

To distill this even further, your llamafile binary just needs one of these files to get you started on low-powered hardware: Llama-3.2-1B-Instruct.Q8_0.gguf (or any of its uncensored versions), SmolLM2-1.7B-Instruct.Q8_0.gguf, dolphin3.0-qwen2.5-0.5b-q8_0.gguf or smollm2-360m-instruct-q8_0.gguf. I'm also keeping tabs on NVidia's Hymba 1.5B, but no GGUF'ed versions of it have surfaced so far. All I know is that it's already somewhere in QuantFactory's queue of requests. I also tried quantizing it myself using the gguf-my-repo space ([4], requires a Hugging Face account), but it doesn't look like it's even supported by llama.cpp yet.

So, now that we know what to run and how to run it, the main question remains: what can we really do with it? Well, again, if you set the right expectations, quite a lot, especially when it comes to the boring tasks that involve the very things these models were designed for in the first place: text generation and analysis. Obviously, the latter is much more resource-heavy than the former, so the idea of using small and local language models on low-performance hardware mostly shines in the "short prompt, long response" scenario.
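The "short prompt, long response" scenario is also easy to drive from a script rather than from an interactive chat. Here's a minimal sketch that assumes the llamafile binary is on your $PATH and relies on the -p prompt flag it inherits from llama.cpp, using one of the model files recommended above:

  # Minimal "short prompt, long response" sketch: call llamafile
  # non-interactively from Python and capture its output. Assumes the
  # llamafile binary is on $PATH and the model file sits in the current
  # directory; -p is the prompt flag inherited from llama.cpp.
  import subprocess

  prompt = "Write a short, polite email asking my boss for a day off on Friday."
  result = subprocess.run(
      ["llamafile", "--cli", "-m", "Llama-3.2-1B-Instruct.Q8_0.gguf",
       "--no-display-prompt", "-p", prompt],
      capture_output=True, text=True, check=True,
  )
  print(result.stdout.strip())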
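And since the Luhn test came up so many times above, here is, for comparison, a minimal Python sketch of what a passing answer roughly looks like (just the textbook algorithm, nothing model-generated):

  # Reference Luhn checksum: compute the check digit and validate a full number.
  def luhn_check_digit(payload: str) -> int:
      """Return the Luhn check digit for a digit string (without the check digit)."""
      total = 0
      # Walk the payload right to left, doubling every second digit starting
      # with the rightmost one; doubled digits above 9 get reduced by 9.
      for i, ch in enumerate(reversed(payload)):
          d = int(ch)
          if i % 2 == 0:
              d *= 2
              if d > 9:
                  d -= 9
          total += d
      return (10 - total % 10) % 10

  def luhn_valid(number: str) -> bool:
      """Validate a full number (payload plus trailing check digit)."""
      return luhn_check_digit(number[:-1]) == int(number[-1])

  print(luhn_check_digit("7992739871"))  # 3
  print(luhn_valid("79927398713"))       # True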
Unsurprisingly, the "short prompt, long response" pattern is exactly what most normies are using the (in)famous ChatGPT for these days: "write me an email to my boss", "write me a landing page about my new cryptocurrency", "suggest an idea for the next video", "convert this structure to an SQL table" and so on. Newsflash: this is exactly the kind of task that can totally be handled by Llama 3.2 1B, Granite 3.1-MoE 1B, SmolLM2 1.7B or (in some cases) even Qwen2.5 0.5B/SmolLM2 360M, completely free and offline, without paying for thin air, putting your privacy at risk or giving your personal data to sketchy CEOs who murder their own employees to stay afloat. And you don't even need **any** GUI to do this: you can run llamafile in a bare terminal (even in Termux on Android, which is what I prefer, btw; I have some ideas about how to integrate all this into my upcoming Android-based magnum opus) or on a remote machine you SSH into. And I haven't even touched the entire "function/tool calling" aspect, because it requires running these models from custom code with an agent framework, not in a raw llamafile chat interface.

The bottom line is that, with this tool and these models, you're back in control as a user. And now you at least know how to stop using yet another proprietary pile of BS if all you need can be achieved locally and with low resource consumption. I'm not sure whether I'll do another post about LLMs or not – maybe about writing structured prompts, switching to bare llama.cpp, tweaking parameters to make the models respond differently, maybe about some open-source STT and TTS tools available for mere mortals, maybe about agents and tool calling from Python code, maybe about the Hymba 1.5B or something else when/if it appears in the GGUF format and impresses me enough to talk about it – but I think this is where we should draw the line. I mean, 2B parameters is currently the threshold beyond which it all just becomes unsustainable, "a thing in itself" that requires you to upgrade your hardware just for the sake of using these things with any degree of comfort. And being dependent upon hardware that you must constantly upgrade "just because" is, in my opinion, not much better than being dependent upon subscription-based online services. Not to mention that, in this case, we're talking about hardware that will inevitably consume more energy to run these LLMs at 100% processing capacity.

Ethical concerns are another thing to consider. By using smaller open models offline, you inherently reduce not only the overall energy consumption but also: 1) the amount of traffic sent from your devices to potentially bad actors, 2) the amount of money sent to those same bad actors, 3) the amount of online slop that has been polluting the clearweb in recent years, 4) the fear of diminishing your own cognitive or creative abilities. After all, you want an assistant, not something that fully thinks for you. No matter what you believe in, don't let the exoskeleton take control over your body. Tech for people, not people for tech.

--- Luxferre ---

[1]: https://github.com/Mozilla-Ocho/llamafile
[2]: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
[3]: https://huggingface.co/
[4]: https://huggingface.co/spaces/ggml-org/gguf-my-repo