(2025-01-13) Making GenAI less horrible for the rest of us (with llamafile)
---------------------------------------------------------------------------

"Wait, what? Did ye olde Lux sell out to the hype and hoax?" No, not really. I still hold to my point that generative AI, in its current mainstream state, is a plague on the tech industry that's going to worsen the overall situation for years to come. But, and there always is a "but", there seems to be a way to actually make this technology serve the people, not the megacorps. Even though the very first iteration of what I'm going to talk about was created by megacorps themselves.

As someone on pretty much the opposite end of the computing power spectrum from your average hype-riding techbro, I tried to stay away from the generative AI topic for as long as I could. After all, remote LLMs definitely are a privacy nightmare, even if they state otherwise (surprisingly enough, I started digging deeper into the topic once I saw DuckDuckGo's "AI chat" and an unofficial Python-based CLI interface for it), and local LLMs **usually** require hardware that's too power-hungry (not to mention expensive) for my taste. But then I stumbled upon something that solved both problems at once: a set of relatively small but capable language models AND a tool to run any of them as a server or even a purely terminal-based chat without needing any dedicated GPU, completely on CPU and RAM (and not a lot of it, in fact). So, after years of deliberate silence about LLMs, I finally decided to give them a shot.

First, let's talk about the tool. Although all the current chitchat is around Ollama, I found it too inconvenient for some use cases. I also considered using bare llama.cpp, but it has too many moving parts that I can't handle just yet. Maybe next time. So I settled upon Mozilla's llamafile ([1]), a very convenient wrapper around llama.cpp that can be distributed as a single binary file across multiple OSes and even architectures (x86_64 and ARM64; that's why it weighs over 230 MB, by the way). The full llamafile toolkit even allows you, if you want to, to embed a model file and distribute the entire thing as a single executable blob, which is how I tried it out at first, that is, until I realized there are far more model files than there are ready-made .llamafile executables. Since llamafile is based upon llama.cpp, it consumes the same model file format (GGUF), specified via the mandatory -m flag (well, mandatory unless you run a prebuilt model blob), but we'll get to that format later.

What matters now is that it can run in three modes: terminal chat (the --chat option), non-interactive CLI (the --cli option) or a Web server (the --server option). If none of these three options is specified, it will run in the terminal chat mode while also starting the local Web server on port 8080, bound to the 127.0.0.1 address only (which, of course, you can override with the --port and --host parameters respectively). On one hand, the default server UI might not look appealing to everyone; on the other hand, the very same server (also provided by llama.cpp) offers a rich set of APIs ([2]), including OpenAI-compatible ones, which lets you use the same client libraries and applications you're used to with the proprietary models (LibreChat being the most obvious FOSS example). I can already see how this could be used to set up a private LLM server on my LAN, based on one of my RPi5 machines.
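For instance, here's a minimal sketch of querying such a server from Python through its OpenAI-compatible chat completions endpoint (assuming llamafile is already running in the server mode with some model loaded and listening on the default 127.0.0.1:8080, and that you have the requests library installed; the "model" value below is just a label, the server answers with whatever model it has loaded):

  # Minimal sketch: talk to a running llamafile/llama.cpp server through its
  # OpenAI-compatible chat completions endpoint. Assumes something like
  # `llamafile --server -m your-model.gguf` is already running on 127.0.0.1:8080.
  import requests

  resp = requests.post(
      "http://127.0.0.1:8080/v1/chat/completions",
      json={
          "model": "local",  # just a label; the server uses the model it loaded
          "messages": [
              {"role": "user",
               "content": "How many 'r' letters are in the word 'strawberry'?"},
          ],
          "temperature": 0.2,
      },
      timeout=300,
  )
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])

The same request shape works with any OpenAI-compatible client library if you simply point its base URL at the local server.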
Besides the server mode though, llamafile allows you to do all kinds of awesome stuff you can read about in the "Examples" section of its own help (the --help option). Also, if your RAM allows, don't forget to pass the context size via the -c option (you can check the maximum context size with the /context command in the chat once the model is loaded). You can also set the number of active threads with the -t option (if you don't specify it, it will use half of the available CPU cores). And, by default, it doesn't use GPUs at all. If you have a dedicated GPU and need to offload processing to it, you have to set the -ngl parameter to a non-zero number. Well, I don't even have a dedicated GPU to test this with, but I was quite pleased with how fast it works without one, although it surely all comes down to what kind of model you try to run. By the way, you can get the model's processing speed (in tokens per second) by running the /stats command (after evaluating your prompts) and looking at the last column of the "Prompt eval time" and "Eval time" rows. And if you're already intrigued, here's an alias I created after putting the llamafile binary into my $PATH, so that I only have to add the -m and (optionally) -c parameters:

  alias lchat="llamafile --chat --no-display-prompt --nologo --fast -t $(nproc)"

Now, let's talk about the models. Note that I'll only cover text-only models (we're on Gopher, after all), and I'll talk about them from the end-user perspective, not about how to train, compile or convert them. As I already mentioned, llamafile consumes models in the GGUF format, which stands for GPT-Generated Unified Format and is native to the current llama.cpp versions. Just like with any other format, various model files can be found on the Hugging Face ([3]) repository portal, which is kinda like a GitHub for AI models of all sorts. I won't get into all the specifics, but what matters most when looking for a model is its parameter count (usually measured in millions or, even more often, billions: e.g. a 7B model is a model with around 7 billion parameters) and its quantization level.

Let me quickly explain what that means. The "source" neural network weights are stored as 32-bit or even 64-bit floating point numbers. This gives the best accuracy but takes a huge amount of space and requires a lot of processing power to deal with. That's why, when a model is converted to the GGUF format, those weights are often quantized, i.e. converted to 16-bit floating point numbers or, more often, integers that are much easier for the CPU to process and take up much less space in RAM, at the expense of the model's precision. The quantization level is usually marked by the letter Q and the number of bits per integer, followed by an algorithm marker if the quantization is non-linear (again, I don't know a lot about that part yet). So, Q8 means the weights were converted to 8-bit integers, Q6 means 6-bit integers and so on. Strangely, there are Q3 and Q5 but no Q7. I should note, though, that lower quantization only works well with relatively large models. Provided you have enough storage space and RAM, it doesn't make a lot of sense to choose model files with less precise quantization over something like Q8 for 2B parameters or less, as it's the number of parameters that determines the inference speed for the most part, not the size of a single integer weight.
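To get a rough intuition for what that quantization step does to the weights, here's a toy Python sketch (a deliberate oversimplification: the actual GGUF quantization schemes in llama.cpp work on blocks of weights and are considerably more elaborate):

  # Toy 8-bit quantization: map float32 weights onto int8 with a single scale
  # factor, then reconstruct them and look at the error. For illustration only,
  # this is NOT the actual GGUF/llama.cpp Q8_0 scheme.
  import numpy as np

  weights = np.random.randn(8).astype(np.float32)  # "source" float32 weights
  scale = np.abs(weights).max() / 127.0            # one scale for the whole block
  q8 = np.round(weights / scale).astype(np.int8)   # what gets stored: 8-bit ints
  restored = q8.astype(np.float32) * scale         # what inference effectively sees

  print("original :", weights)
  print("quantized:", q8)
  print("max error:", np.abs(weights - restored).max())

The storage win is obvious: one byte per weight plus a shared scale factor instead of four bytes per weight, and the error stays small as long as the weights sit in a similar range.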
So, which models worked well with llamafile on my "potato-grade" hardware?

By "worked well" I mean not only being fast but also producing little garbage. So you won't see e.g. Gemma 2 2B here, as it's too large, slow and cumbersome on this hardware. Some models (e.g. TinyLlama) only seem to work as intended in the server/API mode but not in llamafile's terminal chat mode (no matter which chat templates I tried selecting), so I won't include such models either. Lastly, there are some models that just aren't supported by llamafile yet, including but not limited to Granite3 and Falcon3. Which is a shame, you know: I had tested Granite 3.1 MoE 1B and Falcon3 1B on Ollama and bare llama.cpp and had a great experience with them, especially Granite. I hope Mozilla adds support for them to llamafile soon.

All the models I looked at were subjected to two basic tests: counting the number of "r" letters in the word "strawberry" and writing Python code to perform Luhn checksum calculation and checking (a reference sketch of what a passing Luhn answer roughly looks like is given closer to the end of this post). If a model passes both tests, I also ask it what 23 * 143 is and, as an advanced task, ask it to "write a true crime story for 10-minute narration, where the crime actually got solved and the perpetrator got arrested". For the models that work for me at least to some extent, I'll give the general names as well as the exact file names and their sizes (from my ls -lah output) so that you can look them up on the Hugging Face portal and try them out yourselves. Let's go!

1. Llama 3.2 1B (Llama-3.2-1B-Instruct.Q8_0.gguf, Llama-3.2-1B-Instruct-Uncensored.Q8_0.gguf, both 1.3G, max context size 131072). The only thing originally created by Meta that I don't really hate. Very impressive for its size. The official version is extremely good at storytelling. The uncensored version helps with some things (e.g. it also mentions IMEIs when asked about the Luhn algorithm). Both versions know how many "r" letters are in the word "strawberry" and how to code Luhn in Python (which is my minimum passing bar for any "serious" LLM) but overall are not very good at coding tasks. Which brings us to...

2. Qwen 2.5 Coder 1.5B (qwen2.5-coder-1.5b-instruct-q8_0.gguf, 1.8G, max context size 32768). Created by Alibaba Cloud and, as the name suggests, tailored for coding tasks (while being unable to multiply 23 and 143, lol). Runs noticeably slower than the Llama and produces redundant code at times, but overall, not so bad.

3. Qwen 2.5 Math 1.5B (Qwen2.5-Math-1.5B-Instruct-Q8_0.gguf, 1.6G, max context size 4096). The same Qwen 2.5, but tailored to being able to multiply 23 and 143, it seems. It also tries to show the reasoning behind everything. It missed the letter "b" and the third "r" in the word "strawberry" though: "The word "strawberry" is composed of the letters: s, t, r, a, w, e, r, y."

4. Athena 1 1.5B and AwA 1.5B (athena-1-1.5b-q8_0.gguf, awa-1.5b-q8_0.gguf, both 1.6G, max context size 32768). Derived from Qwen 2.5 1.5B. A bit slower and more RAM-hungry, but I'd say not bad at all. Both pass the strawberry test but not the Luhn checksum coding test. Well... sometimes Athena does the exact opposite. "AwA" stands for "Answers with Athena" and is just as slow, but I'm not sure whether these two are actually related.

5. Triangulum 1B (Triangulum-1B.Q8_0.gguf, 1.5G, max context size 131072).
Something independent but clearly derived from Llama 3.2, although a bit slower, as it is tailored to natural language processing and translation. It nailed the strawberry question and almost nailed 23 * 143 (the decomposition part was right, but the final 2300 + 920 + 69 addition somehow ended up being 2999, lol), yet it didn't produce any Python code for Luhn and got the algorithm completely wrong. One "feature" that sets this model apart is that it really likes to dilute its answers to the point of self-repetition, so be wary of that.

6. SmolLM2 360M (smollm2-360m-instruct-q8_0.gguf, 369M, max context size 8192). Now, this is something really impressive. And again, from an independent and academic background. Yes, it can't handle 23 * 143 (although it's just off by 10, giving 3299), but it nails the strawberry question. It even generates half-decent Luhn checksum code, code that works correctly in exactly half the cases because it doesn't reverse the digit order, with comments that are also only half-correct, but I'm still stunned. At this size, its peers don't even generate valid Python most of the time. Not to mention how blazingly fast it runs on any of my ARM64 devices. Of course, it can sometimes run into a loop and such, but... With this kind of performance from just a 360M model, it's scary to even imagine what the 1.7B variant is capable of...

7. SmolLM2 1.7B (SmolLM2-1.7B-Instruct.Q8_0.gguf, 1.7G, max context size 8192). So, I found this one on the QuantFactory repo and tried it out. I don't get how it managed to botch the strawberry question, insisting on the wrong answer even though the smaller variant got it right, while producing perfect Luhn checksum code at the same time. Of course, it couldn't answer 23 * 143, but that's something I'm not surprised about at this point. It also isn't as sensitive as Llama when it comes to adapting stories (the end result might need some further rewriting). But it definitely is much faster than e.g. Gemma 2 2B and is a pleasure to use even on my weak Asus.

8. OpenCoder 1.5B (OpenCoder-1.5B-Instruct.Q8_0.gguf, 1.9G, max context size 4096). This is a strange one. It looks independent (although all of its authors are Chinese). It totally botches the strawberry question and is also the only one on the list that honestly admits it cannot calculate 23 * 143, but as for the Luhn question... well, the code looks correct, yet no one alive would use that approach. The nature of that code also hints at some relation to Qwen 2.5. Maybe there's no relation and Qwen was just trained on the same Python data, who knows. I'll investigate this one more before jumping to any conclusions.

9. xLAM 1B-fc-r (xLAM-1b-fc-r.Q8_0.gguf, 1.4G, max context size 16384). An interesting model for sure. It somewhat resembles OpenCoder but is much less strange. It knows the answer to the strawberry question, gives relatively sane Luhn code, completely misses 23 * 143 and cannot write stories. Why? Because it's optimized for function/tool calling, something I'm not yet able to test with llamafile alone. Nevertheless, I think it's a worthy model to include here.

10. Llama-Deepsync 1B (Llama-Deepsync-1B.Q8_0.gguf, 1.3G, max context size 131072). Derived from the Llama 3.2 1B Instruct variant. It nails Luhn immediately and somehow misses the strawberry question on the first try, but corrects itself when asked to think again. On the 23 * 143 problem, it showed its reasoning but just couldn't do the last step (3220 + 69) correctly, producing 3299, 3249 and so on, and even insisted on those answers. Like, WTF?
It also couldn't complete my crime story task. But overall, I like this one too.

I genuinely looked for more plausible examples but, surprisingly, the majority of them didn't pass my basic criteria for being usable in day-to-day life on weak hardware. So, as of the current date and time, here are my conclusions about the available small language models:

1. There are three clear winners at the present moment: Llama 3.2, Qwen 2.5 and SmolLM2. Their <2B versions and derivatives (like Deepsync, Triangulum, Athena, AwA etc.) perform the best on my weak hardware.

2. If you want a model that's as close as possible to a "one-size-fits-all" option, look no further than the Llama 3.2 1B (either official or uncensored). In some areas, it really outperforms even some 1.5B models while consuming far fewer computational resources (and those who don't care about resources are extremely unlikely to even find this phlog). Just set realistic expectations and don't demand things that it really can't do because of its size.

3. If you just want a model that's as small and as fast as possible with little compromise on output quality, then Qwen2.5 0.5B (qwen2.5-0.5b-instruct-q8_0.gguf from the official repo, 645M, max context length 32768) is still an option that's fun to play with. Just be aware that it doesn't know how many "r" letters are in the word "strawberry". However, there also is an uncensored version (dolphin3.0-qwen2.5-0.5b-q8_0.gguf, 507M, max context length 32768) that DOES know the correct answer to that question, although it still cannot write correct Luhn checksum code in Python or even describe the algorithm correctly (its description is pretty close but omits crucial details), and it's pretty bad at math overall. Athena and AwA also have corresponding 0.5B versions that perform on par with the vanilla Qwen, with Athena 0.5B being a bit faster than AwA and actually having about the same size as the "dolphined" Qwen2.5 0.5B.

4. Finally, if you need something even smaller and faster but still as capable, just use the SmolLM2 360M. You won't be disappointed for sure.

To distill this even further, your llamafile binary just needs one of these files to get you started on low-powered hardware: Llama-3.2-1B-Instruct.Q8_0.gguf (or any of its uncensored versions), SmolLM2-1.7B-Instruct.Q8_0.gguf, dolphin3.0-qwen2.5-0.5b-q8_0.gguf or smollm2-360m-instruct-q8_0.gguf. I'm also keeping tabs on NVidia's Hymba 1.5B, but no GGUF'ed versions of it have surfaced so far. All I know is that it's already somewhere in QuantFactory's queue of requests. I also tried quantizing it myself using the gguf-my-repo space ([4], requires a Hugging Face account), but it doesn't look like it's even supported by llama.cpp yet.

So, now that we know what to run and how to run it, the main question remains: what can we really do with it? Well, again, if you set the right expectations, quite a lot, especially when it comes to the boring tasks that involve the very things these models were designed for in the first place: text generation and analysis. Obviously, the latter is much more resource-heavy than the former, so the idea of using small and local language models on low-performance hardware mostly shines in the "short prompt, long response" scenario.
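The "short prompt, long response" scenario is also easy to drive from a script rather than from an interactive chat. Here's a minimal sketch that assumes the llamafile binary is on your $PATH and relies on the -p prompt flag it inherits from llama.cpp, using one of the model files recommended above:

  # Minimal "short prompt, long response" sketch: call llamafile
  # non-interactively from Python and capture its output. Assumes the
  # llamafile binary is on $PATH and the model file sits in the current
  # directory; -p is the prompt flag inherited from llama.cpp.
  import subprocess

  prompt = "Write a short, polite email asking my boss for a day off on Friday."
  result = subprocess.run(
      ["llamafile", "--cli", "-m", "Llama-3.2-1B-Instruct.Q8_0.gguf",
       "--no-display-prompt", "-p", prompt],
      capture_output=True, text=True, check=True,
  )
  print(result.stdout.strip())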
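And since the Luhn test came up so many times above, here is, for comparison, a minimal Python sketch of what a passing answer roughly looks like (just the textbook algorithm, nothing model-generated):

  # Reference Luhn checksum: compute the check digit and validate a full number.
  def luhn_check_digit(payload: str) -> int:
      """Return the Luhn check digit for a digit string (without the check digit)."""
      total = 0
      # Walk the payload right to left, doubling every second digit starting
      # with the rightmost one; doubled digits above 9 get reduced by 9.
      for i, ch in enumerate(reversed(payload)):
          d = int(ch)
          if i % 2 == 0:
              d *= 2
              if d > 9:
                  d -= 9
          total += d
      return (10 - total % 10) % 10

  def luhn_valid(number: str) -> bool:
      """Validate a full number (payload plus trailing check digit)."""
      return luhn_check_digit(number[:-1]) == int(number[-1])

  print(luhn_check_digit("7992739871"))  # 3
  print(luhn_valid("79927398713"))       # True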
Unsurprisingly, the "short prompt, long response" pattern is exactly what most normies are using the (in)famous ChatGPT for these days: "write me an email to my boss", "write me a landing page about my new cryptocurrency", "suggest an idea for the next video", "convert this structure to an SQL table" and so on. Newsflash: this is exactly the kind of task that can totally be handled by Llama 3.2 1B, Granite 3.1-MoE 1B, SmolLM2 1.7B or (in some cases) even Qwen2.5 0.5B/SmolLM2 360M, completely free and offline, without paying for thin air, putting your privacy at risk or giving your personal data to sketchy CEOs who murder their own employees to stay afloat. And you don't even need **any** GUI to do this: you can run llamafile in a bare terminal (even in Termux on Android, which is what I prefer, btw; I have some ideas about how to integrate all this into my upcoming Android-based magnum opus) or on a remote machine you SSH into. And I haven't even touched the entire "function/tool calling" aspect, because it requires running these models from custom code with an agent framework, not in a raw llamafile chat interface.

The bottom line is that, with this tool and these models, you're back in control as a user. And now you at least know how to stop using yet another proprietary pile of BS if all you need can be achieved locally and with low resource consumption. I'm not sure whether I'll do another post about LLMs or not – maybe about writing structured prompts, switching to bare llama.cpp, tweaking parameters to make the models respond differently, maybe about some open-source STT and TTS tools available for mere mortals, maybe about agents and tool calling from Python code, maybe about the Hymba 1.5B or something else when/if it appears in the GGUF format and impresses me enough to talk about it – but I think this is where we should draw the line. I mean, 2B parameters is currently the threshold beyond which it all just becomes unsustainable, "a thing in itself" that requires you to upgrade your hardware just for the sake of using these things with any degree of comfort. And being dependent upon hardware that you must constantly upgrade "just because" is, in my opinion, not much better than being dependent upon subscription-based online services. Not to mention that, in this case, we're talking about hardware that will inevitably consume more energy to run these LLMs at 100% processing capacity.

Ethical concerns are another thing to consider. By using smaller open models offline, you inherently reduce not only the overall energy consumption but also: 1) the amount of traffic sent from your devices to potentially bad actors, 2) the amount of money sent to those same bad actors, 3) the amount of online slop that has been polluting the clearweb in recent years, 4) the fear of diminishing your own cognitive or creative abilities. After all, you want an assistant, not something that fully thinks for you. No matter what you believe in, don't let the exoskeleton take control over your body. Tech for people, not people for tech.

--- Luxferre ---

[1]: https://github.com/Mozilla-Ocho/llamafile
[2]: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
[3]: https://huggingface.co/
[4]: https://huggingface.co/spaces/ggml-org/gguf-my-repo