(2025-10-13) A short rant about "democratizing" genAI, hobbyist software etc
----------------------------------------------------------------------------

As you might have noticed, my break from phlog posting has been extended by a week from what I'd been planning. Too much has been going on throughout this month, so I definitely needed to keep my attention narrowly focused. I promise that I'm eventually going to return to the previous topics like homebrew VMs or the abacus, but right now I'd like to rant about the state of genAI for "mere mortals" and some other adjacent topics, without a particular structure. So, here's my chain of thoughts (no pun intended).

Recently, something reminded me of the existence of KoboldCpp ([1]): a single-binary LLM inference runner that uses llama.cpp under the hood but also incorporates a simple web-based chat UI, a set of various API endpoints (its own, OpenAI-compatible and Ollama-compatible, among many others) and a simple GUI to easily run not only text generation but also image generation, TTS and speech recognition models (via the included stable-diffusion.cpp, TTS.cpp and whisper.cpp distributions respectively).

A couple of years ago, before LM Studio was even a usable thing, KoboldCpp had already gained some traction among lusers who just wanted something fully local for roleplaying scenarios. It was one of the first pieces of software considered to "democratize" running LLMs for the layman. For me, however, as fast and efficient as it is, there are two caveats: 1) it doesn't directly use the llama.cpp binaries but the underlying GGML engine instead (exposing a totally different set of command-line parameters), hence some strange defaults like the default maximum generated token limit; 2) the GGML engine version it uses doesn't get updated as often as upstream llama.cpp, so some models can still be out of reach, like the recent Granite 4 MoE series. It's only a matter of time and patience before the newer engine finally gets merged into Kobold.

As of now, I haven't yet explored its image generation or speech capabilities, but I run the text generation server as follows:

#!/bin/sh
# Sensible defaults to run KoboldCpp

# context window size
CTXSIZE=32768

# hardware options
# HARDOPTS="--usecpu --gpulayers 0 --usemmap"
HARDOPTS="--usevulkan --flashattention --gpulayers 99"

# run it, capping the default generation amount at 8192 tokens and
# passing all script arguments (e.g. the GGUF file) through
koboldcpp --defaultgenamt $(($CTXSIZE<8192?$CTXSIZE:8192)) --skiplauncher --contextsize $CTXSIZE $HARDOPTS "$@"

Then I just pass the GGUF file as the parameter to the script and that's it. In the model list in the API, it gets exposed as "koboldcpp/filename_without_ext" (a quick curl check of that is sketched a bit further below). For CPU-only inference, the commented-out HARDOPTS line should be used instead of the Vulkan one. Also, as you can see, KoboldCpp can only generate at most 8192 tokens at a time, even when the context window is larger; upstream llama.cpp doesn't have such a limitation.

On the upside, in case you only need to deploy KoboldCpp to non-GPU systems, there is a "nocuda" binary version that weighs much less, as well as an "oldpc" version that disables AVX2 instructions automatically (something that requires a separate CLI flag in the other builds) and performs some other tricks to run models on older hardware.
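Speaking of those API endpoints, here's a minimal sketch of how a freshly started instance can be sanity-checked from another shell. The details here are my assumptions, not something from the script above: the default listening port of 5001, the usual OpenAI-style paths (/v1/models and /v1/chat/completions), and a purely hypothetical model name; adjust all of them to your setup.

# where the KoboldCpp server is assumed to be listening (default port 5001)
HOST="http://127.0.0.1:5001"

# list the models: the GGUF passed to the launch script should show up
# as "koboldcpp/<filename_without_ext>"
curl -s "$HOST/v1/models"

# request a short completion via the OpenAI-compatible endpoint;
# the model name below is hypothetical, mirror whatever GGUF you launched with
curl -s "$HOST/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "koboldcpp/your-model-file",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'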
Well, why would I even consider using this instead of the usual llama-server (or llamafile, previously covered here, if I need a single-file deployment)? Well... KoboldCpp really is easier on RAM consumption. I've purchased a really cheap VPS (around $23/year!) with the only caveat being that it has a single core and 512 MB of RAM.

Are there any modern LLMs suitable for such amounts of RAM? Sure, there are, like Gemma 3 270M in the 4-bit QAT version. But are there any engines suitable for running this LLM with such amounts of RAM? Well, KoboldCpp is one of them: I just had to append the "--noavx2" flag for it not to throw an "illegal instruction" error, and I still got speeds of 11 to 14 tokens per second (given that the context window size had to be decreased to 4096 tokens; the invocation I ended up with is sketched a bit below). Ten times slower than on a "normal" system, but still perfectly adequate. The llamafile-based deployment, on the other hand, showed about 5 to 6 t/s at most while obviously consuming much more RAM per inference. If/when time permits, I'm also going to dig up some of my older hardware and run some more tests.
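For reference, here's roughly what the launch script from above boils down to on that 512 MB box. This is a sketch under a couple of my own assumptions: a CPU-only build is used, and the GGUF file name is only an example.

#!/bin/sh
# Low-RAM variant of the launch script for a single-core, 512 MB VPS (a sketch)

# shrink the context window to fit into the available RAM
CTXSIZE=4096

# CPU-only inference without AVX2 (the regular builds need --noavx2 explicitly)
HARDOPTS="--usecpu --gpulayers 0 --usemmap --noavx2"

# run it; the GGUF file name here is just an example, substitute your own
koboldcpp --defaultgenamt $CTXSIZE --skiplauncher --contextsize $CTXSIZE $HARDOPTS gemma-3-270m-it-qat-q4_0.gguf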
Another side of the coin is the actual usage of all this goodness. Recently, a controversial question has emerged: if LLMs have become so good at coding, where is all the hobbyist-level and indie software, why aren't we experiencing a boom of it? Well, from my own experience, I think I know the answer to this question, and it's not a simple one either; it's as multifaceted as the technology itself.

First, true indie developers most often are resource-conscious, so they will be the last ones to adopt LLMs for any coding assistance. They view themselves, some of them rightfully so, as artisans rather than code monkeys, and like to be in control of everything that happens in their creations. I share this view myself, so I'm never going to use LLMs for anything but some boring yet inevitable boilerplate parts of code.

The second reason we still aren't experiencing that indie boom to the fullest is that many "vibe coders" really thought that an LLM could think FOR them throughout the entire development process, rather than merely assist with the boilerplate coding part. Offloading design decisions and mission-critical bits of code to LLMs has already led to some disasters, with many more to come. We just aren't at the "good enough" phase yet, no matter how the corporate marketoids are trying to convince you otherwise.

The third reason is something I have already discussed a bit in this phlog, and it has little to do with LLMs per se: desktop software just isn't that popular anymore. An average vibe coder would try generating a mobile app at best, but most of the time it's just some React-based browser-oriented crap. I mean, instead of React, you can insert Next.js, Expo or whatever RAM-hogging framework is popular this particular week; that doesn't really matter. What matters is that, in the eyes of the masses, the definition of "software" really has shifted towards this. And because of the amount of already existing JSX/TSX garbage to train on, it's pretty much the only kind of coding LLMs are *kinda* good at now. Can the result be usable by the general public? Maybe. Can it compete with the industry giants? Probably. Does it qualify as "indie development" in my eyes? Hell no. Truly independent software never requires a gigabyte-sized browser engine to run.

Want to create independent desktop software? Learn Tcl/Tk, or at least Tkinter if you already know Python (if only because Python already ships with it). Wanna go mobile too? Learn Go + Fyne then. Wanna go online? Learn Elixir. There are lots of better ways of doing stuff than just succumbing to the mainstream crapware frameworks for languages that aren't supposed to be used for that stuff in the first place.

By choosing the right tool for the job from the beginning, you make things much easier for your future self. And then, and only then, as I already said some time ago, you may use LLMs to assist you: not to think for you, but to help you write the boring parts of your implementation while you stay in control. Sober, aware and independent.

--- Luxferre ---

[1]: https://github.com/LostRuins/koboldcpp