(2023-04-14) Some thoughts about simple stream compression
----------------------------------------------------------

This blog already contains my opinion about archive formats and why they must not be confused with compression formats. Well, now it's time to talk about the latter. But, unlike archive formats, I won't point to ready-made or established solutions or algorithms this time, and will just express some thoughts on how I would approach the problem of stream compression myself using the simplest techniques available.

I'm emphasizing the stream part because I want to stress the mandatory condition that the compressor or decompressor can only operate on the data it has already encountered when reading the stream, and the fact that the set of data it can operate on is limited. I.e. it cannot read all the data at once into memory to create symbol statistics tables, nor can it allocate unbounded memory chunks during the process to adapt to the already compressed or decompressed data. Stream compression always operates on what we have right here and now, plus maybe some small buffer whose size is always fixed regardless of the input length.

I'm also emphasizing the simplest techniques. That is, nothing more complex to implement than LZW or even LZP. And keep in mind that even these two can be hard to do in really memory-constrained conditions. Think on the scale of around 4096 bytes max of available RAM for both your program itself AND the buffers it may allocate. We are, however, allowed to run the data stream through any number of these tiny encoders/decoders to get the final result. So, it's fine to use various simple techniques and combine them at a higher level.

Let's start with the most obvious one - RLE, run-length encoding. The idea here is as old as this world: if the byte is the same as the previous one, don't write it to the output stream and just increment a counter, and as soon as the byte changes, write the counter and the previous byte to the output stream.
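To make the idea concrete, here is a minimal Python sketch of that basic transform (the `rle_runs`/`rle_expand` names are mine, purely for illustration):

```python
def rle_runs(data: bytes):
    """Collapse the input into (run_length, byte) pairs."""
    runs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                     # extend the current run
        runs.append((j - i, data[i]))  # emit the counter and the byte
        i = j
    return runs

def rle_expand(runs):
    """Inverse transform: repeat every byte run_length times."""
    return bytes(b for length, b in runs for _ in range(length))
```

For example, `rle_runs(b"aaabbbcc")` yields `[(3, 97), (3, 98), (2, 99)]`. The interesting part is not this loop, though, but how those counters get serialized.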
When it comes to real-life implementations, however, several questions arise, the first couple of them being: how do we store the length itself, and how do we reliably separate the length from the payload so that the decoder won't confuse the two?

Most RLE implementations I've seen just limit the maximum length of runs to 128, 256 or some other value that would conveniently fit into a fixed amount of bits. This is straightforward but quite inefficient. Suppose we're processing a 800x200 video frame with a static color; this means every channel in this frame (R, G, B or Y, Cr, Cb - doesn't matter) has 800*200 = 160000 identical bytes. If we adopt the simplest RLE compression scheme [len][byte], where [len] is also a byte from 0 to 255 (encoding the length from 1 to 256), this means we can split our 160000 bytes into 625 256-byte chunks, and this means we have to write 1250 bytes of output data (625 identical pairs of 0xff and whatever the actual value is) because we have limited the maximum run length to 256 by design.

Now, if only there were a way to encode the length of the length itself, so that we could still separate it from the data byte, wouldn't waste the extras but also wouldn't have the upper limit... And it turns out there is a way to do this; moreover, the most popular and versatile text encoding in the world - UTF-8 - also uses it. This technique is called unary coding. In unary coding, the amount of leading 1 bits before the first 0 encodes the value, so the zero bit itself acts as a delimiter here. In UTF-8, this amount of leading ones in the header byte indicates how many bytes the entire character will take. For our purposes, this amount would indicate how many extra bytes the length field will take. On top of that, we could borrow the principle from UTF-8 and only use a single byte for runs up to 128 bytes long instead of 256. I.e.
the length of 128 would be encoded (in binary) as 01111111 (0 as the delimiter and 1111111 as the value), while the length of 129 would already be encoded in two bytes as 10000000 10000000 (1 as the unary-coded extra length, 0 as the delimiter, and the value minus 1 aligned to the end of the sequence). The maximum run length one could encode in two bytes would then be 10111111 11111111, or 16384. The length of 16385 would then be encoded as 11000000 01000000 00000000, three bytes would fit lengths up to 2097152 (11011111 11111111 11111111), and so on. In practice, I doubt one would encounter over 2 megabytes of a single repeated byte, but nevertheless the scheme is infinitely scalable if you need it. We might make the encoding even more efficient by counting the offsets from the previous maximums (e.g. 129 would be encoded as 10000000 00000000 instead of 10000000 10000000), thus increasing the range of values a little, but the practical effect would be negligible compared to the amount of operations we'd need to add every time to do this.

So, for the case of our static frame, how well could we compress a single channel with the unary-coded RLE (or should I call it RLUE - run-length unary encoding?) instead of fixed-length RLE? Let's encode the length of 160000 with our unary coding. First, we subtract 1 from the value itself (159999) and convert it into binary, and we get 100111000011111111. Nice. Then, we split it into bytes starting from the LSB: 10 01110000 11111111. So, we get three bytes, which means we need to use two ones and the zero delimiter at the start. So, our RLUE value of 160000 becomes 11000010 01110000 11111111, or C2 70 FF in hex. Now, add the color byte itself, and we get 4 bytes per channel.
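Here's a sketch of this length encoding in Python (the `rlue_*` names are made up; a real memory-constrained implementation would look different, and this sketch assumes the length fits into at most 7 extra bytes):

```python
def rlue_encode_length(length: int) -> bytes:
    """Encode a run length >= 1 as a unary-prefixed, UTF-8-style sequence."""
    v = length - 1
    k = 0
    while v >> (7 + 7 * k):          # find how many extra bytes we need
        k += 1
    # k leading ones, a zero delimiter, then the top bits of the value
    head = ((0xFF << (8 - k)) & 0xFF) | (v >> (8 * k))
    return bytes([head] + [(v >> (8 * i)) & 0xFF for i in range(k - 1, -1, -1)])

def rlue_decode_length(buf: bytes):
    """Return (run length, number of bytes consumed)."""
    head = buf[0]
    k = 0
    while head & (0x80 >> k):        # count leading ones = number of extra bytes
        k += 1
    v = head & (0xFF >> (k + 1))     # bits after the zero delimiter
    for i in range(1, k + 1):
        v = (v << 8) | buf[i]
    return v + 1, k + 1
```

With this, `rlue_encode_length(160000)` produces exactly the C2 70 FF sequence derived by hand above.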
We have compressed an entire 480000-byte static frame into 12 bytes with this improved RLUE technique that, if you think about it, doesn't really introduce any additional complexity into the encoding or decoding process compared to the fixed RLE variant.

Of course, this is an ideal situation. In reality, the runs will normally be much shorter (although they may be longer than the usual 128 or 256 bytes, so the RLUE approach is still good to use). But what's going to happen even more frequently is that every following byte of your data will **slightly** differ from the previous one, so a perfect run will not form. So, the next natural question would be: how can we make use of the fact that the bytes are close to each other in order to compress them better?

Well, the answer is delta encoding. Instead of writing the current byte value to the output stream, we write the difference between the current and the previous bytes (at the beginning of the stream, we assume the previous byte is always 0). Now, in case of equal or strictly sequential data (like 1, 3, 5, 7, 9, 11...) delta encoding turns most of it into a perfect run (1, 2, 2, 2, 2, 2...), but in case of the slightly varying data we're interested in (like 77, 78, 75, 81, 80, 77, 76, 79...) delta encoding alone (77, 1, -3, 6, -1, -3, -1, 3...) won't be sufficient to achieve any compression. So, we must combine deltas with another simple code that represents integer numbers closer to zero with fewer bits than the numbers further from zero. Since it's just single bytes we're dealing with and, while encoding deltas, we always subtract the previous value from the current one, wrapping around if necessary, the numbers we must be able to encode with whatever we choose range from -128 to 127. Since most compression codes were designed for non-negative numbers, a more practical approach would be to remap our delta d to the 0..255 range using the following formula: 2d if d>=0, -2d-1 if d<0.
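A tiny sketch of the wrapped deltas and this remap formula (function names are mine):

```python
def signed_deltas(data: bytes):
    """Byte-wise deltas with wraparound; the byte before the stream is 0."""
    prev, out = 0, []
    for b in data:
        out.append((b - prev + 128) % 256 - 128)  # wrapped difference, -128..127
        prev = b
    return out

def remap(d: int) -> int:
    """Map a signed delta to 0..255: 2d if d >= 0, -2d-1 if d < 0."""
    return 2 * d if d >= 0 else -2 * d - 1
```

For the sample bytes above, `signed_deltas(bytes([77, 78, 75, 81, 80, 77, 76, 79]))` returns `[77, 1, -3, 6, -1, -3, -1, 3]`, and `remap` then interleaves the signs, sending -1, 1, -2, 2 to 1, 2, 3, 4.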
This way, deltas with small absolute values will still be closer to zero. Some codes also don't like encoding zeros, so we add 1 and get the following code points: 2d+1 if d>=0, -2d if d<0.

So now, once we have converted every delta to an integer from 0 to 255 or from 1 to 256, what do we choose? After a long train of thought, I'm leaning towards the Golomb-Rice code. In the variant I know, it's a clever application of our old friend unary encoding. Here, it doesn't apply to the entire number though, but only to the upper part of it. I.e. we split the bit representation of a number into the quotient and the remainder, where the remainder is the lower m bits (usually these codes are named after M = 2**m, e.g. GR8 for a 3-bit remainder, GR16 for a 4-bit remainder and so on), and we encode the quotient in unary and then write the remainder as is. Decoding is also straightforward here: we read the unary quotient representation until the first zero bit, then the known number of bits in the remainder, and reconstruct the number by decoding the quotient from unary and just concatenating the remainder. Compared to pure unary and some other encoding schemes (like Exp-Golomb or the Elias code family), the Golomb-Rice code allows us to adjust the tradeoff between fixing the bit length of the smaller values and expanding the bit length of the larger ones. This is why this kind of encoding is especially good for encoding deltas: the smaller, the better.

Again, remember that we aim for simplicity here, so we can afford to lose some compression ratio but we must produce byte-aligned output. So, if we, for example, read four bytes of input from the stream, we can only encode them to 1, 2, 3 or 4 bytes of output, nothing in between. This is why, to make use of GR codes, we still have to read input stream data into some buffer. As you can see from the following example, it may be small but still needs to be present.
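A single GR codeword, sketched on bit strings for readability (a real encoder would pack bits, of course; the names are mine):

```python
def gr_encode(n: int, m: int) -> str:
    """GR(2**m) codeword for n >= 0: the quotient in unary (ones
    terminated by a zero), then the remainder as m plain bits."""
    q, r = n >> m, n & ((1 << m) - 1)
    return "1" * q + "0" + format(r, "0%db" % m)

def gr_decode(bits: str, m: int):
    """Decode one codeword from the front; return (value, leftover bits)."""
    q = bits.index("0")               # the unary part ends at the first 0
    r = int(bits[q + 1 : q + 1 + m], 2)
    return (q << m) | r, bits[q + 1 + m :]
```

For instance, `gr_encode(13, 3)` gives "10101" and `gr_encode(13, 4)` gives "01101" - the GR8 and GR16 length codewords you'll see in the example below.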
Note that for the decoder to know how many Golomb codewords to read next, the buffer length must be passed beforehand as well, and we can do this in the same encoding. Note also that byte alignment doesn't hamper decodability, because we know how many bits to read at the beginning of every codeword: everything up to the first 0 and then exactly 3, 4 or 5 bits if we're using GR8, GR16 or GR32 respectively.

Example:

Input stream buffer data: ababagalamaga
Data length in bytes: 13
Data length encoded in GR8: 10101
Data length encoded in GR16: 01101
Data length encoded in GR32: 001101
Total source bit length: 104 bits
Data ASCII: 97 98 97 98 97 103 97 108 97 109 97 103 97
Data deltas: 97 1 -1 1 -1 6 -6 11 -11 12 -12 6 -6
Remapped deltas (0-based): 194 2 1 2 1 12 11 22 21 24 23 12 11
GR8 bitstream: 10101 1111111111111111111111110010 0010 0001 0010 0001 10100 10011 110110 110101 1110000 110111 10100 10011
GR8 bit length: 94 (90.4% of original)
GR8 bit length, byte-aligned: 96 (92.3% of original)
GR16 bitstream: 01101 11111111111100010 00010 00001 00010 00001 01100 01011 100110 100101 101000 100111 01100 01011
GR16 bit length: 86 (82.7% of original)
GR16 bit length, byte-aligned: 88 (84.6% of original)
GR32 bitstream: 001101 111111000010 000010 000001 000010 000001 001100 001011 010110 010101 011000 010111 001100 001011
GR32 bit length: 90 (86.5% of original)
GR32 bit length, byte-aligned: 96 (92.3% of original)

What conclusion can we draw from this? Actually, none, except that delta+GR16 encoding performs better at compressing ababagalamagas - that we know for sure. The longer the buffer we can capture and the closer its data bytes are to each other, the better the results will be.

Now, a really good question is: which order should we apply all these techniques in? The most obvious answer is that it fully depends on the nature of your data. But what if we don't know?
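As a sanity check on the example, the whole buffer pipeline - length header, wrapped deltas, remap, GR codewords - fits in a few lines of Python (again a bit-string sketch, not production code):

```python
def compress_buffer(data: bytes, m: int) -> str:
    """Emit the buffer length, then every remapped delta, as GR(2**m)
    codewords concatenated into one bit string."""
    def gr(n):  # a single Golomb-Rice codeword: unary quotient + m-bit remainder
        return "1" * (n >> m) + "0" + format(n & ((1 << m) - 1), "0%db" % m)
    bits, prev = gr(len(data)), 0
    for b in data:
        d = (b - prev + 128) % 256 - 128             # wrapped signed delta
        bits += gr(2 * d if d >= 0 else -2 * d - 1)  # remap to 0..255
        prev = b
    return bits
```

For the ababagalamaga buffer this reproduces the figures from the example: 94, 86 and 90 bits for the 3-, 4- and 5-bit remainders respectively.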
Like, in our case with video frames, we can estimate their length but that's it; we don't know if they are going to be mostly static colors or not. If we are compressing text data (like source code in a shar archive), there is absolutely no sense in even trying to apply any RLE compression, while GR-encoded deltas could still be of some use. If we're working with something like UTF-encoded text, or some data that has already undergone fixed-size RLE, it might make sense to process not single bytes, but pairs or even triplets of them instead. But that's something I'll tell you about next time, when all these thoughts sooner or later take some practical shape.

--- Luxferre ---