(2023-04-14) Some thoughts about simple stream compression
----------------------------------------------------------
This blog already contains my opinion about archive formats and why they must
not be confused with compression formats. Well, now it's time to talk about
the latter. But, unlike with archive formats, I won't point to ready-made or
established solutions or algorithms this time; instead, I'll just express
some thoughts on how I would approach the problem of stream compression
myself using the simplest techniques available.
I'm emphasizing the stream part because I want to stress the mandatory
condition that the compressor or decompressor can only operate on the data
it has already encountered when reading the stream, and the fact that the
set of data it can operate on is limited. I.e. it cannot read all the data
into memory at once to create symbol statistics tables, nor can it allocate
unbounded memory chunks during the process to adapt to the data it has
already compressed or decompressed. Stream compression always operates on
what we have right here and now, plus maybe some small buffer whose size is
fixed regardless of the input length.
I'm also emphasizing the simplest techniques. That is, nothing more
complex to implement than LZW or even LZP. And keep in mind that even these
two can be hard to do in really memory-constrained conditions. Think on the
scale of around 4096 bytes max of available RAM for both your program itself
AND the buffers it may allocate. We are, however, allowed to run the data
stream through any amount of these tiny encoders/decoders to get the final
result. So, it's fine to use various simple techniques and combine them at
the higher level.
Let's start with the most obvious one - RLE, run-length encoding. The idea
here is as old as the world itself: if a byte is the same as the previous
one, don't write it to the output stream, just increment a counter; once the
byte changes, write the counter and the previous byte to the output
stream. When it comes to real-life implementations, however, several
questions arise, the first couple of them being: how do we store the length
itself and how do we reliably separate the length from the payload so that
the decoder won't confuse the two?
Most RLE implementations I've seen just limit the maximum length of runs to
128, 256 or some other value that would conveniently fit into the fixed
amount of bits. This is straightforward but quite inefficient. Suppose we're
processing an 800x200 video frame filled with a static color: every
channel in this frame (R, G, B or Y, Cr, Cb - doesn't matter) has 800*200 =
160000 identical bytes. If we adopt the simplest RLE compression scheme
[len][byte] where [len] is also a byte from 0 to 255 (encoding the length
from 1 to 256), this means we can split our 160000 bytes into 625 256-byte
chunks, and this means we have to write 1250 bytes of output data (625
identical pairs of 0xff and whatever the actual value is) because we have
limited the maximum run length to 256 by design. Now, if only there were a
way to encode the length of the length itself, so that we could still
separate it from the data byte, wouldn't waste the extra bytes but also
wouldn't have an upper limit...
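Before moving on, here's what such a fixed-length [len][byte] RLE codec
might look like as a minimal sketch (Python is used purely for illustration;
all the names are mine):

```python
def rle_encode(data, max_run=256):
    """Fixed-length [len][byte] RLE: the length byte stores run length
    minus 1, so a single byte covers runs of 1 to 256."""
    out, i = bytearray(), 0
    while i < len(data):
        run = 1
        while run < max_run and i + run < len(data) and data[i + run] == data[i]:
            run += 1
        out += bytes((run - 1, data[i]))  # emit [len][byte] pair
        i += run
    return bytes(out)

def rle_decode(data):
    out = bytearray()
    for i in range(0, len(data), 2):      # read [len][byte] pairs
        out += bytes((data[i + 1],)) * (data[i] + 1)
    return bytes(out)
```

Feeding it 160000 identical bytes indeed yields 625 pairs of 0xff and the
data byte, i.e. 1250 bytes of output.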
And it turns out there is a way to do this, moreover, the most popular and
versatile text encoding in the world - UTF-8 - also uses it. This technique
is called unary coding. In unary coding, the amount of leading 1 bits before
the first 0 encodes the value, so the zero bit itself acts as a delimiter.
In UTF-8, this amount of leading ones in the header byte indicates how many
bytes the entire character will take. For our purposes, this amount would
indicate how many extra bytes the length field will take. On top of that, we
could borrow the principle from UTF-8 and only use a single byte for runs of
up to 128 bytes instead of 256. I.e. the length of 128 would be
encoded (in binary) as 01111111 (0 as the delimiter and 1111111 as the
value), while the length of 129 would already be encoded in two bytes as
10000000 10000000 (1 as unary-coded extra length, 0 as the delimiter, and
the value minus 1 aligned to the end of the sequence). The maximum run
length one could encode in two bytes would then be 10111111 11111111, or
16384. The length of 16385 would be then encoded as 11000000 01000000
00000000, and three bytes would fit in the length up to 2097152 (11011111
11111111 11111111), and so on. In practice, I doubt one would ever
encounter over 2 megabytes of a single repeated byte, but nevertheless the
scheme scales indefinitely if you need it to. We might make the encoding even more
efficient by considering the offsets from the previous maximums (e.g. 129
would be encoded as 10000000 00000000 instead of 10000000 10000000), thus
increasing the range of values a little, but the practical effect would be
negligible compared to the amount of operations we'd need to add every time
to do this.
So, for the case of our static frame, how well could we compress a single
channel with the unary-coded RLE (or should I call it RLUE - run-length
unary encoding?) instead of fixed length RLE? Let's encode the length of
160000 with our unary coding. First, we need to subtract 1 from the value
itself (159999) and convert it into binary, and we get 100111000011111111.
Nice. Then, we split it into bytes starting from the LSB: 10 01110000
11111111. So, we get three bytes, which means we need to use two ones and
the zero delimiter at the start. So, our RLUE value of 160000 becomes
11000010 01110000 11111111, or C2 70 FF in hex. Now, add the color byte
itself, and we get 4 bytes per channel. We have compressed an entire
480000-byte static frame into 12 bytes with this improved RLUE technique
that, if you think about it, doesn't really introduce any additional
complexity into the encoding or decoding process compared to the fixed RLE
variant.
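Here's a sketch of this unary-prefixed length coding, assuming my reading of
the scheme above (Python just for illustration, names are mine):

```python
def encode_runlen(n):
    """Encode a run length n >= 1: unary count of extra bytes, a 0
    delimiter, then n-1 aligned to the end of the byte sequence."""
    v = n - 1
    extra = 0
    while v >= 1 << (7 + 7 * extra):   # 7 value bits, plus 7 per extra byte
        extra += 1
    total = (((1 << extra) - 1) << (8 + 7 * extra)) | v
    return total.to_bytes(extra + 1, 'big')

def decode_runlen(buf):
    """Return (run length, bytes consumed) from the start of buf."""
    extra = 0
    while buf[0] & (0x80 >> extra):    # count leading ones in the header
        extra += 1
    total = int.from_bytes(buf[:extra + 1], 'big')
    return (total & ((1 << (7 + 7 * extra)) - 1)) + 1, extra + 1
```

As expected, encode_runlen(160000) yields exactly C2 70 FF.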
Of course, this is an ideal situation. In reality, the runs will normally be
much shorter (although they may still exceed the usual 128 or 256 bytes,
so the RLUE approach remains useful). But what's going to happen even
more frequently is that every following byte of your data will **slightly**
differ from the previous one, so a perfect run will never form.
So, the next natural question would be: how can we make use of the
fact that the bytes are close to each other in order to compress them
better? Well, the answer is delta encoding. Instead of writing the current
byte value to the output stream, we write the difference between the current
and the previous bytes (at the beginning of the stream, we assume the
previous byte is always 0). Now, in case of equal or strictly sequential
data (like 1, 3, 5, 7, 9, 11...) delta encoding turns most of it into the
perfect run (1, 2, 2, 2, 2, 2...), but in case of slightly varying data
we're interested in (like 77, 78, 75, 81, 80, 77, 76, 79... ) delta encoding
alone (77, 1, -3, 6, -1, -3, -1, 3...) won't be sufficient to achieve
any compression. So, we must combine deltas with another simple code that
represents integers close to zero with fewer bits than the numbers
further from zero.
Since it's just single bytes we're dealing with and, while encoding deltas,
we always subtract the previous value from the current one, wrapping it
around if necessary, the numbers we must be able to encode with whatever we
choose range from -128 to 127. Since most compression codes were designed
for non-negative numbers, a more practical approach would be to remap our
delta d to the 0..255 range using the following formula: 2d if d>=0, -2d-1
if d<0. This way, deltas with small absolute values will still be closer to
zero. Some codes also don't like encoding zeros, so we add 1 and get the
following code points: 2d+1 if d>=0, -2d if d<0. So now, once we have
converted every delta to an integer from 0 to 255 or from 1 to 256, what do
we choose?
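Sketched in Python (the names are mine), the delta pass with the 0-based
remapping looks like this:

```python
def remap(d):
    """Map a signed delta (-128..127) to 0..255 so that small magnitudes
    stay close to zero: 2d if d >= 0, -2d-1 if d < 0."""
    return 2 * d if d >= 0 else -2 * d - 1

def delta_encode(data):
    """Byte-wise deltas (the previous byte is assumed to be 0 at the
    stream start), wrapped into -128..127, then remapped to 0..255."""
    out, prev = [], 0
    for b in data:
        d = ((b - prev + 128) % 256) - 128   # wrap difference into -128..127
        out.append(remap(d))
        prev = b
    return out
```

Running it on the 77, 78, 75, 81, 80, 77, 76, 79 sequence from above yields
154, 2, 5, 12, 1, 5, 1, 6 - everything but the first value stays
comfortably small.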
After a long train of thought, I'm leaning towards the Golomb-Rice code. In
the variant I know, it's a clever application of our old friend, unary
encoding. Here, however, it doesn't apply to the entire number, but only to
the upper part of it. I.e. we split the bit representation of a number into
a quotient and a remainder, where the remainder is the lower m bits (usually
these codes are named after M = 2**m, e.g. GR8 for a 3-bit remainder, GR16
for a 4-bit remainder and so on), and we encode the quotient in unary and
then write the remainder as is. Decoding is also straightforward here: we read
the unary quotient representation until the first zero bit, then the known
amount of bits in the remainder, and reconstruct the number by decoding the
quotient from unary and just concatenating the remainder. Compared to the
pure unary and some other encoding schemes (like Exp-Golomb or Elias code
family), Golomb-Rice code allows us to adjust the tradeoff between fixing
the bit length of the smaller values and expanding the bit length of the
larger ones. This is why this kind of encoding is especially good for
encoding deltas: the smaller, the better.
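Here's how a Golomb-Rice codeword could be built and parsed - a sketch over
plain bit strings, assuming the unary-quotient-plus-remainder layout
described above:

```python
def gr_encode(n, m):
    """Golomb-Rice codeword for n >= 0: the quotient (n >> m) in unary,
    a 0 delimiter, then the m lower bits of n as the remainder."""
    return '1' * (n >> m) + '0' + format(n & ((1 << m) - 1), '0%db' % m)

def gr_decode(bits, m):
    """Return (value, bits consumed) from the start of a bit string."""
    q = bits.index('0')                 # unary quotient ends at the first 0
    return (q << m) | int(bits[q + 1:q + 1 + m], 2), q + 1 + m
```

For instance, gr_encode(13, 3) yields 10101: one 1 for the quotient, the 0
delimiter, and 101 as the 3-bit remainder.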
Again, remember that we aim for simplicity here, so we can afford to lose
some compression ratio, but we must produce byte-aligned output. So, if we,
for example, read four bytes of input from the stream, we can only encode
them to 1, 2, 3 or 4 bytes of output, nothing in between. This is why, to
make use of GR codes, we still have to read the input stream data into some
buffer. As you can see from the following example, it may be small but still
needs to be present. Note that for the decoder to know how many Golomb-Rice
codewords to read next, the buffer length must be passed beforehand as well,
and we can do this in the same encoding. Note that byte alignment doesn't
hamper the readability of the codewords because we know how many bits to
read at the beginning of every codeword: everything up to the first 0 and
then exactly 3, 4 or 5 bits if we're using GR8, GR16 or GR32 respectively.
Example:
Input stream buffer data: ababagalamaga
Data length in bytes: 13
Data length encoded in GR8: 10101
Data length encoded in GR16: 01101
Data length encoded in GR32: 001101
Total source bit length: 104 bits
Data ASCII: 97 98 97 98 97 103 97 108 97 109 97 103 97
Data deltas: 97 1 -1 1 -1 6 -6 11 -11 12 -12 6 -6
Remapped deltas (0-based): 194 2 1 2 1 12 11 22 21 24 23 12 11
GR8 bitstream: 10101 1111111111111111111111110010 0010 0001 0010 0001 10100
10011 110110 110101 1110000 110111 10100 10011
GR8 bit length: 94 (90.4% of original)
GR8 bit length, byte-aligned: 96 (92.3% of original)
GR16 bitstream: 01101 11111111111100010 00010 00001 00010 00001 01100 01011
100110 100101 101000 100111 01100 01011
GR16 bit length: 86 (82.7% of original)
GR16 bit length, byte-aligned: 88 (84.6% of original)
GR32 bitstream: 001101 111111000010 000010 000001 000010 000001 001100 001011
010110 010101 011000 010111 001100 001011
GR32 bit length: 90 (86.5% of original)
GR32 bit length, byte-aligned: 96 (92.3% of original)
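To double-check the numbers above, here's a quick self-contained script
(again just a sketch; the helper names are mine) that counts delta+GR bits
for the same buffer:

```python
def gr_bits(values, m):
    """Total Golomb-Rice bit length of a value list: for each value,
    the unary quotient + 1 delimiter bit + m remainder bits."""
    return sum((v >> m) + 1 + m for v in values)

data = b'ababagalamaga'
mapped, prev = [], 0
for b in data:
    d = ((b - prev + 128) % 256) - 128          # wrapped delta
    mapped.append(2 * d if d >= 0 else -2 * d - 1)
    prev = b

for m in (3, 4, 5):
    # length field encoded in the same GR code, then the codewords
    total = gr_bits([len(data)], m) + gr_bits(mapped, m)
    print('GR%d: %d bits' % (2 ** m, total))
```

This prints 94, 86 and 90 bits for GR8, GR16 and GR32 respectively,
matching the unaligned totals above.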
What conclusion can we draw from this? Actually, none, except that delta+GR16
encoding performs better at compressing ababagalamagas - that much we know
for sure. The longer the buffer we can capture and the closer its data bytes
are to each other, the better the results will be.
Now, a really good question is: which order should we apply all these
techniques in? The most obvious answer is that it fully depends on the
nature of your data. But what if we don't know? Like, in our case with video
frames, we can estimate their length but that's it, we don't know if they
are going to be mostly static colors or not. If we are compressing text data
(like source code in a shar archive), there is absolutely no sense in even
trying to apply any RLE compression, while GR-encoded deltas could still be
of some use. If we're working with something like UTF-encoded text or some
data that has already undergone fixed-size RLE, it might make sense to not
process single bytes, but pairs or even triplets of them instead. But that's
something I'll tell you about next time, when all these thoughts are sooner
or later going to take some practical shape.
--- Luxferre ---