(2023-04-14) Some thoughts about simple stream compression
----------------------------------------------------------

This blog already contains my opinion about archive formats and why they must not be confused with compression formats. Well, now it's time to talk about the latter. But, unlike archive formats, I won't point to ready-made or established solutions or algorithms this time, and will just express some thoughts on how I would approach the problem of stream compression myself using the simplest techniques available.

I'm emphasizing the stream part because I want to stress the mandatory condition that the compressor or decompressor can only operate on the data it has already encountered when reading the stream, and the fact that the set of data it can operate on is limited. I.e. it cannot read all the data at once into memory to create symbol statistics tables, nor can it allocate unbounded memory chunks during the process to adapt to the already compressed or decompressed data. Stream compression always operates on what we have right here and now, plus maybe some small buffer whose size is always fixed regardless of the input length.

I'm also emphasizing the simplest techniques. That is, nothing more complex to implement than LZW or even LZP. And keep in mind that even these two can be hard to do in really memory-constrained conditions. Think on the scale of around 4096 bytes max of available RAM for both your program itself AND the buffers it may allocate. We are, however, allowed to run the data stream through any number of these tiny encoders/decoders to get the final result. So, it's fine to use various simple techniques and combine them at a higher level.

Let's start with the most obvious one - RLE, run-length encoding. The idea here is as old as this world: if the byte is the same as the previous one, don't write it to the output stream and just increment a counter, and as soon as the byte changes, write the counter and the previous byte to the output stream.
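To make the idea concrete, here is a minimal Python sketch of that basic transform (the `rle_runs`/`rle_expand` names are mine, purely for illustration):

```python
def rle_runs(data: bytes):
    """Collapse the input into (run_length, byte) pairs."""
    runs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                     # extend the current run
        runs.append((j - i, data[i]))  # emit the counter and the byte
        i = j
    return runs

def rle_expand(runs):
    """Inverse transform: repeat every byte run_length times."""
    return bytes(b for length, b in runs for _ in range(length))
```

For example, `rle_runs(b"aaabbbcc")` yields `[(3, 97), (3, 98), (2, 99)]`. The interesting part is not this loop, though, but how those counters get serialized.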
When it comes to real-life implementations, however, several questions arise, the first couple of them being: how do we store the length itself, and how do we reliably separate the length from the payload so that the decoder won't confuse the two?

Most RLE implementations I've seen just limit the maximum length of runs to 128, 256 or some other value that would conveniently fit into a fixed amount of bits. This is straightforward but quite inefficient. Suppose we're processing a 800x200 video frame with a static color; this means every channel in this frame (R, G, B or Y, Cr, Cb - doesn't matter) has 800*200 = 160000 identical bytes. If we adopt the simplest RLE compression scheme [len][byte], where [len] is also a byte from 0 to 255 (encoding the length from 1 to 256), this means we can split our 160000 bytes into 625 256-byte chunks, and this means we have to write 1250 bytes of output data (625 identical pairs of 0xff and whatever the actual value is) because we have limited the maximum run length to 256 by design.

Now, if only there were a way to encode the length of the length itself, so that we could still separate it from the data byte, wouldn't waste the extras but also wouldn't have the upper limit... And it turns out there is a way to do this; moreover, the most popular and versatile text encoding in the world - UTF-8 - also uses it. This technique is called unary coding. In unary coding, the amount of leading 1 bits before the first 0 encodes the value, so the zero bit itself acts as a delimiter here. In UTF-8, this amount of leading ones in the header byte indicates how many bytes the entire character will take. For our purposes, this amount would indicate how many extra bytes the length field will take. On top of that, we could borrow the principle from UTF-8 and only use a single byte for runs up to 128 bytes long instead of 256. I.e.
the length of 128 would be encoded (in binary) as 01111111 (0 as the delimiter and 1111111 as the value), while the length of 129 would already be encoded in two bytes as 10000000 10000000 (1 as the unary-coded extra length, 0 as the delimiter, and the value minus 1 aligned to the end of the sequence). The maximum run length one could encode in two bytes would then be 10111111 11111111, or 16384. The length of 16385 would then be encoded as 11000000 01000000 00000000, three bytes would fit lengths up to 2097152 (11011111 11111111 11111111), and so on. In practice, I doubt one would encounter over 2 megabytes of a single repeated byte, but nevertheless the scheme is infinitely scalable if you need it. We might make the encoding even more efficient by counting the offsets from the previous maximums (e.g. 129 would be encoded as 10000000 00000000 instead of 10000000 10000000), thus increasing the range of values a little, but the practical effect would be negligible compared to the amount of operations we'd need to add every time to do this.

So, for the case of our static frame, how well could we compress a single channel with the unary-coded RLE (or should I call it RLUE - run-length unary encoding?) instead of fixed-length RLE? Let's encode the length of 160000 with our unary coding. First, we subtract 1 from the value itself (159999) and convert it into binary, and we get 100111000011111111. Nice. Then, we split it into bytes starting from the LSB: 10 01110000 11111111. So, we get three bytes, which means we need to use two ones and the zero delimiter at the start. So, our RLUE value of 160000 becomes 11000010 01110000 11111111, or C2 70 FF in hex. Now, add the color byte itself, and we get 4 bytes per channel.
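Here's a sketch of this length encoding in Python (the `rlue_*` names are made up; a real memory-constrained implementation would look different, and this sketch assumes the length fits into at most 7 extra bytes):

```python
def rlue_encode_length(length: int) -> bytes:
    """Encode a run length >= 1 as a unary-prefixed, UTF-8-style sequence."""
    v = length - 1
    k = 0
    while v >> (7 + 7 * k):          # find how many extra bytes we need
        k += 1
    # k leading ones, a zero delimiter, then the top bits of the value
    head = ((0xFF << (8 - k)) & 0xFF) | (v >> (8 * k))
    return bytes([head] + [(v >> (8 * i)) & 0xFF for i in range(k - 1, -1, -1)])

def rlue_decode_length(buf: bytes):
    """Return (run length, number of bytes consumed)."""
    head = buf[0]
    k = 0
    while head & (0x80 >> k):        # count leading ones = number of extra bytes
        k += 1
    v = head & (0xFF >> (k + 1))     # bits after the zero delimiter
    for i in range(1, k + 1):
        v = (v << 8) | buf[i]
    return v + 1, k + 1
```

With this, `rlue_encode_length(160000)` produces exactly the C2 70 FF sequence derived by hand above.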
We have compressed an entire 480000-byte static frame into 12 bytes with this improved RLUE technique that, if you think about it, doesn't really introduce any additional complexity into the encoding or decoding process compared to the fixed RLE variant.

Of course, this is an ideal situation. In reality, the runs will normally be much shorter (although they may be longer than the usual 128 or 256 bytes, so the RLUE approach is still good to use). But what's going to happen even more frequently is that every following byte of your data will **slightly** differ from the previous one, so a perfect run will not form. So, the next natural question would be: how can we make use of the fact that the bytes are close to each other in order to compress them better?

Well, the answer is delta encoding. Instead of writing the current byte value to the output stream, we write the difference between the current and the previous bytes (at the beginning of the stream, we assume the previous byte is always 0). Now, in case of equal or strictly sequential data (like 1, 3, 5, 7, 9, 11...) delta encoding turns most of it into a perfect run (1, 2, 2, 2, 2, 2...), but in case of the slightly varying data we're interested in (like 77, 78, 75, 81, 80, 77, 76, 79...) delta encoding alone (77, 1, -3, 6, -1, -3, -1, 3...) won't be sufficient to achieve any compression. So, we must combine deltas with another simple code that represents integer numbers closer to zero with fewer bits than the numbers further from zero. Since it's just single bytes we're dealing with and, while encoding deltas, we always subtract the previous value from the current one, wrapping around if necessary, the numbers we must be able to encode with whatever we choose range from -128 to 127. Since most compression codes were designed for non-negative numbers, a more practical approach would be to remap our delta d to the 0..255 range using the following formula: 2d if d>=0, -2d-1 if d<0.
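A tiny sketch of the wrapped deltas and this remap formula (function names are mine):

```python
def signed_deltas(data: bytes):
    """Byte-wise deltas with wraparound; the byte before the stream is 0."""
    prev, out = 0, []
    for b in data:
        out.append((b - prev + 128) % 256 - 128)  # wrapped difference, -128..127
        prev = b
    return out

def remap(d: int) -> int:
    """Map a signed delta to 0..255: 2d if d >= 0, -2d-1 if d < 0."""
    return 2 * d if d >= 0 else -2 * d - 1
```

For the sample bytes above, `signed_deltas(bytes([77, 78, 75, 81, 80, 77, 76, 79]))` returns `[77, 1, -3, 6, -1, -3, -1, 3]`, and `remap` then interleaves the signs, sending -1, 1, -2, 2 to 1, 2, 3, 4.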
This way, deltas with small absolute values will still be closer to zero. Some codes also don't like encoding zeros, so we add 1 and get the following code points: 2d+1 if d>=0, -2d if d<0.

So now, once we have converted every delta to an integer from 0 to 255 or from 1 to 256, what do we choose? After a long train of thought, I'm leaning towards the Golomb-Rice code. In the variant I know, it's a clever application of our old friend unary encoding. Here, it doesn't apply to the entire number though, but only to the upper part of it. I.e. we split the bit representation of a number into the quotient and the remainder, where the remainder is the lower m bits (usually these codes are named after M = 2**m, e.g. GR8 for a 3-bit remainder, GR16 for a 4-bit remainder and so on), and we encode the quotient in unary and then write the remainder as is. Decoding is also straightforward here: we read the unary quotient representation until the first zero bit, then the known number of bits in the remainder, and reconstruct the number by decoding the quotient from unary and just concatenating the remainder. Compared to pure unary and some other encoding schemes (like Exp-Golomb or the Elias code family), the Golomb-Rice code allows us to adjust the tradeoff between fixing the bit length of the smaller values and expanding the bit length of the larger ones. This is why this kind of encoding is especially good for encoding deltas: the smaller, the better.

Again, remember that we aim for simplicity here, so we can afford to lose some compression ratio but we must produce byte-aligned output. So, if we, for example, read four bytes of input from the stream, we can only encode them to 1, 2, 3 or 4 bytes of output, nothing in between. This is why, to make use of GR codes, we still have to read input stream data into some buffer. As you can see from the following example, it may be small but still needs to be present.
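A single GR codeword, sketched on bit strings for readability (a real encoder would pack bits, of course; the names are mine):

```python
def gr_encode(n: int, m: int) -> str:
    """GR(2**m) codeword for n >= 0: the quotient in unary (ones
    terminated by a zero), then the remainder as m plain bits."""
    q, r = n >> m, n & ((1 << m) - 1)
    return "1" * q + "0" + format(r, "0%db" % m)

def gr_decode(bits: str, m: int):
    """Decode one codeword from the front; return (value, leftover bits)."""
    q = bits.index("0")               # the unary part ends at the first 0
    r = int(bits[q + 1 : q + 1 + m], 2)
    return (q << m) | r, bits[q + 1 + m :]
```

For instance, `gr_encode(13, 3)` gives "10101" and `gr_encode(13, 4)` gives "01101" - the GR8 and GR16 length codewords you'll see in the example below.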
Note that for the decoder to know how many Golomb codewords to read next, the buffer length must be passed beforehand as well, and we can do this in the same encoding. Note also that byte alignment doesn't hamper decodability, because we know how many bits to read at the beginning of every codeword: everything up to the first 0 and then exactly 3, 4 or 5 bits if we're using GR8, GR16 or GR32 respectively.

Example:

Input stream buffer data: ababagalamaga
Data length in bytes: 13
Data length encoded in GR8: 10101
Data length encoded in GR16: 01101
Data length encoded in GR32: 001101
Total source bit length: 104 bits
Data ASCII: 97 98 97 98 97 103 97 108 97 109 97 103 97
Data deltas: 97 1 -1 1 -1 6 -6 11 -11 12 -12 6 -6
Remapped deltas (0-based): 194 2 1 2 1 12 11 22 21 24 23 12 11
GR8 bitstream: 10101 1111111111111111111111110010 0010 0001 0010 0001 10100 10011 110110 110101 1110000 110111 10100 10011
GR8 bit length: 94 (90.4% of original)
GR8 bit length, byte-aligned: 96 (92.3% of original)
GR16 bitstream: 01101 11111111111100010 00010 00001 00010 00001 01100 01011 100110 100101 101000 100111 01100 01011
GR16 bit length: 86 (82.7% of original)
GR16 bit length, byte-aligned: 88 (84.6% of original)
GR32 bitstream: 001101 111111000010 000010 000001 000010 000001 001100 001011 010110 010101 011000 010111 001100 001011
GR32 bit length: 90 (86.5% of original)
GR32 bit length, byte-aligned: 96 (92.3% of original)

What conclusion can we draw from this? Actually, none, except that delta+GR16 encoding performs better at compressing ababagalamagas - that we know for sure. The longer the buffer we can capture and the closer its data bytes are to each other, the better the results will be.

Now, a really good question is: which order should we apply all these techniques in? The most obvious answer is that it fully depends on the nature of your data. But what if we don't know?
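As a sanity check on the example, the whole buffer pipeline - length header, wrapped deltas, remap, GR codewords - fits in a few lines of Python (again a bit-string sketch, not production code):

```python
def compress_buffer(data: bytes, m: int) -> str:
    """Emit the buffer length, then every remapped delta, as GR(2**m)
    codewords concatenated into one bit string."""
    def gr(n):  # a single Golomb-Rice codeword: unary quotient + m-bit remainder
        return "1" * (n >> m) + "0" + format(n & ((1 << m) - 1), "0%db" % m)
    bits, prev = gr(len(data)), 0
    for b in data:
        d = (b - prev + 128) % 256 - 128             # wrapped signed delta
        bits += gr(2 * d if d >= 0 else -2 * d - 1)  # remap to 0..255
        prev = b
    return bits
```

For the ababagalamaga buffer this reproduces the figures from the example: 94, 86 and 90 bits for the 3-, 4- and 5-bit remainders respectively.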
Like, in our case with video frames, we can estimate their length but that's it; we don't know if they are going to be mostly static colors or not. If we are compressing text data (like source code in a shar archive), there is absolutely no sense in even trying to apply any RLE compression, while GR-encoded deltas could still be of some use. If we're working with something like UTF-encoded text, or some data that has already undergone fixed-size RLE, it might make sense to process not single bytes, but pairs or even triplets of them instead. But that's something I'll tell you about next time, when all these thoughts sooner or later take some practical shape.

--- Luxferre ---