(2023-05-01) Thoughts about stream compression, part 2 ------------------------------------------------------ Yes, finally, on the first day of May, I have something new to tell you. In fact, since the first compression-related post, I also had experimented with BWT (Burrows-Wheeler transformation) and MTF (move-to-front transformation) along with my ultimate RLUE scheme in different combinations, and come to various interesting results, including being able to compress a 2-second static yellow frame video in the raw YUV4MPEG2 4:4:4 format that weighs 86400399 bytes into 119 bytes and successfully decompress it back (and yes, RLUE+BWT+RLUE combo beats gzip, RLUE+gzip or bzip2 in this scenario), but then I felt something else was missing: the final step to compress the excess entropy. At first I really thought of using Golomb-Rice codes (the ones I mentioned and tinkered with a bit in the first part) but then, obviously, practical common sense won and I decided to not introduce any additional complexity by operating on individual bits. I wasn't fully satisfied with ready-made easy solutions like LZP either (although both BWT and LZP are plausible when you aren't too memory-restricted). But then, what to use instead? Well, I have devised an encoding scheme called 4PE, which conveniently means both "four-pair encoding" and "4-bit pair encoding". And now you're going to understand why. The main idea of 4PE is very simple. All byte values are divided into two categories: from 0 to 15 and from 16 to 255. The upper category is the boring one: if a byte is from 16 to 255, we do absolutely nothing with it during both encoding and decoding stage. We must, however, record this fact somewhere, and I'll get to that in a moment. If there is a pair of bytes A and B which are _both_ from 0 to 15, we can encode this pair as a single byte: [4 bits of A][4 bits of B]. With both bytes being e.g. 1, the encoded value becomes, you guessed it, 0b10001 = 31. This is why, for any bytes starting with 16 in the encoded stream, we need a way to distinguish whether it's a literal or an encoded value. And the most compact way to do this is with a header byte, which itself just contains the positions of the following (up to) 8 bytes in the encoded stream are to be treated as literals. As the header byte is mandatory for each 8-byte block, this means every 8 bytes of input can be encoded into 5 (in an ideal situation) to 9 (in the worst case scenario) bytes of output. To mostly shift our input values into the lower spectrum, we can use MTF, deltas and so on, but I'll review practical applications of this scheme in my next post. By the way, what's the position order of bits in the header byte? Theoretically, it doesn't matter as long as the encoder and the decoder agree on this, but I recommend LSB to MSB as it takes less CPU instructions to process. In other words, the recommended header-based stream block format is like this: {lf7 lf6 lf5 lf4 lf3 lf2 lf1 lf0} {b0} [{b1}] [{b2}] ..., where every lfN is a flag bit to tell the decoder whether or not to treat the corresponding input byte bN as a literal. Now that you know the format, here are the algorithms. Let's start with the 4PE encoder. 1. Allocate two 8-byte blocks I and O. 2. Attempt to read 8 bytes from the input stream into the block I. Record the actual amount of read bytes as IL. If no bytes are read (IL = 0), halt the algorithm. 3. Initialize the processed amount OL and the header byte H by setting them to 0. 4. For each input byte B at position P in block I (total IL bytes), repeat steps 5 to 9. 5. If P is less than IL - 1, also read the next byte C = I[P+1]. If both B and C are less than 16, go to step 6. If any of these conditions is false, go to step 7. 6. Set the O[OL] element to the value of B * 16 + C and increment P by 1. Go to step 9. 7. Set the OL'th least significant bit of H to 1. 8. Set the O[OL] element to the value of B. 9. Increment OL by 1. 10. If OL is above zero, emit the value of H and then OL elements of O in their order, all as plain bytes into the output stream. Go to step 2. Now, let's take a look at the 4PE decoder. 1. Set the header expectation flag E to 1. Initialize header byte H, input counter IC and output counter OC by setting them to 0. 2. Read a single byte B from the input. If unsuccessful, halt the algorithm. 3. If the flag E is 1, set the header value H to the value of B and go to step 8. 4. If the IC'th least significant bit of H is set to 1, emit the value of B and go to step 7. 5. Emit the four most significant bits of B as a separate byte value. 6. Emit the four least significant bits of B as a separate byte value. Increment OC. 7. Increment IC and OC. 8. If OC is more than or equal to 8, go to step 1, otherwise set the flag E to 0 and go to step 2. Besides being elementary to implement and suitable for pure stream processing, 4PE also has some interesting properties. For example, in 4PE output, you can visually determine how many bytes you'll need to read to decode every block just by looking at the header byte and its binary representation from right to left: 0 means 2 bytes and 1 means 1 byte, and the limit is 8. That's why, for instance, 0xA0, 0x70 and 0x00 are effectively the same header byte that means that four pair-packed values follow. You can even store additional information in the upper part as it's totally unused... but only in this case. The more ones are there in the lower part of the header byte, the more significant bits the upper part has too. At the other extreme end, the header value 255, which is 0xFF or 0b11111111, means that 8 bytes follow and all are literal, none of them is packed which is the worst case scenario. Fortunately, it only takes a single eligible byte pair to break even, which is why 4PE can really benefit from techniques like move-to-front transformation if the data is correlated enough. Well... That's all folks. Next time, I'll show how we can tie all this altogether to create real-life compression pipelines. Stay tuned! --- Luxferre ---