(2023-05-01) Thoughts about stream compression, part 2
------------------------------------------------------
Yes, finally, on the first day of May, I have something new to tell you. In
fact, since the first compression-related post, I also had experimented with 
BWT (Burrows-Wheeler transformation) and MTF (move-to-front transformation) 
along with my ultimate RLUE scheme in different combinations, and come to 
various interesting results, including being able to compress a 2-second 
static yellow frame video in the raw YUV4MPEG2 4:4:4 format that weighs 
86400399 bytes into 119 bytes and successfully decompress it back (and yes, 
RLUE+BWT+RLUE combo beats gzip, RLUE+gzip or bzip2 in this scenario), but 
then I felt something else was missing: the final step to compress the 
excess entropy. At first I really thought of using Golomb-Rice codes (the 
ones I mentioned and tinkered with a bit in the first part) but then, 
obviously, practical common sense won and I decided to not introduce any 
additional complexity by operating on individual bits. I wasn't fully 
satisfied with ready-made easy solutions like LZP either (although both BWT 
and LZP are plausible when you aren't too memory-restricted). But then, what 
to use instead? Well, I have devised an encoding scheme called 4PE, which 
conveniently means both "four-pair encoding" and "4-bit pair encoding". And 
now you're going to understand why.

The main idea of 4PE is very simple. All byte values are divided into two
categories: from 0 to 15 and from 16 to 255. The upper category is the 
boring one: if a byte is from 16 to 255, we do absolutely nothing with it 
during both encoding and decoding stage. We must, however, record this fact 
somewhere, and I'll get to that in a moment. If there is a pair of bytes A 
and B which are _both_ from 0 to 15, we can encode this pair as a single 
byte: [4 bits of A][4 bits of B]. With both bytes being e.g. 1, the encoded 
value becomes, you guessed it, 0b10001 = 31. This is why, for any bytes 
starting with 16 in the encoded stream, we need a way to distinguish whether 
it's a literal or an encoded value. And the most compact way to do this is 
with a header byte, which itself just contains the positions of the 
following (up to) 8 bytes in the encoded stream are to be treated as 
literals. As the header byte is mandatory for each 8-byte block, this means 
every 8 bytes of input can be encoded into 5 (in an ideal situation) to 9 
(in the worst case scenario) bytes of output. To mostly shift our input 
values into the lower spectrum, we can use MTF, deltas and so on, but I'll 
review practical applications of this scheme in my next post.

By the way, what's the position order of bits in the header byte?
Theoretically, it doesn't matter as long as the encoder and the decoder 
agree on this, but I recommend LSB to MSB as it takes less CPU instructions 
to process. In other words, the recommended header-based stream block format 
is like this:

{lf7 lf6 lf5 lf4 lf3 lf2 lf1 lf0} {b0} [{b1}] [{b2}] ...,

where every lfN is a flag bit to tell the decoder whether or not to treat the
corresponding input byte bN as a literal.

Now that you know the format, here are the algorithms. Let's start with the
4PE encoder.

1. Allocate two 8-byte blocks I and O.
2. Attempt to read 8 bytes from the input stream into the block I. Record the
actual amount of read bytes as IL. If no bytes are read (IL = 0), halt the 
algorithm.
3. Initialize the processed amount OL and the header byte H by setting them
to 0.
4. For each input byte B at position P in block I (total IL bytes), repeat
steps 5 to 9.
5. If P is less than IL - 1, also read the next byte C = I[P+1]. If both B
and C are less than 16, go to step 6. If any of these conditions is false, 
go to step 7.
6. Set the O[OL] element to the value of B * 16 + C and increment P by 1. Go
to step 9.
7. Set the OL'th least significant bit of H to 1.
8. Set the O[OL] element to the value of B.
9. Increment OL by 1.
10. If OL is above zero, emit the value of H and then OL elements of O in
their order, all as plain bytes into the output stream. Go to step 2.

Now, let's take a look at the 4PE decoder.

1. Set the header expectation flag E to 1. Initialize header byte H, input
counter IC and output counter OC by setting them to 0.
2. Read a single byte B from the input. If unsuccessful, halt the algorithm.
3. If the flag E is 1, set the header value H to the value of B and go to
step 8.
4. If the IC'th least significant bit of H is set to 1, emit the value of B
and go to step 7.
5. Emit the four most significant bits of B as a separate byte value.
6. Emit the four least significant bits of B as a separate byte value.
Increment OC.
7. Increment IC and OC.
8. If OC is more than or equal to 8, go to step 1, otherwise set the flag E
to 0 and go to step 2.

Besides being elementary to implement and suitable for pure stream
processing, 4PE also has some interesting properties. For example, in 4PE 
output, you can visually determine how many bytes you'll need to read to 
decode every block just by looking at the header byte and its binary 
representation from right to left: 0 means 2 bytes and 1 means 1 byte, and 
the limit is 8. That's why, for instance, 0xA0, 0x70 and 0x00 are 
effectively the same header byte that means that four pair-packed values 
follow. You can even store additional information in the upper part as it's 
totally unused... but only in this case. The more ones are there in the 
lower part of the header byte, the more significant bits the upper part has 
too. At the other extreme end, the header value 255, which is 0xFF or 
0b11111111, means that 8 bytes follow and all are literal, none of them is 
packed which is the worst case scenario. Fortunately, it only takes a single 
eligible byte pair to break even, which is why 4PE can really benefit from 
techniques like move-to-front transformation if the data is correlated 
enough.

Well... That's all folks. Next time, I'll show how we can tie all this
altogether to create real-life compression pipelines. Stay tuned!

--- Luxferre ---