Some important information about UTF-8
--------------------------------------
UTF-8 is the de facto text encoding of today. Here's how it represents
codepoints with a variable number of bytes:

1. Single bytes represent codepoints from 0x00 to 0x7F (identical to 7-bit
ASCII)
2. Multibyte sequences (codepoints from 0x80 to 0x10FFFF):
   Header bytes: 0xC2 to 0xF4
   - the number of '1' bits above the topmost '0' bit indicates the number of
bytes (including this one) in the whole sequence: 2, 3 or 4
   - the data payload starts _after_ the topmost '0' bit in this byte
   Trailer bytes: 0x80 to 0xBF
   - the data payload starts _after_ the topmost '0' bit in this byte, so each
trailer byte carries 6 bits
3. Invalid bytes: 0xC0, 0xC1, 0xF5 to 0xFF - must never occur in a UTF-8 text
   (0xC0 and 0xC1 could only start overlong encodings of ASCII; 0xF5 and above
would start sequences beyond 0x10FFFF or longer than 4 bytes)
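
The rules above can be sketched in a few lines of Python. This is only an
illustration, not a reference implementation: it assumes the modern 4-byte
limit (codepoints up to 0x10FFFF, as in RFC 3629), and the function names
utf8_sequence_length and utf8_encode are made up for this example.

```python
def utf8_sequence_length(first_byte):
    """Return the sequence length announced by a first byte, or 0 if the
    byte cannot start a sequence (assumes the modern 4-byte limit)."""
    if first_byte <= 0x7F:
        return 1  # 0xxxxxxx: single byte, identical to ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2  # 110xxxxx: two '1' bits above the topmost '0'
    if 0xE0 <= first_byte <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= first_byte <= 0xF4:
        return 4  # 11110xxx
    return 0      # trailer byte (0x80 to 0xBF) or an invalid byte


def utf8_encode(codepoint):
    """Encode one codepoint into UTF-8 bytes (hypothetical helper)."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable codepoint: %#x" % codepoint)
    if codepoint <= 0x7F:
        return bytes([codepoint])
    if codepoint <= 0x7FF:
        # 5 payload bits in the header, 6 in the trailer
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:
        # 4 payload bits in the header, 6 in each of two trailers
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 3 payload bits in the header, 6 in each of three trailers
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

For example, utf8_encode(0x20AC) (the euro sign) gives the three bytes
0xE2 0x82 0xAC, and utf8_sequence_length(0xE2) gives 3.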

Surrogate pairs (the UTF-16 representation of codepoints above 0xFFFF = 65535
with two 16-bit values from the reserved range 0xD800 to 0xDFFF; surrogates
themselves must never be encoded in UTF-8):

1. Convert from a codepoint to the pair:
   lead = 0xD7C0 + (codepoint >> 10)
   trail = 0xDC00 + (codepoint & 0x3FF)
2. Convert from a pair to the codepoint:
   codepoint = (lead << 10) + trail - 0x35FDC00
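
The same two formulas as a small Python sketch. The function names
to_surrogates and from_surrogates and the range checks are additions made for
this example; the arithmetic is taken directly from the formulas above.

```python
def to_surrogates(codepoint):
    """Split a codepoint above 0xFFFF into a (lead, trail) pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    lead = 0xD7C0 + (codepoint >> 10)      # 0xD7C0 = 0xD800 - (0x10000 >> 10)
    trail = 0xDC00 + (codepoint & 0x3FF)   # low 10 bits
    return lead, trail


def from_surrogates(lead, trail):
    """Combine a (lead, trail) pair back into a single codepoint."""
    if not (0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # 0x35FDC00 = (0xD800 << 10) + 0xDC00 - 0x10000
    return (lead << 10) + trail - 0x35FDC00
```

For example, to_surrogates(0x1F600) gives (0xD83D, 0xDE00), and
from_surrogates(0xD83D, 0xDE00) gives 0x1F600 again.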

--- Luxferre ---