Some important information about UTF-8
--------------------------------------

The de-facto text encoding of today. Here's how codepoints can be
represented with a variable number of bytes:

1. Single bytes represent codepoints from 0x00 to 0x7F
   (identical to 7-bit ASCII)

2. Multibyte sequences (codepoints from 0x80 to 0x10FFFF):

   Header bytes: 0xC0 to 0xFD by bit pattern; since RFC 3629 only
   0xC2 to 0xF4 are valid (sequences of 2 to 4 bytes)
   - the number of '1' bits above the topmost '0' bit indicates
     the number of bytes (including this one) in the whole sequence
   - the data payload starts _after_ the topmost '0' bit in this byte

   Trailer bytes: 0x80 to 0xBF
   - the data payload starts _after_ the topmost '0' bit in this byte
     (i.e. the lower 6 bits)

3. Invalid bytes: 0xFE, 0xFF
   - must never occur in a UTF-8 text
   - under RFC 3629 the same holds for 0xC0, 0xC1 and 0xF5 to 0xFD

Surrogate pairs (the UTF-16 representation of codepoints above
0xFFFF = 65535 with two code units from the range 0xD800 to 0xDFFF):

1. Convert from a codepoint to the pair:

   lead  = 0xD7C0 + (codepoint >> 10)
   trail = 0xDC00 + (codepoint & 0x3FF)

2. Convert from a pair to the codepoint:

   codepoint = (lead << 10) + trail - 0x35FDC00

--- Luxferre ---
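
The rules above can be sketched in Python roughly as follows. This is
only an illustration, assuming the modern 1-to-4-byte form of UTF-8;
the function names (utf8_encode, utf8_decode, to_surrogates,
from_surrogates) are made up for this note and do not come from any
library. The decoder handles a single sequence and does not reject
overlong forms.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one codepoint as UTF-8 (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable codepoint")
    if cp < 0x80:
        return bytes([cp])  # single byte, same as 7-bit ASCII
    # Pick the sequence length from the number of payload bits needed.
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    else:
        n = 4
    # Header byte: n '1' bits, one '0' bit, then the top payload bits.
    header_mark = (0xFF << (8 - n)) & 0xFF
    out = [header_mark | (cp >> (6 * (n - 1)))]
    # Trailer bytes: bits '10' followed by 6 payload bits each.
    for i in range(n - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


def utf8_decode(seq: bytes) -> int:
    """Decode exactly one UTF-8 sequence into a codepoint (illustrative)."""
    first = seq[0]
    if first < 0x80:
        if len(seq) != 1:
            raise ValueError("unexpected extra bytes")
        return first
    # Count the '1' bits above the topmost '0' bit of the header byte.
    n = 0
    while first & (0x80 >> n):
        n += 1
    if not 2 <= n <= 4 or len(seq) != n:
        raise ValueError("bad header byte or wrong sequence length")
    cp = first & (0x7F >> n)  # payload bits after the topmost '0' bit
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad trailer byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a codepoint above 0xFFFF into a (lead, trail) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    return 0xD7C0 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


def from_surrogates(lead: int, trail: int) -> int:
    """Combine a (lead, trail) pair back into the codepoint."""
    return (lead << 10) + trail - 0x35FDC00
```

For example, utf8_encode(0x20AC) gives the three bytes E2 82 AC, and
to_surrogates(0x1F600) gives the pair (0xD83D, 0xDE00).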