Variable-width Encoding - Unicode Variable-width Encodings

Unicode Variable-width Encodings

The Unicode standard has two variable-width encodings: UTF-8 and UTF-16 (it also has a fixed-width encoding, UTF-32). Originally, both Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16 bit and ISO 10646 being 32 bit. ISO 10646 provided a variable-width encoding called UTF-1, in which singletons had the range 00-9F, lead units the range A0-FF and trail units the range A0-FF and 21-7E. Because of this bad design, parallel to Shift-JIS and Big5 in its overlap of values, the inventors of the Plan 9 operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00-7F, lead units have the range C0-FD (now actually C2-F4, to avoid overlong sequences and to maintain synchronism with the encoding capacity of UTF-16; see UTF-8 article), and trail units have the range 80-BF. The lead unit also tells how many trail units follow: one after C2-DF, two after E0-EF and three after F0-F4.

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called in Unicode terminology high surrogates and low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making for a maximum of possible 1,114,112 codepoints in Unicode.

Read more about this topic:  Variable-width Encoding