Comparison of Unicode Encodings - Size Issues

Size Issues

UTF-32/UCS-4 requires four bytes to encode any character. Since characters outside the basic multilingual plane (BMP) are typically rare, a document encoded in UTF-32 will often be nearly twice as large as its UTF-16/UCS-2–encoded equivalent because UTF-16 uses two bytes for the characters inside the BMP, or four bytes otherwise.

UTF-8 uses between one and four bytes to encode a character. It requires one byte for ASCII characters, making it half the space of UTF-16 for texts consisting only of ASCII. For other Latin characters and many non-Latin scripts it requires two bytes, the same as UTF-16. Only a few frequently used Western characters in the range U+0800 to U+FFFF, such as the € sign U+20AC, require three bytes in UTF-8. Characters outside of the BMP above U+FFFF need four bytes in UTF-8 and UTF-16.

The conservation of bytes in encoding files to a Unicode transformation format (UTF) depends on encoded code points, namely, blocks from which those code points are drawn. Say, it depends on the scripts in use. For example, UTF-16 use less space than UTF-32 only for characters from BMP, which are though overwhelmingly most common of all Unicode. In the same way using characters predominantly from the UTF-8 scripts makes UTF-8 more space efficient than UTF-16. The UTF-8 scripts are those scripts where UTF-8 only requires fewer than three bytes per character (only one byte for the ASCII-equivalent Basic Latin block, digits and most punctuation marks) and include: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, N'Ko, and the IPA and other Latin-based phonetic alphabets.

All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes.

For seven-bit environments, UTF-7 is more space efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text (see "Seven-bit environments" below).

Read more about this topic:  Comparison Of Unicode Encodings

Famous quotes containing the words size and/or issues:

    Men of genius are not quick judges of character. Deep thinking and high imagining blunt that trivial instinct by which you and I size people up.
    Max Beerbohm (1872–1956)

    Cynicism formulates issues clearly, but only to dismiss them.
    Mason Cooley (b. 1927)