Comparison of Unicode Encodings - Size Issues

UTF-32/UCS-4 requires four bytes to encode any character. Because characters outside the Basic Multilingual Plane (BMP) are typically rare, a document encoded in UTF-32 is often nearly twice as large as its UTF-16/UCS-2 equivalent: UTF-16 uses two bytes for characters inside the BMP and four bytes for all others.
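The two-to-one ratio can be checked with a minimal Python sketch. The sample string and the helper name `encoded_size` are assumptions made for illustration; the `-le` codec variants are used so that no byte order mark is counted.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

# Hypothetical sample: every character here lies inside the BMP.
bmp_text = "Hello, мир 世界"
# A single character outside the BMP (U+1F600), for contrast.
astral_char = "\U0001F600"

utf16_bmp = encoded_size(bmp_text, "utf-16-le")
utf32_bmp = encoded_size(bmp_text, "utf-32-le")
print("BMP text:  UTF-16", utf16_bmp, "bytes, UTF-32", utf32_bmp, "bytes")

utf16_astral = encoded_size(astral_char, "utf-16-le")
utf32_astral = encoded_size(astral_char, "utf-32-le")
print("U+1F600:   UTF-16", utf16_astral, "bytes, UTF-32", utf32_astral, "bytes")
```

For BMP-only text the UTF-32 size is exactly double the UTF-16 size; the two encodings tie only on characters outside the BMP.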

UTF-8 uses between one and four bytes to encode a character. ASCII characters take one byte, so text consisting only of ASCII occupies half the space it would in UTF-16. Other Latin characters and many non-Latin scripts take two bytes, the same as in UTF-16. Only a few frequently used Western characters fall in the range U+0800 to U+FFFF, such as the € sign (U+20AC), and require three bytes in UTF-8. Characters outside the BMP (above U+FFFF) need four bytes in both UTF-8 and UTF-16.
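As a rough illustration of these ranges, the following Python sketch prints the encoded length of one character from each range. The four sample characters and the helper name `bytes_per_char` are chosen here for illustration only.

```python
def bytes_per_char(ch: str) -> tuple:
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# One sample per range: ASCII, Latin-1 letter, euro sign, non-BMP character.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8_len, utf16_len = bytes_per_char(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```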

How many bytes a Unicode transformation format (UTF) saves depends on the code points being encoded, and specifically on the blocks from which they are drawn; in other words, it depends on the scripts in use. For example, UTF-16 uses less space than UTF-32 only for characters from the BMP, although these are by far the most common characters in Unicode. In the same way, text drawn predominantly from the "UTF-8 scripts" is more compact in UTF-8 than in UTF-16. The UTF-8 scripts are those for which UTF-8 requires fewer than three bytes per character (and only one byte for the ASCII-equivalent Basic Latin block, including digits and most punctuation marks). They include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, N'Ko, and the IPA and other Latin-based phonetic alphabets.
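The dependence on script can be seen by encoding whole strings rather than single characters. In this sketch the helper `encoding_sizes` and both sample strings are hypothetical: one is Cyrillic with ASCII punctuation, the other is Japanese hiragana, a script that is not among the UTF-8 scripts listed above.

```python
def encoding_sizes(text: str) -> dict:
    """Return the size in bytes of `text` in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Cyrillic letters take two bytes in UTF-8; the ASCII punctuation takes one.
cyrillic = encoding_sizes("Привет, мир!")
# Hiragana sits above U+0800, so each character takes three bytes in UTF-8.
hiragana = encoding_sizes("こんにちは")

print("Cyrillic:", cyrillic)
print("Hiragana:", hiragana)
```

In the Cyrillic sample UTF-8 comes out ahead only because of the ASCII comma, space and exclamation mark; in the hiragana sample UTF-16 is the smaller of the two.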

All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, because of the design decision to allow the C1 control codes to be encoded as single bytes.

For seven-bit environments, UTF-7 is more space efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text (see "Seven-bit environments" below).
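A small Python sketch can compare these options for a single string. It assumes a hypothetical, mostly ASCII sample and uses UTF-8 as the underlying encoding for the base64 and quoted-printable variants; it illustrates one case and does not establish the general claim, since text dominated by non-ASCII characters may behave differently.

```python
import base64
import quopri

# Hypothetical sample: mostly ASCII with a single non-ASCII character (U+20AC).
sample = "The price is 20 \u20ac per item"

# UTF-7 output is already restricted to seven-bit bytes.
utf7_bytes = sample.encode("utf-7")
utf7_size = len(utf7_bytes)

# The alternative: encode as UTF-8, then wrap in a seven-bit transfer encoding.
utf8_bytes = sample.encode("utf-8")
base64_size = len(base64.b64encode(utf8_bytes))
qp_size = len(quopri.encodestring(utf8_bytes))

print("UTF-7:", utf7_size, "bytes")
print("UTF-8 + quoted-printable:", qp_size, "bytes")
print("UTF-8 + base64:", base64_size, "bytes")
```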
