Comparison of Unicode Encodings - Size Issues

Size Issues

UTF-32/UCS-4 requires four bytes to encode any character. Because characters outside the Basic Multilingual Plane (BMP) are typically rare, a document encoded in UTF-32 will often be nearly twice as large as its UTF-16/UCS-2 equivalent: UTF-16 uses two bytes for each character inside the BMP and four bytes for every other character.
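The factor of two can be checked with a minimal sketch using Python's built-in codecs. The helper name `encoded_size` and the sample strings are illustrative assumptions, not part of any standard; the little-endian codec variants are chosen only so that no byte order mark is added to the counts.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

# The "-le" codec variants are used so that no byte order mark is prepended.
bmp_text = "Hello, \u4e16\u754c"   # 9 characters, all inside the BMP
astral_text = "\U0001F600"          # 1 character outside the BMP

bmp_utf16 = encoded_size(bmp_text, "utf-16-le")
bmp_utf32 = encoded_size(bmp_text, "utf-32-le")
astral_utf16 = encoded_size(astral_text, "utf-16-le")
astral_utf32 = encoded_size(astral_text, "utf-32-le")

print(bmp_utf16, bmp_utf32)        # 18 36
print(astral_utf16, astral_utf32)  # 4 4
```

For the BMP-only string, UTF-32 is exactly twice the size of UTF-16; for the character outside the BMP, both encodings need four bytes.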

UTF-8 uses between one and four bytes to encode a character. It requires one byte for ASCII characters, so text consisting only of ASCII takes half the space it would in UTF-16. For other Latin characters and many non-Latin scripts it requires two bytes, the same as UTF-16. Characters in the range U+0800 to U+FFFF require three bytes in UTF-8; only a few of these, such as the euro sign € (U+20AC), are in frequent use in Western text. Characters outside the BMP (above U+FFFF) need four bytes in both UTF-8 and UTF-16.
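As a rough illustration of these ranges, the following sketch encodes one sample character from each and reports the byte counts. The sample characters, the `samples` dictionary and the `byte_lengths` helper are illustrative choices, not taken from the text above.

```python
# One sample character per UTF-8 length class (illustrative choices).
samples = {
    "A": "\u0041",           # ASCII
    "e-acute": "\u00e9",     # Latin-1 Supplement
    "euro sign": "\u20ac",   # in the range U+0800 to U+FFFF
    "emoji": "\U0001F600",   # outside the BMP
}

def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

table = {name: byte_lengths(ch) for name, ch in samples.items()}

for name, (u8, u16) in table.items():
    code_point = ord(samples[name])
    print(f"{name:10} U+{code_point:04X}  UTF-8: {u8}  UTF-16: {u16}")
```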

How many bytes a Unicode transformation format (UTF) saves depends on which code points are encoded, that is, on the blocks from which they are drawn and therefore on the scripts in use. For example, UTF-16 uses less space than UTF-32 only for characters from the BMP, although these are by far the most common characters in Unicode. In the same way, text drawn predominantly from the "UTF-8 scripts" is more space efficient in UTF-8 than in UTF-16. The UTF-8 scripts are those in which UTF-8 requires fewer than three bytes per character (and only one byte for the ASCII-equivalent Basic Latin block, which includes digits and most punctuation marks). They include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, N'Ko, and the IPA and other Latin-based phonetic alphabets.
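The dependence on script can be sketched by encoding two short phrases, one in Cyrillic (one of the scripts listed above) and one in Japanese hiragana (which falls in the three-byte UTF-8 range). The phrases and the helper name `utf8_vs_utf16` are assumptions chosen for illustration; real documents will vary with their mix of letters, spaces, digits and punctuation.

```python
def utf8_vs_utf16(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

russian = "Привет, мир!"   # 9 Cyrillic letters plus 3 ASCII characters
japanese = "こんにちは"      # 5 hiragana characters

ru_utf8, ru_utf16 = utf8_vs_utf16(russian)
ja_utf8, ja_utf16 = utf8_vs_utf16(japanese)

print(ru_utf8, ru_utf16)   # 21 24
print(ja_utf8, ja_utf16)   # 15 10
```

In the Cyrillic sample the letters cost two bytes in either encoding, so the UTF-8 saving comes entirely from the one-byte comma, space and exclamation mark. In the hiragana sample every character costs three bytes in UTF-8 against two in UTF-16.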

All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes.

For seven-bit environments, UTF-7 is more space efficient than the combination of other Unicode encodings with quoted-printable or base64 for almost all types of text (see "Seven-bit environments" below).
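A minimal sketch of this comparison for one mostly-ASCII sample, using Python's `utf-7` codec and the standard `base64` and `quopri` modules. The sample string is an illustrative assumption, and a single short string does not establish the general claim; it only shows how the sizes can be measured.

```python
import base64
import quopri

# Mostly ASCII with one non-ASCII character (U+263A); an illustrative sample.
text = "Hi Mom \u263a!"

utf8 = text.encode("utf-8")

# Size in bytes of each seven-bit-safe representation of the same text.
sizes = {
    "utf-7": len(text.encode("utf-7")),
    "utf-8 + base64": len(base64.b64encode(utf8)),
    "utf-8 + quoted-printable": len(quopri.encodestring(utf8)),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For this sample the UTF-7 form is the shortest of the three. Base64 inflates every byte of the UTF-8 text, including the ASCII ones, while quoted-printable spends three bytes on each non-ASCII byte.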
