Processing Issues

For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode code point. To allow easy searching and truncation, the sequence for one code point must not occur within a longer sequence or across the boundary of two other sequences; encodings with this property are called self-synchronizing. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties, but UTF-7 and GB 18030 do not.
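The practical payoff of self-synchronization is that a reader can always find the next character boundary after a truncation. As a minimal illustrative sketch in Python (the resync_utf8 helper below is hypothetical, not a standard-library function), UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so a reader handed a buffer cut mid-character can simply skip them:

    def resync_utf8(data: bytes) -> bytes:
        """Skip leading continuation bytes (0b10xxxxxx) so decoding can
        resume at the next character boundary after a mid-character cut."""
        i = 0
        while i < len(data) and (data[i] & 0b1100_0000) == 0b1000_0000:
            i += 1
        return data[i:]

    buf = "naïve café".encode("utf-8")
    chunk = buf[3:]                            # cut inside the 2-byte 'ï'
    print(resync_utf8(chunk).decode("utf-8"))  # 've café'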

Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character, because of combining characters (see the sketch below). If you are working heavily with a particular API that has standardised on a particular Unicode encoding, it is generally a good idea to use the same encoding, to avoid converting before every call. Similarly, if you are writing server-side software, it may simplify matters to use the same format for processing as for communication.
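The distinction is easy to demonstrate in Python, where a combining sequence occupies two code points (and therefore two UTF-32 code units) while displaying as a single character:

    import unicodedata

    s = "e\u0301"                        # 'e' + COMBINING ACUTE ACCENT
    print(len(s))                        # 2 code points, one displayed glyph
    print(len(s.encode("utf-32-le")) // 4)       # 2 fixed-size code units
    print(len(unicodedata.normalize("NFC", s)))  # 1 after composing to U+00E9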

UTF-16 is popular because many APIs date to the time when Unicode was a fixed-width 16-bit encoding. However, using UTF-16 makes characters outside the Basic Multilingual Plane a special case, which increases the risk of oversights in their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so switching to UTF-32 is unlikely to solve the more general problem of poor handling of characters that span multiple code units or code points.
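For example, a character outside the Basic Multilingual Plane occupies two UTF-16 code units (a surrogate pair) but only one UTF-32 code unit, as this Python snippet shows:

    clef = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF, U+1D11E
    print(len(clef.encode("utf-16-le")) // 2)  # 2 code units: a surrogate pair
    print(len(clef.encode("utf-32-le")) // 4)  # 1 code unit
    print(clef.encode("utf-16-be").hex())      # 'd834dd1e', high + low surrogate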

If any stored data is in UTF-8 (such as file contents or names), it is very difficult to write a system that uses UTF-16 or UTF-32 as its API. This is due to the often-overlooked fact that a byte array holding UTF-8 can physically contain invalid sequences. For instance, it is impossible to fix an invalid UTF-8 filename through a UTF-16 API, as no possible UTF-16 string will translate to that invalid filename. The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can handle both UTF-8 and UTF-16 files and names, making UTF-8 preferred in any such mixed environment. (An unfortunate but far more common "solution" used by UTF-16 systems is to interpret the UTF-8 as some other encoding such as CP-1252 and accept the mojibake for any non-ASCII data.)
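Python's surrogateescape error handler is one concrete mechanism illustrating this asymmetry: a byte sequence that is invalid UTF-8 cannot be decoded strictly, so no ordinary UTF-16 string maps back to it, yet the raw bytes can still be round-tripped by a UTF-8-oriented API:

    bad = b"abc\x80def"          # 0x80 is a stray continuation byte: invalid UTF-8
    try:
        bad.decode("utf-8")      # strict decoding fails, so no valid Unicode
    except UnicodeDecodeError:   # string corresponds to these bytes
        print("not valid UTF-8")

    # surrogateescape smuggles the bad byte through as a lone surrogate,
    # so the invalid filename survives the round trip unchanged:
    s = bad.decode("utf-8", errors="surrogateescape")
    print(s.encode("utf-8", errors="surrogateescape") == bad)  # True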
