Details
All numbers in this section are hexadecimal, and all ranges are inclusive.
Code points from U+0000
to U+0020
are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021
through U+D7FF
and U+E000
through U+10FFFF
) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020
). The initial state is U+0040
. The normalization mapping is as follows:
Code range | Normalized code point | Notes |
---|---|---|
U+3040 to U+309F |
U+3070 |
Hiragana |
U+4E00 to U+9FA5 |
U+7711 |
Unihan |
U+AC00 to U+D7A3 |
U+C1D1 |
Hangul |
U+0020 |
Space | |
U+hhhh00 to U+hhhh7F |
U+hhhh40 |
middle of 128 |
U+hhhh80 to U+hhhhFF |
U+hhhhC0 |
middle of 128 |
The difference between the current code point and the normalized previous code point is encoded as follows:
Difference range | Byte sequence range |
---|---|
-10FF9F to -2DD0D |
21 F0 58 D9 to 21 FF FF FF |
-2DD0C to -2912 |
22 01 01 to 24 FF FF |
-2911 to -41 |
25 01 to 4F FF |
-40 to 3F |
50 to CF |
40 to 2910 |
D0 01 to FA FF |
2911 to 2DD0B |
FB 01 01 to FD FF FF |
2DD0C to 10FFBF |
FE 01 01 01 to FE 19 B4 54 |
Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20
. For example, the byte sequence FC 06 FF
, coding for a difference of 1156B
, is immediately followed by the byte sequence FC 10 01
, coding for a difference of 1156C
.
Any ASCII input U+0000
to U+007F
excluding space U+0020
resets the encoder to U+0040
. Because the above mentioned values cover line end code points U+000D
and U+000A
as is (0D 0A
), the encoder is in a known state at the begin of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, for SCSU it can affect the entire document.
BOCU-1 offers a similar robustness also for input texts without the above mentioned values with the special reset code 0xFF
. When a decoder finds this octet it resets its state to U+0040
as for a line end. The use of 0xFF
reset bytes is not recommended in the BOCU-1 specification, because it conflicts with other BOCU-1 design goals, notably the binary order.
The optional use of a signature U+FEFF
at the begin of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28
, changes the initial state U+0040
to U+FE80
. In other words the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF
) could avoid this effect, but the BOCU-1 specification does not recommend this practice.
In theory UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF
. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000
to U+10FFFF
. Excluding the thirteen protected code points encoded as single octets BOCU-1 can use octets in multi-byte encodings. BOCU-1 needs at most four bytes consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference, the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte 0xFF
is not protected and can occur as trail byte.
Read more about this topic: Binary Ordered Compression For Unicode
Famous quotes containing the word details:
“Working women today are trying to achieve in the work world what men have achieved all alongbut men have always had the help of a woman at home who took care of all the other details of living! Today the working woman is also that woman at home, and without support services in the workplace and a respect for the work women do within and outside the home, the attempt to do both is taking its tollon women, on men, and on our children.”
—Jeanne Elium (20th century)
“Anyone can see that to write Uncle Toms Cabin on the knee in the kitchen, with constant calls to cooking and other details of housework to punctuate the paragraphs, was a more difficult achievement than to write it at leisure in a quiet room.”
—Anna Garlin Spencer (18511931)
“Then he told the news media
the strange details of his death
and they hammered him up in the marketplace
and sold him and sold him and sold him.
My death the same.”
—Anne Sexton (19281974)