Crlf - Unicode

Unicode

The Unicode standard defines a number of characters and character sequences that conforming applications should recognize as line terminators:

LF: Line Feed, U+000A
VT: Vertical Tab, U+000B
FF: Form Feed, U+000C
CR: Carriage Return, U+000D
CR+LF: CR (U+000D) followed by LF (U+000A)
NEL: Next Line, U+0085
LS: Line Separator, U+2028
PS: Paragraph Separator, U+2029
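
As a minimal sketch of what recognizing all of these terminators could look like, the following Python snippet splits a string on each entry in the list above. The names `UNICODE_NEWLINE` and `split_unicode_lines` are hypothetical and chosen only for illustration; this is one possible approach, not a reference implementation.

```python
import re

# The terminators listed above. CR+LF comes first in the alternation so
# that the pair is consumed as one terminator rather than as CR then LF.
UNICODE_NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_unicode_lines(text):
    """Split text on every line terminator in the list above (illustrative helper)."""
    return UNICODE_NEWLINE.split(text)

sample = "a\nb\x0bc\x0cd\re\r\nf\x85g\u2028h\u2029i"
print(split_unicode_lines(sample))
# → ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```

Python's built-in `str.splitlines()` recognizes the same terminators, and additionally treats the separators U+001C to U+001E as line boundaries, so the explicit pattern is mainly useful when exactly this set is wanted.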

This may seem overly complicated compared to an approach such as converting all line terminators to a single character, for example LF. However, Unicode was designed to preserve all information when converting a text file from any existing encoding to Unicode and back. Unicode therefore has to contain the characters included in existing encodings. NEL is included in EBCDIC with code 0x15. NEL is also a control character in the C1 control set. As such, it is defined by ECMA 48 and recognized by encodings compliant with ISO-2022 (which is equivalent to ECMA 35). The C1 control set is also compatible with ISO-8859-1. The approach taken in the Unicode standard allows round-trip transformation to be information-preserving while still enabling applications to recognize all possible types of line terminators.
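
The round trip described above can be illustrated with a short Python sketch. It assumes that Python's `cp037` codec is an acceptable stand-in for "EBCDIC" (EBCDIC has many code pages, and this is only one of them); the variable names are illustrative.

```python
# Two lines of EBCDIC text separated by the EBCDIC NEL byte 0x15.
# Assumption: code page 037 is used as a representative EBCDIC variant.
ebcdic_bytes = "first".encode("cp037") + b"\x15" + "second".encode("cp037")

# Converting to Unicode keeps NEL as its own character, U+0085,
# instead of folding it into LF.
text = ebcdic_bytes.decode("cp037")
nel_preserved = "\x85" in text and "\n" not in text

# Converting back therefore reproduces the original bytes exactly.
round_trip_ok = text.encode("cp037") == ebcdic_bytes

print(nel_preserved, round_trip_ok)
```

Had the decoder normalized every terminator to LF, encoding the text again could not tell whether the original file contained NEL or the EBCDIC line feed, and the round trip would lose information.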

The newline codes greater than 0x7F (NEL, LS and PS) are rarely recognized or used in practice. They take multiple bytes in UTF-8, and the code for NEL (0x85) has been used as the ellipsis ('…') character in Windows-1252. For instance:

  • YAML no longer recognizes them as special in order to be compatible with JSON.
  • ECMAScript accepts LS and PS as line breaks, but considers U+0085 (NEL) white space, not a line break.
  • Microsoft Windows 2000 does not treat any of NEL, LS or PS as a line break in the default text editor, Notepad.
  • On Linux, the popular editor "gedit" treats LS and PS as newlines but not NEL.
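
The two obstacles mentioned above, multi-byte UTF-8 encodings and the clash with Windows-1252, can be checked directly. The snippet below is a small sketch using only Python's standard codecs; the variable names are illustrative.

```python
# UTF-8 byte sequences of the three terminators above 0x7F.
terminators = {"NEL": "\x85", "LS": "\u2028", "PS": "\u2029"}
utf8_hex = {name: ch.encode("utf-8").hex() for name, ch in terminators.items()}
print(utf8_hex)
# → {'NEL': 'c285', 'LS': 'e280a8', 'PS': 'e280a9'}

# The same single byte 0x85 means different things in different encodings:
as_latin1 = b"\x85".decode("latin-1")   # U+0085, the C1 control NEL
as_cp1252 = b"\x85".decode("cp1252")    # U+2026, the ellipsis character
print(hex(ord(as_latin1)), hex(ord(as_cp1252)))
# → 0x85 0x2026
```

A program that guesses the wrong legacy encoding would therefore either turn an ellipsis into a line break or the reverse, which is one plausible reason tools tend to leave byte 0x85 alone.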
