Big-endian - Endianness in Files and Byte Swap

Endianness becomes a problem when a binary file created on one computer is read on another computer with a different endianness. Some compilers have built-in facilities for handling data written in other formats. For example, the Intel Fortran compiler supports the non-standard CONVERT specifier, so a file can be opened as

OPEN(unit,CONVERT='BIG_ENDIAN',...)

or

OPEN(unit,CONVERT='LITTLE_ENDIAN',...)

Some compilers also have options that generate code to enable the conversion globally for all file I/O operations. This allows code to be reused on a system with the opposite endianness without modification. If the compiler does not support such conversion, the programmer must swap the bytes with ad hoc code.
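As a rough illustration of what such ad hoc code can look like, here is a minimal Python sketch. The function names swap32 and read_uint32 are hypothetical, chosen for this example only; the sketch assumes unsigned 32-bit integers and uses only the standard struct module and the built-in int conversion methods.

```python
import struct

def swap32(value: int) -> int:
    """Reverse the byte order of an unsigned 32-bit integer (hypothetical helper)."""
    # Serialize as big-endian, then reinterpret the same bytes as little-endian.
    return int.from_bytes(value.to_bytes(4, "big"), "little")

def read_uint32(raw: bytes, file_byte_order: str) -> int:
    """Decode 4 bytes using the byte order the file was written in.

    file_byte_order is 'big' or 'little' (an assumption of this sketch).
    """
    fmt = ">I" if file_byte_order == "big" else "<I"
    return struct.unpack(fmt, raw)[0]
```

Decoding with an explicit byte order, as in read_uint32, is often simpler than reading in native order and swapping afterwards, because the result does not depend on the machine running the code.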

Fortran sequential unformatted files created with one endianness usually cannot be read on a system using the other endianness, because Fortran usually implements a record (defined as the data written by a single Fortran statement) as data preceded and followed by count fields, which are integers equal to the number of bytes in the data. An attempt to read such a file on a system of the other endianness then results in a run-time error, because the count fields are interpreted with the wrong byte order. This problem can be avoided by writing sequential binary files rather than sequential unformatted files.
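The record layout described above can be read outside Fortran as well. The following Python sketch assumes 4-byte signed count fields, which is a common choice but not guaranteed by every compiler, and the helper name read_record is hypothetical. It tries both byte orders and accepts the one for which the leading and trailing counts agree.

```python
import struct

def read_record(buf: bytes, offset: int = 0):
    """Return (payload, byte_order, next_offset) for one record.

    Assumes each record is framed by two 4-byte counts (an assumption;
    some compilers use other widths). byte_order is '<' or '>'.
    """
    for order in ("<", ">"):
        (count,) = struct.unpack_from(order + "i", buf, offset)
        end = offset + 4 + count
        # A count read with the wrong byte order is usually negative
        # or far larger than the remaining data.
        if count < 0 or end + 4 > len(buf):
            continue
        (trailer,) = struct.unpack_from(order + "i", buf, end)
        if trailer == count:
            return buf[offset + 4:end], order, end + 4
    raise ValueError("no consistent record markers found")
```

This heuristic is ambiguous for counts whose byte pattern is the same in both orders (for example, an empty record), so a real reader would normally fix the byte order once per file rather than per record. The payload itself still has to be decoded with the detected byte order.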

Unicode text can optionally start with a byte order mark (BOM) to signal the endianness of the file or stream. Its code point is U+FEFF. In UTF-32, for example, a big-endian file should start with 00 00 FE FF; in a little-endian file these bytes are reversed (FF FE 00 00).
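A BOM check can be sketched in a few lines of Python. The function name detect_bom is hypothetical. The UTF-16 entries go beyond the UTF-32 example in the text: they are the same code point U+FEFF encoded in two bytes.

```python
def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    # UTF-32 must be tested first: the UTF-32 little-endian BOM
    # (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
    boms = [
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\xfe\xff", "utf-16-be"),
        (b"\xff\xfe", "utf-16-le"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Once the encoding is known, the remaining bytes can be decoded with it, for instance data[bom_length:].decode(encoding_name). Note that a UTF-16 little-endian text whose first character is U+0000 would be misread as UTF-32 by this ordering; that case is rare enough that the sketch ignores it.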

Application binary data formats, such as MATLAB .mat files or the .BIL data format used in topography, are usually endianness-independent. This is achieved either by always storing the data in one fixed endianness or by carrying with the data a flag that indicates which endianness it was written with. When reading the file, the application converts the endianness transparently to the user.

TIFF image files are one example: the header indicates the endianness of the file's internal binary integers. If a file starts with the signature "MM", integers are represented as big-endian, while "II" means little-endian. Each signature occupies a single 16-bit word and is a palindrome (that is, it reads the same forwards and backwards), so it is itself endianness-independent. "I" stands for Intel and "M" stands for Motorola, the respective CPU providers of the IBM PC compatibles and Apple Macintosh platforms in the 1980s. Intel CPUs are little-endian, while Motorola 680x0 CPUs are big-endian. This explicit signature allows a TIFF reader program to swap bytes if necessary when a given file was generated by a TIFF writer program running on a computer with a different endianness.
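The signature check can be sketched in Python as follows. The function name tiff_byte_order is hypothetical. The sketch also verifies the 16-bit value 42 that follows the signature in a TIFF header, a detail not mentioned in the text above, as a sanity check on the chosen byte order.

```python
import struct

def tiff_byte_order(header: bytes) -> str:
    """Return '<' (little-endian) or '>' (big-endian) for a TIFF header."""
    signature = header[:2]
    if signature == b"II":
        order = "<"   # "Intel" order
    elif signature == b"MM":
        order = ">"   # "Motorola" order
    else:
        raise ValueError("not a TIFF byte-order signature")
    # The next 16-bit word should decode to 42 in the declared byte order.
    (magic,) = struct.unpack_from(order + "H", header, 2)
    if magic != 42:
        raise ValueError("unexpected TIFF magic number")
    return order
```

The returned character can be used directly as the byte-order prefix of later struct format strings when decoding the rest of the file.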

The LabVIEW programming environment, though most commonly installed on Windows machines, was first developed on a Macintosh and uses big-endian format for its binary numbers, while most Windows programs use little-endian format.

Note that since the required byte swap depends on the size of the numbers stored in the file (two 2-byte integers require a different swap than one 4-byte integer), the file format must be known to perform endianness conversion.
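To make the dependence on field size concrete, the short Python sketch below swaps the same four bytes under two different layout assumptions. The helper name swap_fields is hypothetical.

```python
def swap_fields(data: bytes, width: int) -> bytes:
    """Reverse the bytes within each consecutive width-byte field."""
    if len(data) % width:
        raise ValueError("data length is not a multiple of the field width")
    return b"".join(
        data[i:i + width][::-1] for i in range(0, len(data), width)
    )

raw = bytes([0x11, 0x22, 0x33, 0x44])

as_one_int32 = swap_fields(raw, 4)   # bytes 44 33 22 11
as_two_int16 = swap_fields(raw, 2)   # bytes 22 11 44 33
```

The two results differ, so a converter that does not know whether the bytes hold one 4-byte integer or two 2-byte integers cannot choose the correct swap.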
