Collating Sequence - Automated Collation

Automated Collation

When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation algorithm that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate.

The simplest kind of automated collation is based on the numerical codes of the symbols in a character set, such as ASCII coding (or any of its supersets such as Unicode), with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, lexicographical ordering). So a computer program might treat the characters a, b, C, d and $ as being ordered $, C, a, b, d (the corresponding ASCII codes are $ = 36, a = 97, b = 98, C = 67, and d = 100). Therefore strings beginning with C (or any other capital letter) would be sorted before strings with lower-case a, b, etc. This is sometimes called ASCIIbetical order.

The above method has the disadvantage that it can deviate from the standard alphabetical order that human users would expect, particularly due to the unexpected ordering of capital letters before all lower-case ones (and possibly the unexpected treatment of spaces and other non-letter characters). It is therefore often applied with certain refinements, the most obvious being the conversion of capitals to lowercase before comparing ASCII values.

In many collation algorithms, the comparison is based not on the numerical codes of the characters, but with reference to the collating sequence – a sequence in which the characters are assumed to come for the purpose of collation – as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, modified letters, digraphs, particular abbreviations and so on, as mentioned above under Alphabetical order, and in detail in the Alphabetical order article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.

Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.

A standard algorithm for collating any collection of strings composed of any standard Unicode symbols is the Unicode Collation Algorithm. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in Common Locale Data Repository.

Read more about this topic: Collating Sequence

Famous quotes containing the words automated and/or collation:

“Nature is a self-made machine, more perfectly automated than any automated machine. To create something in the image of nature is to create a machine, and it was by learning the inner working of nature that man became a builder of machines.”
—Eric Hoffer (1902–1983)

“[T]ea, that uniquely English meal, that unnecessary collation at which no stimulants—neither alcohol nor meat—are served, that comforting repast of which to partake is as good as second childhood.”
—Angela Carter (1940–1992)