Controlled Vocabulary - Indexing Languages

Indexing Languages

There are three main types of indexing languages.

  • Controlled indexing language - Only approved terms can be used by the indexer to describe the document
  • Natural language indexing language - Any term from the document in question can be used to describe the document.
  • Free indexing language - Any term (not only from the document) can be used to describe the document.

When indexing a document, the indexer also has to choose the level of indexing exhaustivity, the level of detail in which the document is described. For example using low indexing exhaustivity, minor aspects of the work will not be described with index terms. In general the higher the indexing exhaustivity, the more terms indexed for each document.

In recent years free text search as a means of access to documents has become popular. This involves using natural language indexing with an indexing exhaustively set to maximum (every word in the text is indexed). Many studies have been done to compare the efficiency and effectiveness of free text searches against documents that have been indexed by experts using a few well chosen controlled vocabulary descriptors.

Controlled vocabularies are often claimed to improve the accuracy of free text searching, such as to reduce irrelevant items in the retrieval list. These irrelevant items (false positives) are often caused by the inherent ambiguity of natural language. Take the English word football for example. Football is the name given to a number of different team sports. Worldwide the most popular of these team sports is Association football, which also happens to be called soccer in several countries. The English language word football is also applied to Rugby football (Rugby union and rugby league), American football, Australian rules football, Gaelic football, and Canadian football. A search for football therefore will retrieve documents that are about several completely different sports. Controlled vocabulary solves this problem by tagging the documents in such a way that the ambiguities are eliminated.

Compared to free text searching, the use of a controlled vocabulary can dramatically increase the performance of an information retrieval system, if performance is measured by precision (the percentage of documents in the retrieval list that are actually relevant to the search topic).

In some cases controlled vocabulary can enhance recall as well, because unlike natural language schemes, once the correct authorized term is searched, you don't need to worry about searching for other terms that might be synonyms of that term.

However, a controlled vocabulary search may also lead to unsatisfactory recall, in that it will fail to retrieve some documents that are actually relevant to the search question.

This is particularly problematic when the search question involves terms that are sufficiently tangential to the subject area such that the indexer might have decided to tag it using a different term (but the searcher might consider the same). Essentially, this can be avoided only by an experienced user of controlled vocabulary whose understanding of the vocabulary coincides with the way it is used by the indexer.

Another possibility is that the article is just not tagged by the indexer because indexing exhaustivity is low. For example an article might mention football as a secondary focus, and the indexer might decide not to tag it with "football" because it is not important enough compared to the main focus. But it turns out that for the searcher that article is relevant and hence recall fails. A free text search would automatically pick up that article regardless.

On the other hand free text searches have high exhaustivity (you search on every word) so it has potential for high recall (assuming you solve the problems of synonyms by entering every combination) but will have much lower precision.

Controlled vocabularies are also quickly out-dated and in fast developing fields of knowledge, the authorized terms available might not be available if they are not updated regularly. Even in the best case scenario, controlled language is often not as specific as using the words of the text itself. Indexers trying to choose the appropriate index terms might misinterpret the author, while a free text search is in no danger of doing so, because it uses the author's own words.

The use of controlled vocabularies can be costly compared to free text searches because human experts or expensive automated systems are necessary to index each entry. Furthermore, the user has to be familiar with the controlled vocabulary scheme to make best use of the system. But as already mentioned, the control of synonyms, homographs can help increase precision.

Numerous methodologies have been developed to assist in the creation of controlled vocabularies, including faceted classification, which enables a given data record or document to be described in multiple ways.

Read more about this topic:  Controlled Vocabulary

Famous quotes containing the word languages:

    The less sophisticated of my forbears avoided foreigners at all costs, for the very good reason that, in their circles, speaking in tongues was commonly a prelude to snake handling. The more tolerant among us regarded foreign languages as a kind of speech impediment that could be overcome by willpower.
    Barbara Ehrenreich (b. 1941)