Part-of-speech Tagging - Principle

Principle

Part-of-speech tagging is harder than looking words up in a list of words and their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. This is not rare: in natural languages (as opposed to many artificial languages), a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb:

The sailor dogs the barmaid.

Grammatical tagging will mark "dogs" here as a verb rather than the more common plural noun, because one of the words must be the main verb and the noun reading is less likely after "sailor" (sailor !→ dogs). Semantic analysis can then infer that "sailor" and "barmaid" place "dogs" 1) in a nautical context (sailor→←barmaid) and 2) as an action applied to the object "barmaid" (dogs→barmaid). In this context, "dogs" is a nautical term meaning "fastens (a watertight door) securely; applies a dog to".
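The "one of the words must be the main verb" argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular tagger works: the tiny lexicon, the Brown-style tag names (AT, NN, NNS, VBZ), and the helper names are all chosen for the example.

```python
from itertools import product

# Toy lexicon (an assumption for illustration): each word form maps to the
# tags it may carry. AT = article, NN = singular noun, NNS = plural noun,
# VBZ = third-person singular verb.
LEXICON = {
    "the": ["AT"],
    "sailor": ["NN"],
    "dogs": ["NNS", "VBZ"],
    "barmaid": ["NN"],
}

def candidate_taggings(words):
    """Return every tag sequence the lexicon allows for the words."""
    options = [LEXICON[w.lower()] for w in words]
    return [list(tags) for tags in product(*options)]

def has_main_verb(tags):
    """Crude sentence-level constraint: at least one tag is a verb tag."""
    return any(t.startswith("VB") for t in tags)

def tag(words):
    """Keep only the candidate taggings that contain a verb."""
    return [t for t in candidate_taggings(words) if has_main_verb(t)]

sentence = "The sailor dogs the barmaid".split()
print(tag(sentence))  # [['AT', 'NN', 'VBZ', 'AT', 'NN']]
```

The lexicon alone leaves two candidate taggings for the sentence; the verb constraint removes the one that reads "dogs" as a plural noun. Real taggers replace this hard constraint with contextual rules or probabilities.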

"Dogged", on the other hand, can be either an adjective or a past-tense verb. Which parts of speech a given word can represent varies greatly from word to word.
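One way to picture this is a lexicon that maps each word form to the set of tags it may take. The sketch below is a toy: the four entries and the Brown-style tag names (JJ for adjective, VBD for past-tense verb) are assumptions for illustration, and the resulting rate says nothing about real English.

```python
# Toy lexicon (hypothetical entries): word form -> set of possible tags.
LEXICON = {
    "the": {"AT"},
    "sailor": {"NN"},
    "dogs": {"NNS", "VBZ"},
    "dogged": {"JJ", "VBD"},
}

def ambiguous_forms(lexicon):
    """Return, in sorted order, the word forms with more than one tag."""
    return sorted(word for word, tags in lexicon.items() if len(tags) > 1)

def ambiguity_rate(lexicon):
    """Return the fraction of word forms that are ambiguous."""
    if not lexicon:
        return 0.0
    return len(ambiguous_forms(lexicon)) / len(lexicon)

print(ambiguous_forms(LEXICON))  # ['dogged', 'dogs']
print(ambiguity_rate(LEXICON))   # 0.5
```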

Trained linguists can identify the grammatical parts of speech to various fine degrees depending on the tagging system. Schools commonly teach that there are nine parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, plural, possessive, and singular forms can be distinguished. In many languages words are also marked for their "case" (role as subject, object, etc.), grammatical gender, and so on, while verbs are marked for tense, aspect, and other features.

In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. For example, the POS tags used in the Brown Corpus include NN for singular common nouns, NNS for plural common nouns, and NP for singular proper nouns. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech, and found that about as many words were ambiguous there as in English. For morphologically rich languages, a morphosyntactic descriptor can be written as a compact string such as Ncmsan, which means Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no.
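A descriptor like Ncmsan is positional: the first character names the category and each later character fills one attribute slot. A minimal Python sketch of decoding it follows. The table covers only the codes given in the example above; the function name and table layout are assumptions, and a real tagset defines many more categories and values.

```python
# Positional decoding table. Only the values that appear in the example
# "Ncmsan" are listed; everything else about the layout is an assumption.
NOUN_POSITIONS = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {"a": "accusative"}),
    ("Animate", {"n": "no"}),
]
CATEGORIES = {"N": ("Noun", NOUN_POSITIONS)}

def decode_msd(msd):
    """Expand a positional descriptor into attribute/value pairs."""
    if not msd:
        raise ValueError("empty descriptor")
    category, positions = CATEGORIES[msd[0]]
    features = {"Category": category}
    for code, (attribute, values) in zip(msd[1:], positions):
        # Codes missing from the table are kept verbatim rather than guessed.
        features[attribute] = values.get(code, code)
    return features

print(decode_msd("Ncmsan"))
```

Packing the attributes into fixed positions keeps each tag short while still letting a program recover every feature, which matters when a tagset has over a thousand distinct tags.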
