Part-of-speech Tagging - Issues

While there is broad agreement about basic categories, a number of edge cases make it difficult to settle on a single "correct" set of tags, even in a single language such as English. For example, it is hard to say whether "fire" is functioning as an adjective or a noun in

the big green fire truck

A second important example is the use/mention distinction, as in the following example, where "blue" is clearly not functioning as an adjective (the Brown Corpus tag set appends the suffix "-NC" in such cases):

the word "blue" has 4 letters.

Words in a language other than that of the "main" text are commonly tagged as "foreign", usually in addition to a tag for the role the foreign word is actually playing in context.

There are also many cases where POS categories and "words" do not map one to one, for example:

David's
gonna
don't
vice versa
first-cut
cannot
pre- and post-secondary
look (a word) up

In the last example, "look" and "up" arguably function as a single verbal unit, despite the possibility of other words coming between them. Some tag sets (such as Penn) break hyphenated words, contractions, and possessives into separate tokens, thus avoiding some but far from all such problems.
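The splitting of contractions and possessives described above can be sketched as follows. This is a minimal illustration and not the Penn Treebank tokenizer itself: the names `split_token` and `tokenize`, the clitic list, and the special-case table (including the split of "gonna" into "gon" + "na") are assumptions made for this example.

```python
import re

# Hypothetical special cases: single written words that are split into two tokens.
_SPECIAL = {"cannot": ("can", "not"), "gonna": ("gon", "na")}

# Hypothetical clitic list: a non-empty stem followed by a contraction or possessive.
_CLITIC = re.compile(r"^(?P<stem>.+?)(?P<clitic>n't|'s|'re|'ve|'ll|'d|'m)$",
                     re.IGNORECASE)


def split_token(token):
    """Split one whitespace-delimited token into one or two sub-tokens."""
    parts = _SPECIAL.get(token.lower())
    if parts is not None:
        cut = len(parts[0])
        # Slice the original token so that its casing is preserved.
        return [token[:cut], token[cut:]]
    match = _CLITIC.match(token)
    if match:
        return [match.group("stem"), match.group("clitic")]
    return [token]


def tokenize(text):
    """Whitespace-split the text, then split clitics off each token."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_token(raw))
    return tokens


print(tokenize("David's gonna look it up"))
```

Note that this does nothing for "vice versa" or "look (a word) up", where one unit spans several written words; splitting tokens only addresses the opposite mismatch.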

It is unclear whether it is best to treat words such as "be", "have", and "do" as categories in their own right (as in the Brown Corpus), or as simply verbs (as in the LOB Corpus and the Penn Treebank). "be" has more forms than other English verbs, and occurs in quite different grammatical contexts, complicating the issue.

The most popular tag set for POS tagging of American English is probably the Penn tag set, developed in the Penn Treebank project. It is largely similar to the earlier Brown Corpus and LOB Corpus tag sets, though much smaller. In Europe, tag sets from the EAGLES Guidelines see wide use, and include versions for multiple languages.

POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions (this makes the tag sets for heavily inflected languages such as Greek and Latin very large, and makes tagging words in agglutinative languages such as Inuit virtually impossible). However, S. Petrov, D. Das, and R. McDonald ("A Universal Part-of-Speech Tagset", http://arxiv.org/abs/1104.2086) have proposed a "universal" tag set with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition, etc.). Whether a very small set of very broad tags or a much larger set of more precise ones is preferable depends on the purpose at hand. Automatic tagging is easier with smaller tag sets.
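The trade-off between fine-grained and broad tags can be illustrated by collapsing Penn-style tags into coarse categories. The table below is a partial, illustrative mapping written for this example; it is an assumption and should not be read as the mapping published with the universal tag set. The names `PENN_TO_COARSE` and `to_coarse` are hypothetical.

```python
# Partial, illustrative mapping from fine-grained Penn-style tags to coarse
# categories. Tags not listed here fall back to the catch-all category "X".
PENN_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "PRP$": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM", "CC": "CONJ",
    "TO": "PRT",  # a single tag, whether "to" marks an infinitive or not
    ".": ".", ",": ".", ":": ".",
}


def to_coarse(tag):
    """Return the coarse category for a fine-grained tag, or "X" if unknown."""
    return PENN_TO_COARSE.get(tag, "X")


tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), (".", ".")]
print([(word, to_coarse(tag)) for word, tag in tagged])
```

Collapsing discards information (here, number on the noun and tense on the verb), which is the price paid for a smaller and easier tagging problem.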

A different issue is that some cases are in fact ambiguous. Beatrice Santorini gives examples in "Part-of-speech Tagging Guidelines for the Penn Treebank Project" (3rd rev., June 1990), including the following case (p. 32), in which "entertaining" can function either as an adjective or as a verb, and there is no evident way to decide:

The Duchess was entertaining last night.
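One way to make such cases explicit, rather than forcing a single answer, is to let each word carry a set of candidate tags. The sketch below assumes a toy lexicon invented for this sentence; `LEXICON`, `candidate_tags`, and `ambiguous_words` are hypothetical names, and the tags are Penn-style labels (JJ for adjective, VBG for gerund or present participle).

```python
# Toy lexicon invented for this example: each word maps to the set of tags
# it could plausibly carry in the sentence.
LEXICON = {
    "the": {"DT"},
    "duchess": {"NNP"},
    "was": {"VBD"},
    "entertaining": {"JJ", "VBG"},  # adjective or verb: context does not decide
    "last": {"JJ"},
    "night": {"NN"},
}


def candidate_tags(word):
    """Return the sorted candidate tags for a word (empty list if unknown)."""
    return sorted(LEXICON.get(word.lower(), set()))


def ambiguous_words(sentence):
    """Return the words in the sentence that have more than one candidate tag."""
    words = sentence.rstrip(".").split()
    return [w for w in words if len(candidate_tags(w)) > 1]


print(ambiguous_words("The Duchess was entertaining last night."))
```

A tagger built on such a representation could report both readings for "entertaining" instead of committing arbitrarily to one.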
