Stemming Algorithm - Language Challenges

Language Challenges

While much of the early academic work in this area was focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.

Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language becomes more complex. For example, an Italian stemmer is more complex than an English one (because of a greater number of verb inflections), a Russian one is more complex (more noun declensions), a Hebrew one is even more complex (due to non-catenative morphology, a writing system without vowels, and the requirement of prefix stripping: Hebrew stems can be two, three or four characters, but not more), and so on.

Read more about this topic:  Stemming Algorithm

Famous quotes containing the words language and/or challenges:

    A president, however, must stand somewhat apart, as all great presidents have known instinctively. Then the language which has the power to survive its own utterance is the most likely to move those to whom it is immediately spoken.
    J.R. Pole (b. 1922)

    The approval of the public is to be avoided like the plague. It is absolutely essential to keep the public from entering if one wishes to avoid confusion. I must add that the public must be kept panting in expectation at the gate by a system of challenges and provocations.
    André Breton (1896–1966)