Component Processes of Natural Language Understanding

  • Automatic document classification (text categorization) –
    • Automatic language identification –
  • Compound term processing – category of techniques that identify compound terms and match them to their definitions. Compound terms are built by combining two (or more) simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a compound term.
  • Corpus processing –
    • Automatic acquisition of lexicon –
    • Text normalization –
    • Text simplification –
  • Deep linguistic processing –
  • Discourse analysis – includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no questions, content questions, statements, assertions, orders, suggestions, etc.).
  • Information extraction –
    • Text mining – process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning.
      • Biomedical text mining (also known as BioNLP) – text mining applied to texts and literature of the biomedical and molecular biology domain. It is a relatively recent research field that draws on natural language processing, bioinformatics, medical informatics and computational linguistics. Interest in text mining and information extraction strategies for the biomedical and molecular biology literature is growing because of the increasing number of electronically available publications stored in databases such as PubMed.
      • Decision tree learning –
      • Sentence extraction –
    • Terminology extraction –
  • Latent semantic indexing –
  • Lemmatisation –
  • Morphological segmentation – separates words into individual morphemes and identifies the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.
  • Named entity recognition (NER) – given a stream of text, determines which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.
  • Parsing – determines the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).
    • Shallow parsing –
  • Part-of-speech tagging – given a sentence, determines the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to it. Chinese is also prone to such ambiguity: it is a tonal language when spoken, and the tonal distinctions are not readily conveyed by the characters of its writing system.
  • Query expansion –
  • Relationship extraction – given a chunk of text, identifies the relationships among named entities (e.g. who is the wife of whom).
  • Sentence breaking (also known as sentence boundary disambiguation and sentence detection) – given a chunk of text, finds the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).
  • Speech segmentation – given a sound clip of a person or people speaking, separates it into words. A subtask of speech recognition and typically grouped with it.
  • Stemming –
  • Text chunking –
  • Tokenization –
  • Topic segmentation and recognition – given a chunk of text, separates it into segments each of which is devoted to a topic, and identifies the topic of the segment.
  • Truecasing –
  • Word segmentation – separates a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages, such as Chinese, Japanese and Thai, do not mark word boundaries in this way, and in those languages word segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
  • Word sense disambiguation – because many words have more than one meaning, word sense disambiguation is used to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
    • Automatic acquisition of sense-tagged corpora –
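
The sentence breaking item above notes that periods mark sentence boundaries but also abbreviations. A minimal sketch of that idea follows, assuming a small hand-written abbreviation list; the names `ABBREVIATIONS` and `split_sentences` are hypothetical and chosen for illustration, and a practical system would need a far larger list or a trained model.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "etc", "vs"}


def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace,
    except when a period ends a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Word immediately before the punctuation mark.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # whatever remains after the last boundary
    return sentences


print(split_sentences("Dr. Smith arrived late. He apologized."))
# ['Dr. Smith arrived late.', 'He apologized.']
```

The rule-based approach is easy to inspect but fails on abbreviations missing from the list and on abbreviations that genuinely end a sentence.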

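The word segmentation item above says that segmenting text without spaces requires knowledge of the vocabulary. One simple way to use a vocabulary is greedy longest-match segmentation, sketched below. This is an illustrative baseline and not a method named in the list; the function name `max_match` and the toy vocabulary are assumptions.

```python
def max_match(text, vocabulary):
    """Greedy left-to-right segmentation: at each position take the
    longest vocabulary entry that matches, else a single character."""
    max_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


# Toy vocabulary, assumed for illustration only.
vocab = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
print(max_match("我喜欢自然语言处理", vocab))
# ['我', '喜欢', '自然语言', '处理']
```

Because the choice at each position is greedy, the result depends entirely on vocabulary coverage, which is why the task is described as requiring lexical knowledge.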