Text Segmentation - Automatic Segmentation Approaches

Automatic segmentation is the natural language processing problem of having a computer divide text into meaningful units, such as words, sentences, or topics.

When punctuation and similar cues are not consistently available, segmentation often requires non-trivial techniques such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text from specific domains and sources. For example, processing the text of medical records is a very different problem from processing news articles or real estate advertisements.
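As an illustration of the dictionary-based end of this spectrum, the following is a minimal sketch of greedy forward maximum matching, one simple technique for splitting text that has no delimiters between words. The function name `max_match` and the toy dictionary are assumptions made for this example; a practical system would load a large, domain-specific lexicon.

```python
def max_match(text, dictionary, max_word_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    if max_word_len is None:
        # Longest entry in the dictionary, never less than 1 so the loop advances.
        max_word_len = max(1, max((len(w) for w in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Toy dictionary (hypothetical); real lexicons are far larger.
toy_dictionary = {"the", "table", "down", "there", "them"}
print(max_match("thetabledownthere", toy_dictionary))
# ['the', 'table', 'down', 'there']
```

Because the match is greedy, it can commit to a long word that leaves an unsegmentable remainder. This is one reason dictionary lookup is usually combined with statistical decision-making or syntactic and semantic constraints.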

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

  • Manually analyze the text and write custom software
  • Annotate the sample corpus with boundary information and use machine learning
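The second approach can be sketched as follows: an annotated sample is turned into labelled examples that a classifier could learn from. The `|` boundary marker, the function name `training_examples`, and the chosen features are all assumptions made for this illustration, not a standard annotation scheme.

```python
BOUNDARY = "|"  # hypothetical marker an annotator places right after a true sentence end


def training_examples(annotated):
    """Return (features, label) pairs, one per token that ends in a period."""
    tokens = annotated.split()
    examples = []
    for i, token in enumerate(tokens):
        word = token.rstrip(BOUNDARY)
        if not word.endswith("."):
            continue  # only periods are treated as candidate boundaries here
        next_word = tokens[i + 1].rstrip(BOUNDARY) if i + 1 < len(tokens) else ""
        features = {
            "word": word.lower(),
            "word_len": len(word),
            "next_capitalized": next_word[:1].isupper(),
        }
        label = token.endswith(BOUNDARY)  # True if the annotator marked a boundary
        examples.append((features, label))
    return examples


sample = "Dr. Smith arrived.| He left at 5 p.m. yesterday.|"
for features, label in training_examples(sample):
    print(label, features)
```

In the sample, "Dr." and "arrived." are both followed by a capitalized word, yet only the second is a sentence boundary. Ambiguous cases like this are why annotated data and a learned model are used in place of a single fixed rule.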

Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
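As a rough sketch of how markup can supply boundary evidence, the example below uses Python's standard `html.parser` module to treat block-level HTML tags as segment boundaries. The `BLOCK_TAGS` set, the class name `BlockSplitter`, and the helper `html_segments` are assumptions chosen for illustration; which tags count as boundaries would depend on the documents being processed.

```python
from html.parser import HTMLParser

# Assumed set of tags that signal a paragraph-like boundary.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "br"}


class BlockSplitter(HTMLParser):
    """Collects text and cuts a new segment at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, normalise whitespace, and keep non-empty segments.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def html_segments(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block tag
    return parser.segments


print(html_segments("<h1>Title</h1><p>First para.<br>Second line</p><p>Third</p>"))
# ['Title', 'First para.', 'Second line', 'Third']
```

Inline tags such as `<b>` are deliberately ignored, so emphasized words stay inside their surrounding segment.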
