Text Segmentation - Automatic Segmentation Approaches

Automatic segmentation is the natural language processing problem of getting a computer program to divide text into meaningful units, such as words, sentences, or topics.

When punctuation and similar cues are not consistently available, segmentation often requires non-trivial techniques such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text from specific domains and sources. For example, processing the text of medical records is a very different problem from processing news articles or real estate advertisements.
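As an illustration of the dictionary-based technique mentioned above, here is a minimal Python sketch of greedy forward maximum matching, one simple way to split text that has no spaces or punctuation between words. The function name max_match and the toy dictionary are assumptions made for this example; a practical system would use a much larger, domain-specific dictionary and usually some statistical scoring on top.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there.
    If no dictionary word matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(word) for word in dictionary), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Fallback: unknown character becomes its own token.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"the", "table", "down", "there", "them", "a"}

print(max_match("thetabledownthere", TOY_DICTIONARY))
# ['the', 'table', 'down', 'there']
```

Because the matcher is greedy, it can commit to a long word that makes the rest of the input unsegmentable, which is one reason statistical decision-making and syntactic or semantic constraints are often layered on top of a dictionary.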

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

  • Manually analyze the text and write custom software
  • Annotate the sample corpus with boundary information and use machine learning
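The two approaches can be contrasted with a small Python sketch for sentence boundaries. The first half is a hand-written rule with a hypothetical abbreviation list; the second half stands in for machine learning with a deliberately simple frequency count over annotated examples. All names here (ABBREVIATIONS, split_by_rule, train, is_boundary) and the sample annotations are assumptions for illustration, not part of any particular tool.

```python
import re
from collections import Counter

# Approach 1: custom software written after manually analyzing the text.
# Hypothetical abbreviation list; a real one would come from the domain corpus.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e."}


def split_by_rule(text):
    """Split after ., ! or ? followed by whitespace, unless the preceding
    token is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        last_token = text[start:match.end()].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # not a real sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


# Approach 2: learn from a corpus annotated with boundary information.
# Each example is (token ending in a period, True if a sentence ends there).
def train(examples):
    """Count how often each token was annotated as a boundary or not."""
    counts = {}
    for token, label in examples:
        counts.setdefault(token.lower(), Counter())[label] += 1
    return counts


def is_boundary(model, token, default=True):
    """Predict the majority label seen in training; unseen tokens get the default."""
    seen = model.get(token.lower())
    if not seen:
        return default
    return seen[True] >= seen[False]


print(split_by_rule("Dr. Smith arrived. He sat down."))
# ['Dr. Smith arrived.', 'He sat down.']

model = train([("Dr.", False), ("Dr.", False), ("arrived.", True)])
print(is_boundary(model, "Dr."))  # False
```

The rule-based version encodes the analyst's knowledge directly, while the counting version only knows what the annotations show it; a production system would typically replace the counter with a trained classifier over richer features.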

Some text segmentation systems take advantage of markup such as HTML, and of known document formats such as PDF, to provide additional evidence for sentence and paragraph boundaries.
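A minimal Python sketch of using markup as boundary evidence follows. It relies on the standard library's html.parser module and treats a small, assumed set of block-level tags as paragraph boundaries; the class name BlockSegmenter, the helper segment_html, and the tag set are choices made for this example.

```python
from html.parser import HTMLParser

# Assumed set of tags whose start or end marks a segment boundary.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "br"}


class BlockSegmenter(HTMLParser):
    """Collect text between block-level tags as separate segments."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Join buffered text and normalize runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <b> do not flush, so their text stays in place.
        self._buffer.append(data)


def segment_html(html):
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a block tag
    return parser.segments


print(segment_html("<p>First paragraph.</p><p>Second <b>one</b> here.</p>"))
# ['First paragraph.', 'Second one here.']
```

The resulting segments can then be passed to a sentence splitter, so that the markup narrows down where boundaries may fall before any statistical or rule-based decision is made.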
