Text Segmentation - Automatic Segmentation Approaches

Automatic Segmentation Approaches

Automatic segmentation is the problem in natural language processing of implementing a computer process that segments text.

When punctuation and similar clues are not consistently available, the segmentation task often requires non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints. Effective natural language processing systems and text segmentation tools usually operate on text from specific domains and sources. For example, processing text from medical records is a very different problem from processing news articles or real estate advertisements.

The process of developing text segmentation tools starts with collecting a large corpus of text in an application domain. There are two general approaches:

  • Manually analyzing the text and writing custom software
  • Annotating the sample corpus with boundary information and applying machine learning
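
The first approach, hand-written rules, can be sketched as follows. This is only a minimal illustration for sentence boundaries in English text, not a reference implementation: the abbreviation list, the function name split_sentences, and the boundary rule (terminal punctuation followed by whitespace and a capital letter) are all assumptions made for the example. In practice such rules would be derived from manual analysis of the domain corpus.

```python
import re

# Hypothetical, hand-picked abbreviation list (an assumption for this sketch);
# a real list would come from manual analysis of the domain corpus.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "vs."}


def split_sentences(text):
    """Split text into sentences using a simple hand-written rule.

    A boundary is proposed wherever '.', '!' or '?' is followed by
    whitespace and an uppercase letter. The proposal is rejected when
    the token ending at that punctuation is a known abbreviation.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        # Candidate sentence runs up to and including the punctuation mark.
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith": not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever remains after the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith saw the patient. The patient was stable."):
        print(sentence)
```

A rule set like this is easy to inspect but tends to be brittle outside the domain it was written for, which is one reason the second approach trains a model on a corpus annotated with boundary information instead.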

Some text segmentation systems take advantage of markup such as HTML and of known document formats such as PDF to provide additional evidence for sentence and paragraph boundaries.
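
As a rough illustration of markup used as boundary evidence, the sketch below treats certain HTML tags as paragraph boundaries, using Python's standard html.parser module. The set of tags in BLOCK_TAGS, the class name ParagraphExtractor, and the helper paragraphs_from_html are assumptions chosen for the example; whether a tag such as br should count as a boundary depends on the documents being processed.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as paragraph boundaries in this sketch.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "br"}


class ParagraphExtractor(HTMLParser):
    """Collect text between block-level tags as separate paragraphs."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Join buffered text, normalise whitespace, and keep it if non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <b> do not flush, so their text stays in place.
        self._buffer.append(data)


def paragraphs_from_html(html):
    """Return the list of paragraph strings found in an HTML fragment."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text after the last block tag
    return parser.paragraphs


if __name__ == "__main__":
    sample = "<p>First paragraph</p><p>A <b>second</b> one</p>"
    for paragraph in paragraphs_from_html(sample):
        print(paragraph)
```

The paragraphs recovered this way could then be passed to a sentence-level segmenter, so that markup and textual clues are combined.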
