Noisy Text Analytics - Possible Sources of Noisy Text

  • World Wide Web: Poorly written text is found in web pages, online chat, blogs, wikis, discussion forums and newsgroups. Most of this data is unstructured, and the style of writing is very different from that of, say, well-written news articles. Analysis of web data is important because it is a source for market buzz analysis, market review, trend estimation, etc. Also, because of the large amount of data, efficient methods are needed for information extraction, classification, automatic summarization and analysis.
  • Contact centers: This is a general term for help desks, information lines and customer service centers operating in domains ranging from computer sales and support to mobile phones to apparel. On average, a person in the developed world interacts with a contact center agent at least once a week, and a typical agent handles over a hundred calls per day. Contact centers operate in various modes such as voice, online chat and email. The industry produces gigabytes of data in the form of emails, chat logs, voice conversation transcriptions, customer feedback, etc. The bulk of contact center data is voice conversations. Transcribing these with state-of-the-art automatic speech recognition yields text with a 30-40% word error rate. Even written modes of communication, such as online chat and email exchanges between customers and agents, tend to be noisy. Analysis of contact center data is essential for customer relationship management, customer satisfaction analysis, call modeling, customer profiling, agent profiling, etc., and it requires sophisticated techniques to handle poorly written text.
  • Printed documents: Many libraries, government organizations and national defence organizations have vast repositories of hard copy documents. To retrieve and process their content, these documents must first be converted to text using Optical Character Recognition (OCR). In addition to printed text, they may also contain handwritten annotations. OCRed text can be highly noisy depending on the font size, the quality of the print, etc. Word error rates can range from 2-3% to as high as 50-60%. Handwritten annotations can be particularly hard to decipher, and error rates can be quite high in their presence.
  • Short Messaging Service (SMS): Language usage in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language. Two competing pressures shape the structure of this non-standard form, known as the texting language: an urge towards shorter messages, which allow faster typing, and the need for semantic clarity.
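
The word error rates quoted above for speech transcripts and OCRed text are conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a noisy output and a clean reference, divided by the number of words in the reference. The following is a minimal sketch of that calculation, assuming simple whitespace tokenization; the function name `word_error_rate` and the example sentences are illustrative and do not come from any particular toolkit or data set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace; no case folding or
    punctuation stripping is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical clean reference and noisy transcript.
reference = "please reset my account password today"
noisy = "please reset my count password to day"
print(word_error_rate(reference, noisy))  # 0.5
```

Here one substitution ("account" to "count") plus one substitution and one insertion ("today" to "to day") give three errors against six reference words. Because insertions are counted, the rate can exceed 100% when the noisy text is much longer than the reference.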
