Possible Sources of Noisy Text
- World Wide Web: Poorly written text is found in web pages, online chat, blogs, wikis, discussion forums and newsgroups. Most of this data is unstructured, and the style of writing is very different from, say, well-written news articles. Analysis of web data is important because it is a source for market buzz analysis, market review, trend estimation, etc. Also, because of the large volume of data, efficient methods are needed for information extraction, classification, automatic summarization and analysis.
- Contact centers: This is a general term for help desks, information lines and customer service centers operating in domains ranging from computer sales and support to mobile phones to apparel. On average, a person in the developed world interacts with a contact center agent at least once a week. A typical contact center agent handles over a hundred calls per day. Contact centers operate in various modes such as voice, online chat and email. The contact center industry produces gigabytes of data in the form of emails, chat logs, voice conversation transcriptions, customer feedback, etc. The bulk of contact center data is voice conversations. Transcribing these with state-of-the-art automatic speech recognition yields text with a word error rate of 30-40%. Further, even written modes of communication, such as online chat and email between customers and agents, tend to be noisy. Analysis of contact center data is essential for customer relationship management, customer satisfaction analysis, call modeling, customer profiling, agent profiling, etc., and it requires sophisticated techniques to handle poorly written text.
- Printed documents: Many libraries, government organizations and national defence organizations have vast repositories of hard copy documents. To retrieve and process their content, these documents must first be converted to text using Optical Character Recognition (OCR). In addition to printed text, they may also contain handwritten annotations. OCRed text can be highly noisy, depending on factors such as the font size and the quality of the print. Word error rates can range from 2-3% to as high as 50-60%. Handwritten annotations can be particularly hard to decipher, and error rates can be quite high in their presence.
- Short Messaging Service (SMS): Language usage in computer-mediated discourse, such as chats, emails and SMS texts, differs significantly from the standard form of the language. Two competing pressures shape the structure of this non-standard form, known as the texting language: the urge towards shorter messages, which allows faster typing, and the need for semantic clarity.
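
The word error rates quoted above for speech transcripts and OCR output are not defined in the text. A common definition, assumed here, is the word-level edit distance between a reference text and the noisy output (substitutions, deletions and insertions), divided by the number of words in the reference. The following is a minimal sketch of that calculation; the function name `word_error_rate` and the example sentences are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace, and matching is exact
    (case-sensitive, punctuation kept as part of the word).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a reference sentence and a noisy transcript of it.
reference = "please reset my account password today"
hypothesis = "please we set my account password"
print(word_error_rate(reference, hypothesis))  # 0.5
```

In this example the transcript needs one substitution, one insertion and one deletion against a six-word reference, giving a rate of 3/6. Because insertions are counted, the rate can exceed 100% when the noisy output is much longer than the reference.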