Text Mining - History

History

Labor-intensive manual text mining approaches first surfaced in the mid-1980s,Template:Http://www.ppc.sas.upenn.edu/cave.htm but technological advances have enabled the field to advance during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%) is currently stored as text, text mining is believed to have a high commercial potential value. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.

The challenge of exploiting the large proportion of enterprise information that originates in "unstructured" form has been recognized for decades. It is recognized in the earliest definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:

"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."

Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in "unstructured" documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:

For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.

Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.

Read more about this topic:  Text Mining

Famous quotes containing the word history:

    Free from public debt, at peace with all the world, and with no complicated interests to consult in our intercourse with foreign powers, the present may be hailed as the epoch in our history the most favorable for the settlement of those principles in our domestic policy which shall be best calculated to give stability to our Republic and secure the blessings of freedom to our citizens.
    Andrew Jackson (1767–1845)

    The custard is setting; meanwhile
    I not only have my own history to worry about
    But am forced to fret over insufficient details related to large
    Unfinished concepts that can never bring themselves to the point
    Of being, with or without my help, if any were forthcoming.
    John Ashbery (b. 1927)

    Postmodernism is, almost by definition, a transitional cusp of social, cultural, economic and ideological history when modernism’s high-minded principles and preoccupations have ceased to function, but before they have been replaced with a totally new system of values. It represents a moment of suspension before the batteries are recharged for the new millennium, an acknowledgment that preceding the future is a strange and hybrid interregnum that might be called the last gasp of the past.
    Gilbert Adair, British author, critic. Sunday Times: Books (London, April 21, 1991)