Latent Semantic Indexing - Querying and Augmenting LSI Vector Spaces

The computed $T_k$ and $D_k$ matrices define the term and document vector spaces, which, together with the computed singular values $S_k$, embody the conceptual information derived from the document collection. The similarity of terms or documents within these spaces is a function of how close they are to each other, typically computed from the angle between the corresponding vectors.
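As an illustration of an angle-based similarity, the following minimal sketch computes the cosine of the angle between two vectors. The three document vectors are hypothetical values made up for this example, not output from a real index, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional document vectors (rows of D_k), for illustration only.
doc_a = [0.8, 0.1]
doc_b = [0.4, 0.05]   # same direction as doc_a, half the length
doc_c = [-0.1, 0.9]   # points in a very different direction

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # close to 0: nearly orthogonal
```

Because the cosine depends only on direction, doc_a and doc_b are treated as maximally similar even though their lengths differ.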

The same steps are used to locate the vectors representing the text of queries and new documents within the document space of an existing LSI index. By a simple transformation of the equation $A = T S D^T$ into the equivalent $D = A^T T S^{-1}$, a new vector, $d$, for a query or for a new document can be created by computing a new column in $A$ and then multiplying the new column by $T S^{-1}$. The new column in $A$ is computed using the originally derived global term weights and applying the same local weighting function to the terms in the query or in the new document.
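The transformation above can be sketched with NumPy. This is a minimal sketch under stated assumptions: the matrix A, the rank k = 2, and the query column are all hypothetical, and the values in A are assumed to be already weighted. The helper name fold_in is introduced only for this example.

```python
import numpy as np

# Hypothetical weighted term-document matrix A (5 terms x 4 documents).
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
])

k = 2  # number of retained dimensions (an arbitrary choice for this sketch)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
T_k = U[:, :k]       # term vectors, one row per term
s_k = s[:k]          # retained singular values (the diagonal of S_k)
D_k = Vt[:k, :].T    # document vectors, one row per document

def fold_in(a, T_k, s_k):
    """Compute d = a^T T_k S_k^{-1} for a weighted term column a."""
    # Dividing by s_k scales each coordinate by the inverse singular value,
    # which is the same as right-multiplying by the diagonal matrix S_k^{-1}.
    return (a @ T_k) / s_k

# Sanity check: folding in an original column of A reproduces its row of D_k.
print(np.allclose(fold_in(A[:, 0], T_k, s_k), D_k[0]))

# Hypothetical query using the first two terms of the vocabulary.
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
d_q = fold_in(q, T_k, s_k)

# Rank the indexed documents by cosine similarity to the query vector.
sims = (D_k @ d_q) / (np.linalg.norm(D_k, axis=1) * np.linalg.norm(d_q))
print(np.argsort(-sims))
```

The sanity check works because the columns of the original matrix satisfy the same equation the new column is plugged into, so a query is placed in the space exactly as if it had been one of the indexed documents.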

A drawback to computing vectors in this way, when adding new searchable documents, is that terms that were not known during the SVD phase for the original index are ignored. These terms will have no impact on the global weights and learned correlations derived from the original collection of text. However, the computed vectors for the new text are still very relevant for similarity comparisons with all other document vectors.
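The effect on unknown terms can be seen when the new column is built. The sketch below is illustrative only: the vocabulary, the global weights, and the log(1 + tf) local weighting are assumptions made for the example, and a real index would reuse whatever weighting functions were applied when it was built.

```python
import math
from collections import Counter

# Hypothetical vocabulary (term -> row in A) and global weights saved
# from the original index; the values are made up for illustration.
vocabulary = {"matrix": 0, "vector": 1, "query": 2, "index": 3}
global_weights = [0.7, 0.5, 0.9, 0.4]

def weighted_column(tokens, vocabulary, global_weights):
    """Build the new column of A for a tokenized query or document.

    Returns the column and the sorted list of terms that were ignored
    because they were not in the vocabulary at SVD time.
    """
    counts = Counter(tokens)
    column = [0.0] * len(vocabulary)
    ignored = []
    for term, tf in counts.items():
        row = vocabulary.get(term)
        if row is None:
            ignored.append(term)  # unknown term: contributes nothing
            continue
        # Assumed local weighting: log(1 + term frequency).
        column[row] = math.log(1 + tf) * global_weights[row]
    return column, sorted(ignored)

tokens = ["query", "vector", "vector", "embedding"]
column, ignored = weighted_column(tokens, vocabulary, global_weights)
print(column)
print(ignored)
```

Here "embedding" has no row in the original matrix, so it is dropped before the vector is computed and cannot influence where the new text lands in the space.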

The process of augmenting the document vector space of an LSI index with new documents in this manner is called folding in. Although the folding-in process does not account for the new semantic content of the new text, adding a substantial number of documents in this way will still provide good results for queries as long as the terms and concepts they contain are well represented within the LSI index to which they are being added. When the terms and concepts of a new set of documents need to be included in an LSI index, either the term-document matrix and its SVD must be recomputed, or an incremental update method must be used.
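Folding in a batch of documents amounts to appending new rows to the document matrix while leaving the term vectors and singular values untouched. The following is a minimal NumPy sketch; the small matrix A, the rank k = 2, the new columns, and the helper name fold_in_documents are all hypothetical.

```python
import numpy as np

def fold_in_documents(D_k, T_k, s_k, new_columns):
    """Append folded-in vectors for new documents to D_k.

    new_columns holds one weighted term column per new document
    (shape: terms x new documents). T_k and s_k are not modified,
    so the term space does not learn anything from the new text.
    """
    new_rows = (new_columns.T @ T_k) / s_k
    return np.vstack([D_k, new_rows])

# Hypothetical original index: 4 terms x 3 documents, truncated to k = 2.
A = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
T_k, s_k, D_k = U[:, :k], s[:k], Vt[:k, :].T

# Two hypothetical new documents expressed over the same 4 terms.
new_docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
])
D_aug = fold_in_documents(D_k, T_k, s_k, new_docs)
print(D_k.shape, "->", D_aug.shape)
```

The existing rows of D_k are unchanged by this operation, which is why folding in is cheap and also why it cannot capture concepts that appear only in the new documents.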
