
Demystifying LSA, LSI, SVD, PCA, and IS Acronyms

By egarcia

If you are interested in learning what the LSA, LSI, SVD, PCA, and IS acronyms mean, this post is for you.

Back in 1988, Dumais, Furnas, Landauer, Deerwester and Harshman published the paper "Using latent semantic analysis to improve access to textual information". In that paper they proposed latent semantic indexing (LSI) as a new approach for dealing with the vocabulary problem in human-computer interaction. The same year, the group presented the paper "Improving information retrieval using Latent Semantic Indexing" before the annual meeting of ASIS.

The LSI acronym was used explicitly throughout these articles. Sometimes the authors used the LSA acronym in reference to "latent semantic analysis". This appears to be a matter of preference, since the titles of both papers refer to the same approach. Subsequent papers made reference to both acronyms (1 – 6). It can be said that LSI is the analysis of latent semantics in which a specific technique, SVD, is used. This latent semantics is a semantic structure of words to be unveiled or discovered.

The meaning of “latent” in LSI

At certain search marketing forums, some are speculating about the meaning of "latent" in LSI. This is an interesting term, especially since SVD uses eigenvalues, and these are also referred to as the latent roots of the characteristic equation of a matrix. So, what is so latent about LSI?
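
For readers unfamiliar with the older terminology, a brief refresher (standard linear algebra, not something specific to the papers cited here): the latent roots of a square matrix A are the roots of its characteristic equation, and the singular values of a matrix X, such as a term-by-document matrix, are the square roots of the latent roots of XᵀX:

    \det(A - \lambda I) = 0, \qquad \sigma_i(X) = \sqrt{\lambda_i\left(X^{\mathsf{T}} X\right)}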

This was addressed by the above authors long ago. In the section LATENT SEMANTIC INDEXING (LSI) of their 1988 work (1), they wrote (emphasis added):

“The “latent semantic indexing” (LSI) approach we propose tries to overcome the problems of word-based access by treating the observed word to text-object association data as an unreliable estimate of the true, larger pool of words that could have been associated with each object. We assume there is some underlying “latent” semantic structure in word usage data that is partially obscured by the variability of word choice. We use statistical techniques to estimate this latent structure and get rid of the obscuring “noise”. A description of terms, objects and user queries based on the underlying latent semantic structure (rather than surface level word choice) is used for representing and retrieving information.”

“The particular latent semantic indexing analysis that we have tried uses singular-value decomposition, a technique closely related to eigenvector decomposition and factor analysis”.

What exactly did they mean by this "semantic structure"? In the abstract of reference 3 they wrote:

“A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.”

And in footnote 1 they stated:

“By “semantic structure” we mean here only the correlation structure in the way in which individual words appear in documents; “semantic” implies only the fact that terms in a document may be taken as referents to the document itself or to its topic.”

Note that the latent semantics they referred to is the correlation structure discovered.

These structures are well known in cluster analysis. One can arrive at such structures using other techniques; SVD is just the technique they used in their LSI papers to arrive at specific correlation structures.
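
To make this concrete, here is a minimal sketch of the recipe on a toy term-by-document matrix. The terms and counts are invented for illustration; this is my example, not code from the papers:

    import numpy as np

    # Toy term-by-document count matrix (rows = terms, columns = documents).
    # The counts are invented purely for illustration.
    terms = ["car", "auto", "ship", "boat"]
    X = np.array([[2, 0, 1, 0],
                  [1, 0, 2, 0],
                  [0, 3, 0, 1],
                  [0, 1, 0, 2]], dtype=float)

    # Singular value decomposition: X = U * diag(s) * Vt.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # Keep only the k largest singular values: the rank-k LSI approximation.
    k = 2
    term_vectors = U[:, :k] * s[:k]

    # Cosine similarities between the reduced term vectors expose the
    # correlation structure: "car"/"auto" and "ship"/"boat" pair up even
    # though their raw co-occurrence counts are sparse.
    unit = term_vectors / np.linalg.norm(term_vectors, axis=1, keepdims=True)
    print(np.round(unit @ unit.T, 2))

With k = 2 the reduced vectors for "car" and "auto" (and for "ship" and "boat") come out essentially identical, which is exactly the kind of correlation structure being called "latent".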

LSI and Search Engine Marketers

Many search engine marketers do not really understand what LSI can and cannot do. Some have simply turned the thing into a blogonomy in an attempt to sell services.

Others just read the following and similar statements from the authors as a kind of magic spell, as if LSI had superpowers, overlooking the last portion of the sentence:

“Our job then, in building a retrieval system, is to find some way to predict what terms “really” are implied by a query or apply to a document (i.e. the “latent semantics“) on the basis of the fallible sample actually found there.”

Note that the key here is “on the basis of the fallible sample actually found there.”

If one visualizes the entire scheme as a "template", selectively and carefully replaces specific terms in the original matrix to be decomposed with new terms -keeping everything else intact- and repeats the entire SVD analysis, one should converge to the same overall correlation structure. This should be the case since one is merely dealing with numbers. The statement "on the basis of the fallible sample actually found there" is more than appropriate.
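
A minimal sketch of that thought experiment (the vocabularies and counts below are made up): attach two different sets of labels to the exact same matrix and note that the decomposition, which only ever sees the numbers, cannot tell them apart:

    import numpy as np

    # The same invented count matrix under two different vocabularies; the
    # second label set is a hypothetical term-for-term substitution, with
    # every number in the matrix left intact.
    labels_a = ["car", "auto", "ship", "boat"]
    labels_b = ["tiger", "lion", "rose", "tulip"]
    X = np.array([[2, 0, 1, 0],
                  [1, 0, 2, 0],
                  [0, 3, 0, 1],
                  [0, 1, 0, 2]], dtype=float)

    # SVD operates on the numbers alone; the labels never enter the
    # computation, so both vocabularies yield the same singular values
    # and the same reduced-space geometry.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_vectors = U[:, :2] * s[:2]
    unit = term_vectors / np.linalg.norm(term_vectors, axis=1, keepdims=True)
    sims = np.round(unit @ unit.T, 2)
    for labels in (labels_a, labels_b):
        print(labels)
        print(sims)

Whatever meaning the terms carry in the reader's head, the "latent semantics" recovered is only the correlation structure of the counts actually found in the matrix.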

Apply SVD to the wrong or doctored matrix and the "latent" structure you get can be the result of garbage in, garbage out.

Speaking of other techniques for analyzing term semantics and correlation structures, keep reading: the next section covers one of them.

Latent Semantic Indexing and Information Space

Latent semantic analysis/indexing should not be mistaken for Information Space (IS). At TREC-9, Gregory B. Newby, from the University of North Carolina, Chapel Hill, presented a nice recap of the two techniques, their similarities and differences. His paper, Information Space Based on HTML Structure, is a must-read for IR students and SEOs.

According to Newby,

“The main difference between LSI and IS is that LSI utilizes a singular value decomposition (SVD) on the term by document matrix, while IS utilizes principal components analysis (PCA) on the term by term matrix. In both LSI and IS, the distinguishing point from the vector space model (VSM) is that terms are not assumed to be mutually unrelated. The basic process is the same, however: document vectors are computed based on the vectors for terms they contain. A query vector is similarly computed, and the closest documents to the query are retrieved.”

“Although LSI and IS are comparable, and have a similar intellectual heritage in the mathematics of linear algebra, they actually operationalize a significantly different goal. With both LSI and IS, only k columns of the eigenvectors from the SVD or PCA process are used, rather than all N columns for each of the N terms. With LSI, all columns of the eigenvectors would in fact result in a vector space in which all terms are mutually orthogonal – in other words, the same fundamental model of the VSM. Thus, the k-dimensional vector space representing term relations in LSI is an approximation of an orthogonal term space. By reducing k, LSI attempts to account for assumed “errors” in the original term by document matrix.”

“With IS, all N columns of the eigenvectors would result in a vector space in which term relations are identically scaled to the numeric relations among terms in the original term by term input co-occurrence matrix. Thus, the k-dimensional vector space representing term relations in IS is an approximation of the relations among terms actually measured in the term by term matrix.”

Final Remarks

In addition to differences in what the k-dimensional vector space represents, note that LSI uses SVD while IS uses PCA, which involves analyzing a covariance matrix and computing its principal components (a contrast sketched below). I am working on a tutorial on PCA.
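
To make the contrast concrete, here is a minimal sketch of the two pipelines on the same toy data. The PCA step follows a generic covariance-eigendecomposition recipe and is my illustration, not a reconstruction of Newby's actual implementation:

    import numpy as np

    # Toy term-by-document matrix (terms x documents); counts are invented.
    X = np.array([[2, 0, 1, 0],
                  [1, 0, 2, 0],
                  [0, 3, 0, 1],
                  [0, 1, 0, 2]], dtype=float)
    k = 2  # retained dimensions in both pipelines

    # LSI-style: truncated SVD of the term-by-document matrix.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    lsi_terms = U[:, :k] * s[:k]

    # IS-style, as Newby describes it: PCA of the term-by-term matrix.
    C = X @ X.T                             # term-by-term co-occurrence counts
    cov = np.cov(C, rowvar=False)           # covariance across terms
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition of the covariance
    order = np.argsort(eigvals)[::-1]       # components by decreasing variance
    C_centered = C - C.mean(axis=0)
    is_terms = C_centered @ eigvecs[:, order[:k]]

    print(lsi_terms.shape, is_terms.shape)  # both are 4 terms x k dimensions

In both pipelines only k of the available columns are kept, which is the approximation the Newby passages above describe.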

Like many similarity measures, SVD and PCA are standard techniques used in statistical and clustering analysis; IR folks did not invent the techniques, as many want to believe. Essentially, they applied these as ancillary techniques to solve specific information retrieval problems.

For those interested in how IS was used in TREC-9, let me mention that the entry focused on analyzing co-occurrence matrices built from specific HTML tags, i.e. the TITLE, H1, H2, and H3 tags. This was done to reduce the size of the co-occurrence matrix and to facilitate early precision by matching the short TREC topic title, or title plus description statements, to terms in these tags.
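
A generic illustration of that idea (my sketch, not Newby's TREC-9 code; only the tag list matches the ones he names): index nothing but the TITLE, H1, H2, and H3 fields and count term co-occurrences within each field:

    import re
    from collections import Counter
    from itertools import combinations

    def tag_fields(html, tags=("title", "h1", "h2", "h3")):
        # Pull out the text of the selected fields only, ignoring the body.
        fields = []
        for tag in tags:
            fields += re.findall(rf"<{tag}[^>]*>(.*?)</{tag}>", html, re.I | re.S)
        return fields

    def cooccurrences(html):
        # Count, per field, how often each pair of terms appears together.
        counts = Counter()
        for field in tag_fields(html):
            terms = sorted(set(re.findall(r"[a-z]+", field.lower())))
            for pair in combinations(terms, 2):
                counts[pair] += 1
        return counts

    html = "<title>Information space</title><h1>Latent semantic structure</h1>"
    print(cooccurrences(html))

Restricting the vocabulary to these fields keeps the term-by-term matrix small, which is the reduction the TREC-9 entry was after.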

Related Posts

http://irthoughts./2007/05/05/pca-is-not-lsi/

http://irthoughts./2007/05/05/on-svd-and-pca-some-applications/

References

1. Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. Proceedings of the Conference on Human Factors in Computing Systems (CHI), 281-286.
2. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Beck, L. (1988). Improving information retrieval using Latent Semantic Indexing. Proceedings of the 1988 Annual Meeting of the American Society for Information Science.
3. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science (JASIS).
4. Dumais, S. T. (1989). Enhancing performance in LSI (Latent Semantic Indexing) retrieval. Bellcore Technical Memo.
5. Fischer, G., Foltz, P. W., Kintsch, W., Neiper-Lemke, H., & Stevens, C. Personal information systems and models of human memory.
6. Using Latent Semantic Indexing for Information Filtering.
7. Latent Semantic Analysis.

This is a legacy post, originally published on 7/10/2006.
