
PCA Is Not LSI

By egarcia (IR Thoughts)

The fact that singular value decomposition (SVD) is used in both principal component analysis (PCA) and latent semantic indexing (LSI) has led some (even some “johnny-come-lately-to-IR” assistant professors) to think that PCA is LSI.

The main difference between the two is that in LSI one applies SVD to a term-document matrix, while in PCA one applies SVD to a covariance matrix. So, the starting matrix, A, is different in each case. Consequently, LSI and PCA compute different things.
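To make the contrast concrete, here is a minimal numpy sketch; the toy term-document counts below are made-up values, not from any real corpus. LSI decomposes the raw term-document matrix directly, whereas PCA decomposes a covariance matrix built from mean-centered data.

```python
import numpy as np

# Toy term-document matrix A: rows are terms, columns are documents.
# The counts are made up purely for illustration.
A = np.array([
    [2., 0., 1.],
    [0., 3., 1.],
    [1., 1., 0.],
    [0., 2., 2.],
])

# LSI: SVD of the term-document matrix itself.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_coords = Vt.T        # columns of V = document coordinates in latent space

# PCA: decomposition of a covariance matrix, which is built from
# mean-centered data (each term dimension centered across documents).
A_centered = A - A.mean(axis=1, keepdims=True)
C = A_centered @ A_centered.T / (A.shape[1] - 1)   # term-term covariance
eigvals, eigvecs = np.linalg.eigh(C)               # eigenvalues ascending
principal_components = eigvecs[:, ::-1]            # reorder to descending
```

In the LSI branch, the rows of doc_coords place each document in the latent space; in the PCA branch, the columns of principal_components are the directions of maximal variance.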

Now, when one looks at the columns of the V matrix (the right singular vectors of A, i.e., the eigenvectors of A^T A), in LSI these are document vector coordinates, while in PCA these are the principal components.

I remember, earlier this year, the presentation given by the eminent Professor Michael Trosset at the IPAM Document Space Workshop (UCLA), “Trading Spaces: Measures of Document Proximity and Methods for Embedding Them.” He put the difference between PCA and LSI as follows: LSI finds the best linear subspace, while PCA finds the best affine linear subspace. To find the best affine linear subspace, first translate the feature vectors so that their centroid lies at the origin, then find the best linear subspace.
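Here is a short numpy sketch of that distinction; the random data and the centroid offset are illustrative assumptions, not from the talk. The best rank-k linear fit comes from a truncated SVD of the data as-is, while the best affine fit translates the centroid to the origin first and adds it back afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 50 feature vectors whose centroid sits away from the origin.
X = rng.normal(size=(50, 4)) + np.array([5., -3., 2., 0.])
k = 2

# Best linear subspace (LSI-style): truncated SVD of the data as-is.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_linear = (U[:, :k] * s[:k]) @ Vt[:k]

# Best affine linear subspace (PCA-style): translate the centroid to the
# origin, fit the linear subspace there, then translate back.
mu = X.mean(axis=0)
Uc, sc, Vct = np.linalg.svd(X - mu, full_matrices=False)
X_affine = (Uc[:, :k] * sc[:k]) @ Vct[:k] + mu

print(np.linalg.norm(X - X_linear))  # error of the best linear fit
print(np.linalg.norm(X - X_affine))  # error of the best affine fit (smaller here)
```

Because every linear subspace is also an affine one (with zero offset), the affine fit's reconstruction error can never exceed the linear fit's, and it is noticeably smaller whenever the centroid is far from the origin.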

In the middle of his presentation, I raised my hand and asked Prof. Trosset why we need to do this translation. He explained that it is done to convert cosine similarities into Pearson product-moment correlation coefficients.
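The point is easy to check numerically. A minimal sketch, using two made-up vectors: Pearson's product-moment correlation is exactly the cosine similarity computed after mean-centering, which is why PCA's translation step turns one measure into the other.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up vectors for illustration.
x = np.array([2., 4., 6., 9.])
y = np.array([1., 3., 5., 6.])

# Pearson's r is the cosine similarity of the mean-centered vectors.
r_centered_cosine = cosine(x - x.mean(), y - y.mean())
r_reference = np.corrcoef(x, y)[0, 1]
print(r_centered_cosine, r_reference)  # the two values agree
```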

Case closed.

This is a legacy post, originally published on 11/8/2006.
