
PCA Is Not LSI

By egarcia (IR Thoughts)

The fact that singular value decomposition (SVD) is used in both principal component analysis (PCA) and latent semantic indexing (LSI) has led some (even some “johnny-come-lately-to-IR” assistant professors) to think that PCA is LSI.

The main difference between the two is that in LSI one applies SVD to a term-document matrix, while in PCA one applies SVD to a covariance matrix. So, the starting matrix, A, is different in each case. Consequently, LSI and PCA compute different things.
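To make the contrast concrete, here is a minimal numpy sketch; the toy term-document counts below are made-up values, not from any real corpus. LSI decomposes the raw term-document matrix directly, whereas PCA decomposes a covariance matrix built from mean-centered data.

```python
import numpy as np

# Toy term-document matrix A: rows are terms, columns are documents.
# The counts are made up purely for illustration.
A = np.array([
    [2., 0., 1.],
    [0., 3., 1.],
    [1., 1., 0.],
    [0., 2., 2.],
])

# LSI: SVD of the term-document matrix itself.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_coords = Vt.T        # columns of V = document coordinates in latent space

# PCA: decomposition of a covariance matrix, which is built from
# mean-centered data (each term dimension centered across documents).
A_centered = A - A.mean(axis=1, keepdims=True)
C = A_centered @ A_centered.T / (A.shape[1] - 1)   # term-term covariance
eigvals, eigvecs = np.linalg.eigh(C)               # eigenvalues ascending
principal_components = eigvecs[:, ::-1]            # reorder to descending
```

In the LSI branch, the rows of doc_coords place each document in the latent space; in the PCA branch, the columns of principal_components are the directions of maximal variance.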

Now, when one looks at the columns of the V matrix (the right singular vectors of A, i.e., the eigenvectors of A^T A), in LSI these are document vector coordinates, while in PCA these are the principal components.

I remember, earlier this year, the presentation given by the eminent Professor Michael Trosset at the IPAM Document Space Workshop (UCLA), “Trading Spaces: Measures of Document Proximity and Methods for Embedding Them.” He put the difference between PCA and LSI as follows: LSI finds the best linear subspace, while PCA finds the best affine linear subspace. To find the best affine linear subspace, first translate the feature vectors so that their centroid lies at the origin, then find the best linear subspace.
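Here is a short numpy sketch of that distinction; the random data and the centroid offset are illustrative assumptions, not from the talk. The best rank-k linear fit comes from a truncated SVD of the data as-is, while the best affine fit translates the centroid to the origin first and adds it back afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 50 feature vectors whose centroid sits away from the origin.
X = rng.normal(size=(50, 4)) + np.array([5., -3., 2., 0.])
k = 2

# Best linear subspace (LSI-style): truncated SVD of the data as-is.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_linear = (U[:, :k] * s[:k]) @ Vt[:k]

# Best affine linear subspace (PCA-style): translate the centroid to the
# origin, fit the linear subspace there, then translate back.
mu = X.mean(axis=0)
Uc, sc, Vct = np.linalg.svd(X - mu, full_matrices=False)
X_affine = (Uc[:, :k] * sc[:k]) @ Vct[:k] + mu

print(np.linalg.norm(X - X_linear))  # error of the best linear fit
print(np.linalg.norm(X - X_affine))  # error of the best affine fit (smaller here)
```

Because every linear subspace is also an affine one (with zero offset), the affine fit's reconstruction error can never exceed the linear fit's, and it is noticeably smaller whenever the centroid is far from the origin.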

In the middle of his presentation, I raised my hand and asked Prof. Trosset why we need to do this translation. He explained that it is done to convert cosine similarities into Pearson product-moment correlation coefficients.
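The point is easy to check numerically. A minimal sketch, using two made-up vectors: Pearson's product-moment correlation is exactly the cosine similarity computed after mean-centering, which is why PCA's translation step turns one measure into the other.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up vectors for illustration.
x = np.array([2., 4., 6., 9.])
y = np.array([1., 3., 5., 6.])

# Pearson's r is the cosine similarity of the mean-centered vectors.
r_centered_cosine = cosine(x - x.mean(), y - y.mean())
r_reference = np.corrcoef(x, y)[0, 1]
print(r_centered_cosine, r_reference)  # the two values agree
```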

Case closed.

This is a legacy post, originally published on 11/8/2006.
