SVD and LSI Tutorial 1: Understanding SVD and LSI

A tutorial on Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI): its advantages, applications and limitations. Covers LSI myths and misconceptions from search engine marketers.

Dr. E. Garcia

Topics

About this Tutorial
Search Engine Marketers and their LSI Myths
All Extremes Are Bad Advice
How this Tutorial is Organized
Background
On Dimensionality Reduction, LSI and Fractals
SVD and LSI Applications
A Geometrical Visualization
Summary
Tutorial Review
References

About this Tutorial

I wrote this tutorial to:
Search Engine Marketers and their LSI Myths

As with many IR topics, LSI is a subject that surfaces from time to time in the search engine marketing industry through forums, conferences and events. Often these discussions are limited to partially quoting or interpreting IR papers and patents. At the time of writing, this industry doesn't provide its members with stepwise how-to instructions for implementing LSI, even though most of the information is available online. Consequently, search marketers don't understand LSI. In particular, they don't seem to grasp the advantages and limitations of the technique, what is and is not LSI, or what it can and cannot do for them or their clients. The result is the dissemination of inaccurate information. For instance, some marketers have assigned meanings to the terms "latent" and "semantic" that are not in the LSI literature. Others have become "experts" at quoting each other's hearsay. In an effort to sell their services, still others have come up with "LSI-based" software, videos, "lessons", tools, etc., that are at best a caricature of how a search engine or IR system implements LSI. Whatever these tools score is probably not what a search engine like Google or Yahoo! might be scoring. Some of these marketing companies even display a tag cloud of words and try to sell the idea that they have a real or unique "LSI technology". Such clouds are easy to construct and link to search result pages; they can be built from any lookup list, thesaurus or search log file. No SVD is needed. In an effort to save face and avoid litigation from consumers, some of these purveyors of falsehood, like other crooks and their friends, play with words and call theirs an "LSI-like", "LSI-based" or "LSI-driven" technology, or use similar sneaky phrases. The funny thing is that other SEOs, bloggers and marketers fall for these tactics. And how can one forget the "LSI and Link Popularity" half-truths and nonsense promoted by those that offer link-based services?
As usual, SEO bloggers repeat such hearsay like parrots, since most of them don't really know how to SVD a simple matrix. Since there is now a crew of search marketing firms claiming to sell all sorts of "LSI-based" SEO services and making a profit out of the ignorance of consumers, I am making public my case against these firms in the post Latest SEO Incoherences (LSI) - My Case Against "LSI-based" Snake-oil Marketers. Stay away from such marketing firms, their claims and their businesses. By providing an SVD-LSI tutorial, complete with step-by-step how-to calculations and examples, I hope to put to rest the many myths and misquotes disseminated by these search marketers. Here is a list of the most common misconceptions: Latent Semantic Indexing (LSI) ...
This list of misconceptions, myths or plain lies was compiled from SEOBook, SearchEngineWatch, Cre8asiteforums, SEOChat, SEOMOZ, SeoRoundTable, Webmasterworld and similar forums. A sample of the Latest SEO Incoherences ("LSI") is available online. This spreading of incorrect knowledge through electronic forums gives rise to a bursting phenomenon that in the past we have referred to as blogonomies. In our view, knowledge, citation importance or link weight transmitted through such bursts can be considered corrupted. Thus, this tutorial aims to dispel search marketing blogonomies relevant to SVD and LSI. The fact is that query operators are not part of LSI. In addition to English text, LSI has been applied to text in Spanish and in other languages. LSI makes no presumptions about whether the words in documents or queries are or should be nouns, verbs, adjectives or other kinds of tokens. LSI is not on-topic analysis, or what SEOs like to call "theming". Current LSI algorithms ignore word order (term sequences), though a Syntagmatic Paradigmatic Model and Predication Algorithm has been proposed to work around this. Another misconception is that latent semantics is co-occurrence. Actually it is not; at least, not first-order co-occurrence. LSI works great at identifying terms that induce similarity in a reduced space, but research from Dr. Tom Landauer and his group at the University of Colorado (19) indicates that over 99% of word pairs whose similarity is induced never appear together in a paragraph. Readers should be reminded that synonyms, or terms conveying a synonymity association, don't tend to co-occur, but tend to occur in the same, similar or related contexts. While LSI itself is not co-occurrence, term co-occurrence is important in LSI studies. A persistent myth in search marketing circles is that LSI grants contextuality, i.e., terms occurring in the same context. This is not always the case.
Consider two documents X and Y and three terms A, B and C, wherein A co-occurs with C in X, B co-occurs with C in Y, but A and B never co-occur in the same document. However, just because terms A and B are in transit with C, this does not grant contextuality, as the terms can be mentioned in different contexts in documents X and Y. For example, this would be the case if X and Y discuss different topics. Long documents are more prone to this. Even if X and Y are monotopic, they might be discussing different subjects. Thus, it would be fallacious to assume that high-order co-occurrence between A and B while in transit with C equates to a contextuality relationship between the terms. Add polysemy to this and the scenario worsens, as LSI can fail to address polysemy. There are other things to think about. LSI is computationally expensive and its overhead is amplified with large-scale collections. Certainly LSI is not associative indexing or root (stem) indexing, as some have suggested. It is not document indexing; it is used with already indexed collections whose document terms have been pre-scored according to a particular term weight scheme. Furthermore, understanding a query, i.e., the assumption that the query must be of the natural language type, is not a requirement for implementing LSI. In addition, the claim that terms must come from a specific portion of a document, like title tags, anchor text, links or a specific URL domain, plays no role and is not a requisite for implementing LSI. These false concepts have been spread for a while, mostly by those that sell link-based services, who conveniently don't provide mathematical evidence on how LSI works since they cannot do the math. True, some papers on large-scale distributed LSI mention the word "domain" in connection with LSI, but the term is used in reference to information domains, not URL domains or what is known as web sites.
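The A-B-C pattern above can be made concrete with a small numpy sketch (the matrix and all counts are invented for illustration). It shows that A and B never co-occur directly (first-order count zero), yet are connected in transit through C (second-order count nonzero), and that a rank-2 reduction nonetheless assigns them high similarity, whether or not they actually share a context:

```python
import numpy as np

# Hypothetical term-document count matrix: term A appears only in document d1,
# term B only in d4, the shared term C in every document, and a fourth term D
# in d2 and d3.
#             d1  d2  d3  d4
M = np.array([
    [1., 0., 0., 0.],   # A
    [0., 0., 0., 1.],   # B
    [1., 1., 1., 1.],   # C
    [0., 1., 1., 0.],   # D
])

first_order = M @ M.T                      # direct co-occurrence counts
second_order = first_order @ first_order   # counts of length-2 paths (A -> C -> B)
print(first_order[0, 1])    # 0.0: A and B never co-occur directly
print(second_order[0, 1])   # 1.0: A and B are "in transit" through C

# Rank-2 reduction: represent term i by row i of U_k * S_k, compare by cosine.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
terms_k = U[:, :2] * s[:2]
cos = terms_k[0] @ terms_k[1] / (np.linalg.norm(terms_k[0]) * np.linalg.norm(terms_k[1]))
print(round(cos, 2))  # high similarity, induced purely by the co-occurrence pattern
```

Note that the reduced space assigns A and B a high similarity from the co-occurrence pattern alone; nothing in the math verifies that they share a context, which is exactly the caveat made above.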
True, LSI can be applied to collections that have been pre-categorized by web site domains, but this is merely filtering and pre-classification and is not part of the SVD algorithm used in LSI. Let me mention that the technique of singular value decomposition used in LSI is not an AI algorithm, but a matrix decomposition technique developed in the sixties, though SVD has been used in many environments, including AI (1-14). Roughly speaking, SVD itself is just one matrix decomposition technique. Certainly there is more than one way of decomposing and analyzing a given matrix, and plenty of alternative techniques are available online (e.g., LU, QR, etc.). True, SVD, like NMF (non-negative matrix factorization), has been used to conduct email forensics. True, SVD has been used as an eavesdropping tool for identifying word patterns from web communities (15-18), but LSI is not a secret weapon from the government designed to read your mind, at least not yet. :) Another misconception is that LSI is CIRCA, a technology developed by Applied Semantics (acquired by Google). As mentioned at the IR Thoughts blog, this is another SEO blogonomy. CIRCA is based on ontologies, not on SVD; LSI is based on SVD, not on ontologies. When you think it through, there is no such thing as an "LSI-friendly" document. This is just another SEO myth promoted by certain search engine marketing firms to better market whatever they sell. In the last tutorial of this series (SVD and LSI Tutorial 5: LSI Keyword Research and Co-Occurrence Theory), we explain in detail why there is no such thing as an "LSI-friendly" document and why SEOs cannot use LSI to optimize any document for ranking purposes.

All Extremes Are Bad Advice

Last, but not least, while marketers should not lose any sleep over LSI, they should not go to the other extreme and advise others to brush off the whole thing or ignore the power of language manipulation, as some have suggested.
A recent neuropsychology study (11/28/2006) presented at the Annual Meeting of the Radiological Society of North America found that consumers' brain activity can be conditioned even before they set foot in a store. This occurs by associating brands with keywords, which requires proper language manipulation techniques. The study found that popular brands activated parts of the brain linked to self-identity and reward. The findings help explain how the brain perceives and processes brands, aiding marketing initiatives. fMRI revealed that well-known brands, regardless of the product, activated parts of the brain associated with positive emotional processing, self-identity and reward. Less well-known brands activated parts of the brain associated with negative emotional response. Lead researcher Dr. Christine Born, a radiologist at University Hospital, part of the Ludwig-Maximilians University in Munich, Germany, said: "This is the first functional magnetic resonance imaging test examining the power of brands. We found that strong brands activate certain areas of the brain independent of product categories... The vision of this research is to better understand the needs of people and to create markets which are more oriented towards satisfaction of those needs." Marketing psychologist Paul Buckley, from the University of Wales Institute, Cardiff, said: "Marketing is all about learning by association - companies constantly push the same image over and over again from a range of media. So people associate a famous brand with positive imagery, and you would expect that positive imagery to trigger off positive emotional responses." In the December 2006 issue of IR Watch - The Newsletter we covered these studies in detail for our subscribers. Also, in the January 2007 issue we explained why marketers that dismiss the power of language manipulation don't really understand consumer behavior and how it is conditioned.
If you are a subscriber, read the complete reports in: IR Watch 2006-4: C-Indices and Keyword-Brand Associations

How this Tutorial is Organized

This tutorial is organized as follows. Part I is an introduction to SVD and LSI; you are reading it right now. All parts include plenty of illustrations, examples and exercises.

Assumptions and Prerequisites

I assume readers have a linear algebra background or understand the material covered in:
You may skip these tutorials if you are familiar with matrix decomposition. Most of the basic matrix operations and terminology to be used are explained or covered in these tutorials. If you are not familiar with linear algebra or have not read the tutorials, please STOP AND READ THEM. The teaching technique I will be using is the one described in previous tutorials, which mimics tf*IDF schemes: global knowledge is presented first and then associated with local, more specific knowledge. In addition, instead of lecturing about SVD I want to show you how things work, step by step. So, grab a pencil and a stack of paper. After manually computing SVDs for small matrices, you might want to use software to double-check the results. For large matrices (of order greater than 3) you might want to use a software package like MatLab, or an open source alternative like SciLab, which is a free download. For relatively small matrices I recommend matrix calculators like Bluebit and its Matrix ActiveX Component. You can use this component in your SVD or LSI project to impress others. To showcase on the Web the knowledge you are about to acquire, I recommend JavaScript utilities like this Singular Value Decomposition Calculator. Be aware that some of these tools come with their own learning curves. Grab your favorite drink and let the fun begin.

Background

In 1965 G. Golub and W. Kahan introduced Singular Value Decomposition (SVD) as a decomposition technique for calculating the singular values, pseudoinverse and rank of a matrix (1). The conventional way of doing this was to convert a matrix to row-echelon form. The rank of the matrix is then given by the number of nonzero rows or columns of the echelon form, whichever of these two numbers is smaller. SVD is an entirely different approach.
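As a quick numerical check of the rank discussion above, here is a small numpy sketch (the matrix is an invented example): the rank of a matrix equals its number of nonzero singular values, which SVD exposes directly.

```python
import numpy as np

# A 3x3 matrix whose third row is the sum of the first two, so its rank is 2.
A = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [5.0, 7.0, 9.0],
])

# The SVD route: the rank equals the number of nonzero singular values
# (in floating point, those above a small tolerance).
s = np.linalg.svd(A, compute_uv=False)
rank_svd = int(np.sum(s > 1e-10))

print(s)         # singular values in decreasing order; the last is ~0
print(rank_svd)  # 2, matching np.linalg.matrix_rank(A)
```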
The technique decomposes a matrix A into three new matrices

Equation 1: A = USV^T

where U is a matrix whose columns are the eigenvectors of the AA^T matrix (these are termed the left eigenvectors), S is a diagonal matrix whose diagonal entries are the singular values of A, and V is a matrix whose columns are the eigenvectors of the A^TA matrix (the right eigenvectors). The decomposition not only provides a direct method for computing the rank of a matrix, but exposes other equally interesting properties and features of matrices.

On Dimensionality Reduction, LSI and Fractals

When computing the SVD of a matrix it is often desirable to reduce its dimensions by keeping its first k singular values. Since these are ordered in decreasing order along the diagonal of S, and this ordering is preserved when constructing U and V^T, keeping the first k singular values is equivalent to keeping the first k rows of S and V^T and the first k columns of U. Equation 1 reduces to

Equation 2: A* = U* S* V^T*

This process is termed dimensionality reduction, and A* is referred to as the Rank k Approximation of A, or the "Reduced SVD" of A. This is exactly what is done in LSI. The top k singular values are selected as a means for developing a "latent semantics" representation of A that is now free from noisy dimensions. This "latent semantics" representation is a specific data structure in low-dimensional space in which documents, terms and queries are embedded and compared. This hidden or "latent" data structure is masked by noisy dimensions and becomes evident after the SVD. This discovery of "latent" data structures (or masked information) can also be achieved when SVD is applied to problems and systems that have nothing to do with documents, queries, textual data or word semantics at all. Examples are systems encountered in Biology, Chemistry, Engineering, Medical Sciences and other Applied Sciences. A search in Google reveals that SVD is a well-known chemometric technique that has even been applied to HPLC Chemistry (high performance, or pressure, liquid chromatography). This is not surprising.
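Equations 1 and 2 can be verified numerically in a few lines of numpy (the matrix below is an arbitrary example):

```python
import numpy as np

A = np.array([
    [1., 1., 0.],
    [1., 0., 0.],
    [0., 1., 1.],
    [0., 0., 1.],
])

# Equation 1: A = U S V^T. numpy returns the singular values as a vector s,
# already sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True: exact reconstruction

# Equation 2: keep the first k singular values, the first k columns of U
# and the first k rows of V^T to get the Rank k Approximation A*.
k = 2
A_star = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_star, 2))  # a rank-2 matrix close to A
```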
In a sense, dimensionality reduction is a noise reduction process. Thus, SVD belongs to a class of dimensionality reduction techniques that deal with the uncovering of latent data structures. Other techniques for the discovery of latency have been developed, and these compare favorably with SVD. An example is Non-Negative Matrix Factorization (NMF) (15, 17). Dimensionality reduction is a somewhat arbitrary process. Choosing how many k dimensions to keep can lead to the so-called "dimensionality reduction curse", in which performance is affected. To deal with this "curse", workarounds based on Fractal Geometry have been proposed. For instance, Kumaraswamy has suggested that the loss of performance of a method in LSI (using dimensionality reduction) and in data storage (using vector quantization) can be related to the fractal dimension of the data set under consideration (2). This and other workarounds (19) for dealing with the "dimensionality reduction curse" are outside the scope of this tutorial.

SVD and LSI Applications

Since its introduction more than 40 years ago, SVD has become a standard decomposition technique in linear algebra. It is a great technique for uncovering hidden or "latent" data structures while removing noise. Many problems encountered in Engineering and the Sciences have been addressed with SVD (3-7). In 1988 Deerwester et al. used SVD to deal with the vocabulary problem in human-computer interaction and called their approach Latent Semantic Indexing (LSI) (8, 9), also known as LSA (Latent Semantic Analysis). Thus, LSI is one application of SVD, in the same way that IS (Information Space) is only one application of PCA (Principal Component Analysis), a technique often mistaken for LSI. The LSI, SVD, PCA and IS acronyms have been discussed in our IR Thoughts blog (10). LSI has several bottlenecks associated with storage considerations, difficulties in upgrading the underlying database, and other speed and reliability considerations.
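The noise-reduction claim made above can be illustrated with a quick simulation (all numbers are synthetic): build an exactly rank-2 matrix, add small random noise, and check that the rank-2 truncated SVD of the noisy matrix lies closer to the clean one than the noisy matrix itself does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an exactly rank-2 "signal" matrix, then bury it in small random noise.
B = rng.standard_normal((20, 2))
C = rng.standard_normal((2, 15))
signal = B @ C
noisy = signal + 0.01 * rng.standard_normal((20, 15))

# Keeping the top k=2 singular values strips most of the noise.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 2
denoised = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(err_denoised < err_noisy)  # True: the rank-2 approximation is closer to the signal
```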
Since both SVD and LSI are computationally expensive, LSI is not yet a practical solution for large commercial collections that must return results in under a second. Even though Telcordia has proposed a Distributed LSI approach based on partitioning collections into conceptual information domains, it remains to be seen whether such an approach is feasible in a commercial environment full of marketing noise and vested interests (11). On the Web, SVD and LSI have been used by search engines as auxiliary techniques on representative samples or small collections that have been pretreated. An example of this is Yahoo!'s pay-for-performance paid collection (12). It remains to be seen whether advances in the field and better architectures will make LSI suitable for search engine collections consisting of billions of documents, or for searchable collections of any size. LSI is great at addressing synonymy (different words with the same meaning), but can fail to address polysemy (same word with different meanings). A solution that combines LSI with SI (synchronic imprints) has been proposed (13). A Yahoo! Research group (14) has proposed an improved algorithm known as VLSI (Variable Latent Semantic Indexing). Now that we have a brief background on SVD and LSI, let's discuss how singular value decomposition works. The best way to grasp this is through a visual example.

A Geometrical Visualization

The geometrical interpretation of SVD has been described elsewhere. I'm going to use a simple analogy. Picture an arrow rotating in two dimensions and describing a circle. See Figure 1.

Figure 1. A rotating arrow of fixed length describing a circle.

Now suppose that while rotating this arrow we stretch and squeeze it in a variable manner, up to a maximum and minimum length. Instead of a circle, the arrow now describes a two-dimensional ellipsoid.

Figure 2. A rotating arrow of variable length describing an ellipsoid.

This is exactly what a matrix does to a vector.
When we multiply a vector X by a matrix A to create a new vector AX = Y, the matrix performs two operations on the vector: rotation (the vector changes coordinates) and scaling (the length of the vector changes). In terms of SVD, the maximum stretching and minimum squeezing are determined by the singular values of the matrix. These are values inherent, unique or "singular" to A that can be uncovered by the SVD algorithm. In the figure, the effects of the two largest singular values, s_1 and s_2, have been labeled. In Part 2 of this tutorial we will be computing these values. For now, the important thing to remember is this: singular values describe the extent to which the matrix distorts the original vector and, thus, can be used to highlight which dimension(s) is/are affected the most by the matrix. Evidently, the largest singular value has the greatest influence on the overall dimensionality of the ellipsoid. In geometric terms, Figure 1 and Figure 2 show that multiplying a vector by a matrix is equivalent to transforming (mapping) a circle into an ellipsoid. Now, instead of a rotating vector, picture a set of m-dimensional vectors, all perpendicular (orthogonal), with different lengths and defining an m-dimensional ellipsoid. The ellipsoid is embedded in an m-dimensional space, where m = k, with k being the number of nonzero singular values. If we eliminate dimensions by keeping the three largest singular values, a three-dimensional ellipsoid is obtained. This is a Rank 3 Approximation. Now suppose that, compared with the two largest singular values, the third one is small enough that we ignore it. This is equivalent to removing one dimension: geometrically, the 3-dimensional ellipsoid collapses into a 2-dimensional ellipsoid, and we obtain a Rank 2 Approximation. Obviously, if we keep only one singular value the ellipsoid collapses into a one-dimensional shape (a line). Figure 3 roughly illustrates the reduction.

Figure 3. Dimensionality Reduction.
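The stretching interpretation can be sketched numerically with an arbitrary 2x2 matrix: sampling many unit vectors x and measuring |Ax| recovers the maximum and minimum stretch, which match the singular values computed by SVD.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Probe the unit circle: for many unit vectors x, measure how much A stretches them.
angles = np.linspace(0.0, 2.0 * np.pi, 10000)
circle = np.vstack([np.cos(angles), np.sin(angles)])  # points on the unit circle
lengths = np.linalg.norm(A @ circle, axis=0)          # |Ax| for each unit vector x

s = np.linalg.svd(A, compute_uv=False)
print(lengths.max(), lengths.min())  # ~ s_1 and s_2: the extremes of the ellipse
print(s)                             # the singular values of A
```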
This is why we say that keeping the top k singular values is a dimensionality reduction process. The main idea is this: singular values can be used to highlight which dimensions are affected the most when a vector is multiplied by a matrix. Thus, the decomposition allows us to determine which singular values can be retained; hence the expression "singular value decomposition". In the case of LSI, SVD unveils the hidden or "latent" data structure, in which terms, documents and queries are embedded in the same low-dimensional space. This is a significant departure from term vector models. In the vector space model, each unique term from the index is considered a dimension of a term space. Documents and queries are then represented as vectors in this term space, and terms are considered independent from each other. However, this is contrary to common knowledge, since term dependency can arise in written or spoken language in many different ways. The most common are through (a) synonymity and (b) polysemy. LSI performs better than vector space models by embedding documents, terms and queries in the same space. This addresses the problem of high-order co-occurrence, such as in-transit co-occurrence. In-transit co-occurrence accounts for the fact that terms do not need to actually be in a document to be relevant to its content; they can co-occur while "in transit". An example of this is synonyms. Synonyms don't tend to co-occur together (often they have both low pairwise c-indices and low Jaccard indices from a cluster association matrix), but they do tend to co-occur in the same context; this can be inspected by measuring their high-order c-indices. It should be pointed out that "in-transit" co-occurrence is not unique to synonyms. Terms not related by a synonymity association can be in transit, and their high-order co-occurrence can be examined with c-indices. This is illustrated in Figure 4. The Venn diagrams represent documents relevant to specific terms.
The (dog, canine) and (canine, molar) pairs have a synonymity association, but this association is not present in the (dog, molar) pair or between the other pairs. The figure also shows that in-transit co-occurrence can be observed not only between individual tokens, but between distinct contexts.

Figure 4. In-transit co-occurrence between tokens and contexts.

To simplify, in the figure we have ignored other forms of co-occurrence between n-grams, like multiple overlapping regions, which can actually give rise to an LSI "curse". While LSI addresses synonyms, it can fail to address polysemes (terms with different meanings), since dimensionality reduction can incorrectly conflate critical dimensions. So the popular opinion that LSI perfectly addresses synonyms and polysemes is incorrect. Selection of k values and of the right cluster is done arbitrarily, by trial and error, and as such might become a "curse". Several workarounds have been proposed, but I prefer not to discuss these at this time.

Summary

In this part of the tutorial we have introduced SVD and LSI. Several myths and misconceptions regarding LSI have been mentioned. Some limitations, advantages and applications of SVD and LSI have been discussed. A geometrical visualization for interpreting the effect of a matrix on a vector has been provided. It is now time to go over the mathematical basis and justification of the SVD technique. This will be covered in Parts II and III of the tutorial. In particular, we want to show you how singular values are computed, along with the stepwise how-to calculations leading to the decomposition, reconstruction and approximation of matrices.

Next: SVD and LSI Tutorial 2: Computing Singular Values

Tutorial Review
References

