Latent Semantic Analysis (LSA) is a way of finding patterns among a collection of documents such as web pages. It is increasingly used by major search engines, such as Google, in ranking websites and determining what AdSense ads to show on a page.
To see how Latent Semantic Analysis works, imagine that you have a collection of documents such as web pages, or in the simple example we show below, book titles. How would you go about finding similarities and differences between the documents?
The Basic Idea
One way is to form a large matrix in which each column represents a document and each row represents a word that has been extracted from the documents. Each cell of the matrix then holds the number of times that word appears in that document. For example, if the word “farm” appears in the first document 7 times, then the cell in the “farm” row and the first column would have a 7 in it.
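The word-by-document count matrix described above can be sketched in a few lines of Python. This is only a minimal illustration: the three sample documents and the helper name `build_count_matrix` are made up for this example, and splitting on whitespace stands in for whatever word extraction a real system would use.

```python
from collections import Counter

# Hypothetical toy documents, invented for illustration only.
docs = [
    "the farm and the farm animals",
    "investing in farm land",
    "investing for beginners",
]

def build_count_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the count of word i in document j."""
    # Count the words in each document (naive lowercase + whitespace split).
    counts = [Counter(doc.lower().split()) for doc in documents]
    # One row per distinct word, in alphabetical order.
    vocabulary = sorted({word for c in counts for word in c})
    # One column per document; a Counter returns 0 for words it never saw.
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix

vocabulary, matrix = build_count_matrix(docs)
for word, row in zip(vocabulary, matrix):
    print(f"{word:10s} {row}")
```

Here the row for “farm” holds one count per document, so reading across a row shows where a word is used and reading down a column shows what a document contains.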
The cell numbers are usually massaged, so that whatever patterns are present can be seen more clearly. This step corresponds to “cleaning the data” so that, for example, frequent words are not weighted too heavily, some natural language constructs are simplified, and long documents don’t have an unfair advantage. Some ways that cell numbers are massaged are:
- Log of Counts – the log of the counts in each cell may be used instead of the actual counts.
- Stemming – words that share the same root (such as golf and golfing) are treated as the same word.
- TF-IDF – (term frequency – inverse document frequency) attempts to measure the importance of a term by weighting how often it appears in a document against how many documents contain it.
- Entropy – another way to measure term importance based on the distribution of the term through documents.
- Normalization – sets each document vector to length 1 so documents with more words don’t have an unfair advantage.
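Three of the adjustments listed above (log of counts, TF-IDF, and normalization) can be sketched as follows. The formulas are common textbook variants chosen as assumptions for this example: `log(1 + count)` for the log weighting and `count * log(N / df)` for TF-IDF, where `N` is the number of documents and `df` is the number of documents containing the term. Real systems often use other variants, and stemming and entropy weighting are left out here. The function names and the small count matrix are hypothetical.

```python
import math

def log_counts(matrix):
    """Replace each count c with log(1 + c), so zero counts stay zero."""
    return [[math.log(1 + c) for c in row] for row in matrix]

def tf_idf(matrix):
    """Weight each count by log(N / df), one common TF-IDF variant."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for c in row if c > 0)  # documents containing this word
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([c * idf for c in row])
    return weighted

def normalize_columns(matrix):
    """Scale each document (column) vector to Euclidean length 1."""
    n_docs = len(matrix[0])
    lengths = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_docs)]
    return [
        [row[j] / lengths[j] if lengths[j] else 0.0 for j in range(n_docs)]
        for row in matrix
    ]

# Hypothetical counts: rows are words, columns are documents.
counts = [
    [2, 1, 0],  # "farm"
    [0, 1, 1],  # "investing"
    [2, 0, 0],  # "the"
]

weighted = normalize_columns(tf_idf(counts))
for row in weighted:
    print([round(value, 3) for value in row])
```

With this TF-IDF variant, a word that appears in every document receives a weight of zero, which is one way that very frequent words are kept from dominating.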