Introduction to Latent Semantic Indexing for Text via Singular Value Decomposition

cuML + Latent Semantic Indexing (LSI)

Latent semantic indexing (LSI) is the process of extracting and analyzing documents in order to create a representation that captures the similarity of words and documents. LSI assumes that words with similar meanings occur in similar contexts, so it uses Singular Value Decomposition (SVD) to reduce the dimensionality of the word-usage representation of the entire document collection. This allows documents that are semantically similar but use different words to be re-represented as more similar in the reduced vector space. We recommend the following video introduction to latent semantic indexing (LSI) and Singular Value Decomposition (SVD).

cuML is RAPIDS' suite of GPU-accelerated machine learning algorithms that mirrors sklearn's API. The entire suite of what cuML offers can be found here.

Let's first import the data we will be using for this lab: sklearn's 20newsgroups dataset. You can read the documentation here. When fetching the training and testing data, make sure to remove headers, footers, and quotes.

Let's see what categories the newsgroup documents fall under.
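The fetch and category-listing steps above can be sketched as follows (a minimal sketch assuming sklearn's `fetch_20newsgroups`; the corpus is downloaded and cached on first call):

```python
from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers, and quoted replies so models can't key on metadata
remove = ('headers', 'footers', 'quotes')
train = fetch_20newsgroups(subset='train', remove=remove)
test = fetch_20newsgroups(subset='test', remove=remove)

# The categories the newsgroup documents fall under
print(train.target_names)
```

`target_names` lists the 20 newsgroup categories, and `train.target` holds the category index for each document.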

Now we have to turn our newsgroup documents into a representation that we can run SVD on. For that, we can use TF-IDF, or term frequency-inverse document frequency. TF-IDF extracts information from a document by counting the occurrences of each word and then scaling down the impact of words that appear very frequently. The scaling prevents words that appear very frequently yet often carry little information, such as "and", "so", and "the", from becoming significant words in a document. More about TF-IDF can be read here.

We will use cuML's TF-IDF vectorizer. The API for TfidfVectorizer can be found within cuML's documentation. Fit the vectorizer with the training data and transform both the training and testing data. Make sure to convert the transformed data into numpy arrays.
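A minimal sketch of the fit/transform pattern, shown here with sklearn's `TfidfVectorizer` on toy documents so it runs on CPU (cuML's `TfidfVectorizer` mirrors this API; with cuML the result lives on the GPU, so an extra conversion such as `cupy.asnumpy` is needed to get numpy arrays):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer  # cuML mirrors this API

docs_train = ["the gpu accelerates training", "svd reduces dimensionality"]
docs_test = ["gpu training is fast"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train).toarray()  # fit on training data only
X_test = vectorizer.transform(docs_test).toarray()        # reuse the fitted vocabulary

print(X_train.shape, X_test.shape)
```

Both matrices share the vocabulary learned from the training data; test-set words that never appeared in training are simply dropped.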

Display our test data as a pandas DataFrame. Don't forget to add the column names for pandas.

There are quite a few features in our bag of words, more than 100 thousand! We will cut it down a bit to help our runtime in the following steps using sklearn's SelectPercentile function. SelectPercentile keeps only the top-scoring percentile of features, allowing us to discard the less important ones. Be sure to fit the model on our training data and the training data's target, and don't forget to transform the needed datasets.

After applying our SelectPercentile function, we have removed 90% of the initial features. The column headers will need to be updated to reflect the selected features (hint: look at the methods available to SelectPercentile).
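A sketch of the selection step on stand-in data (the `chi2` score function is an assumption; the lab may use a different scorer, but the fit/transform and `get_support` pattern is the same). `percentile=10` keeps the top 10% of features, i.e. removes 90%:

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

rng = np.random.default_rng(0)
X_train = rng.random((20, 100))        # stand-in TF-IDF matrix (non-negative, as chi2 requires)
y_train = rng.integers(0, 2, size=20)  # stand-in targets

# percentile=10 keeps the top 10% of features, i.e. removes 90%
selector = SelectPercentile(chi2, percentile=10)
X_train_sel = selector.fit_transform(X_train, y_train)

# Indices of the surviving features, usable to subset the original column names
kept = selector.get_support(indices=True)
print(X_train_sel.shape, kept[:5])
```

`get_support(indices=True)` is the method hinted at above: it returns the indices of the kept features, which you can use to index into the vectorizer's feature names and rebuild the column headers.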

Let's take a look at our new data.

In order to run SVD on our vectorized training data, we will need to use cuML's TruncatedSVD API. Set the number of components to 100, the number of iterations to 25, and the algorithm to 'jacobi'. Then fit the model with the training dataset and transform the training dataset. Documentation on cuML's TruncatedSVD can be found here.
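A sketch of this step on stand-in data; the cuML call with the lab's parameters is shown in a comment, and sklearn's `TruncatedSVD` (which has no Jacobi solver, so `algorithm='randomized'` is substituted) is used so the sketch runs on CPU:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_train = rng.random((200, 500))  # stand-in for the selected TF-IDF features

# cuML version with the lab's parameters (assumed argument names):
#   from cuml.decomposition import TruncatedSVD
#   svd = TruncatedSVD(n_components=100, n_iter=25, algorithm='jacobi')
# sklearn equivalent so this sketch runs on CPU:
svd = TruncatedSVD(n_components=100, n_iter=25, algorithm='randomized')
X_train_svd = svd.fit_transform(X_train)

print(X_train_svd.shape, svd.singular_values_.shape)
```

After fitting, `svd.components_` holds the component-by-feature weight matrix and `svd.singular_values_` the 100 singular values.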

After fitting our SVD model, let's visualize our singular values. Plot the singular values in descending order.
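A plotting sketch using matplotlib on stand-in values (substitute `svd.singular_values_` from the fitted model):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Stand-in for svd.singular_values_
singular_values = np.random.default_rng(0).random(100) * 50

plt.figure(figsize=(8, 4))
plt.plot(np.sort(singular_values)[::-1], marker='.')  # descending order
plt.xlabel("Component")
plt.ylabel("Singular value")
plt.title("Singular values (descending)")
plt.savefig("singular_values.png")
```

On real TF-IDF data the curve typically drops off sharply, which is why a modest number of components captures most of the variance.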

Let's take a look at what words are associated with each component, using the 4th most significant component as our example (index 3). Remember that the larger a component's associated singular value, the more significant the component, and the word weights for each component can be found in the components_ of TruncatedSVD. Print the top 25 words in the 4th component. Finding the top words for a component can be achieved by sorting its weights from greatest to least and identifying the ordering of the features in the sorted arrangement (hint: use argsort).
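The argsort pattern, sketched on stand-in data (the `feature_names` array is hypothetical; in the lab it would come from the vectorizer's vocabulary after SelectPercentile):

```python
import numpy as np

rng = np.random.default_rng(0)
components = rng.standard_normal((100, 1000))  # stand-in for svd.components_
feature_names = np.array([f"word{i}" for i in range(1000)])  # hypothetical vocabulary

component = components[3]                   # 4th most significant component (index 3)
top_idx = np.argsort(component)[::-1][:25]  # feature indices, greatest weight first
top_words = feature_names[top_idx]
print(top_words)
```

`argsort` returns indices in ascending order, so reversing with `[::-1]` gives the greatest-to-least ordering before slicing off the top 25.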

Now let's take the 501st datapoint (index 500) from our testing dataset. We will set this as our "search" document. We will find the document in the training set most closely related to it, but first, let's take a look at the content of this document and its classification.

We will now reduce and then transform our search document into its component form to compare it with our reduced and transformed training set. First, reduce the datapoint to match our training set by transforming it with our fitted SelectPercentile model. Then, because our training documents have been transformed from containing features to containing components, we need to convert our search document to the same representation. We can do this by taking the dot product of the TF-IDF representation of our search document with our SVD components (be sure the dimensions match).
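The projection step, sketched with stand-in arrays: `components_` has shape (n_components, n_features), so the document row must be multiplied against its transpose for the dimensions to line up:

```python
import numpy as np

rng = np.random.default_rng(0)
components = rng.standard_normal((100, 1000))  # svd.components_: (n_components, n_features)
search_tfidf = rng.random((1, 1000))           # TF-IDF row for the search doc, after SelectPercentile

# Project into component space: (1, n_features) @ (n_features, n_components)
search_svd = search_tfidf @ components.T
print(search_svd.shape)
```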

To find the document in our training set most similar to our chosen document, we will run a cosine similarity function between our transformed search document and the transformed training set. Once you have the cosine similarity values, order the training set documents from most to least similar to our search document.
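One way to do this, sketched with sklearn's `cosine_similarity` on stand-in arrays (the lab may instead use cuML's metrics; the pattern is the same):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X_train_svd = rng.standard_normal((500, 100))  # stand-in transformed training set
search_svd = rng.standard_normal((1, 100))     # stand-in transformed search document

sims = cosine_similarity(search_svd, X_train_svd).ravel()  # one score per training doc
order = np.argsort(sims)[::-1]                             # most to least similar
print(order[:3], sims[order[:3]])
```

`order[:3]` then gives the indices of the three training documents most similar to the search document, which can be used to look up their text and categories.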

What are the contents of the top 3 most similar documents in the training set? What categories do they fall under? Do they look similar to our initial search document?