Truncated SVD can be used as a dimensionality reduction algorithm, similar to PCA (Principal Component Analysis).
Truncated SVD on Amazon Fine Food Reviews Analysis
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:
- Id
- ProductId - unique identifier for the product
- UserId - unique identifier for the user
- ProfileName
- HelpfulnessNumerator - number of users who found the review helpful
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
- Score - rating between 1 and 5
- Time - timestamp for the review
- Summary - brief summary of the review
- Text - text of the review
Objective:
- Apply Truncated SVD on only this feature set:
- SET 2: Review text, preprocessed and converted into vectors using TF-IDF
- Procedure:
- Take the top 2000 or 3000 features from the TF-IDF vectorizer using the idf_ scores.
- You need to calculate the co-occurrence matrix with the selected features. (Note: X.X^T doesn't give the co-occurrence matrix; it returns the covariance matrix. Check these blogs for more information: blog-1, blog-2.)
- You should choose the n_components in Truncated SVD with maximum explained variance. Please research how to choose it and implement that. (Hint: plot the cumulative explained variance ratio.)
- After you are done with Truncated SVD, apply K-Means clustering and choose the best number of clusters using the elbow method.
- Print wordclouds for each cluster, similar to those in the previous assignment.
- Write a function that takes a word and returns the most similar words using cosine similarity between the vectors (vector: a row in the matrix after Truncated SVD).
