Truncated SVD can be used as a dimensionality reduction algorithm, similar to PCA (Principal Component Analysis).
Truncated SVD on Amazon Fine Food Reviews Analysis
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:
- Id
- ProductId - unique identifier for the product
- UserId - unique identifier for the user
- ProfileName
- HelpfulnessNumerator - number of users who found the review helpful
- HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
- Score - rating between 1 and 5
- Time - timestamp for the review
- Summary - brief summary of the review
- Text - text of the review
Objective:
- Apply Truncated SVD on only this feature set:
- SET 2: Review text, preprocessed and converted into vectors using TF-IDF
- Procedure:
- Take the top 2000 or 3000 features from the TF-IDF vectorizer using the idf_ scores.
- You need to calculate the co-occurrence matrix with the selected features. (Note: X.X^T doesn't give the co-occurrence matrix; it returns the covariance matrix. Check these blogs for more information: blog-1, blog-2.)
- You should choose the n_components in Truncated SVD with maximum explained variance. Please research how to choose it and implement that. (Hint: plot the cumulative explained variance ratio.)
- After you are done with Truncated SVD, apply K-Means clustering and choose the best number of clusters using the elbow method.
- Print wordclouds for each cluster, similar to those in the previous assignment.
- Write a function that takes a word and returns the most similar words using cosine similarity between the vectors (vector: a row in the matrix after Truncated SVD).
