Classify genetic variations (mutations) using evidence from text-based clinical literature, so that treatment for a cancer patient can be personalized based on the predicted class probabilities.
Description
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/
Data: Memorial Sloan Kettering Cancer Center (MSKCC)
Download training_variants.zip and training_text.zip from Kaggle.
Context:
Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462
Problem statement:
Classify the given genetic variations/mutations based on evidence from text-based clinical literature.
- Apply all the models with TF-IDF features (replace CountVectorizer with TfidfVectorizer and rerun the same cells)
- Instead of using all the words in the dataset, use only the top 1000 words based on TF-IDF values
- Apply Logistic regression with CountVectorizer Features, including both unigrams and bigrams
- Try any of the feature engineering techniques discussed in the course to reduce the CV and test log-loss to a value less than 1.0
Citation: Assignment is given by www.appliedaicourse.com
Real-world/Business objectives and constraints.
- No low-latency requirement.
- Interpretability is important.
- Errors can be very costly.
- Probability of a data-point belonging to each class is needed.
Solution (Python Code Below)
Performance metrics used are below
- Multi class log-loss
- Confusion matrix
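For reference, both metrics can be computed in a few lines. The sketch below uses plain Python and made-up predictions over three classes; the function names are illustrative, and in practice `sklearn.metrics.log_loss` and `sklearn.metrics.confusion_matrix` do the same job.

```python
import math


def multiclass_log_loss(y_true, probas, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probas):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical predictions: each row is a probability vector over 3 classes.
y_true = [0, 1, 2, 1]
probas = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
y_pred = [row.index(max(row)) for row in probas]

loss = multiclass_log_loss(y_true, probas)
cm = confusion_matrix(y_true, y_pred, n_classes=3)
```

Log loss penalizes confident wrong probabilities heavily, which fits the requirement that per-class probabilities are needed and that errors are costly.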
Results observations:
+--------------------------------------------------+--------------------+
| Model                                            | MulticlassLogLoss  |
+--------------------------------------------------+--------------------+
| Naive Bayes                                      | 1.1638163227108387 |
| k-NN                                             | 1.2423881536416765 |
| LogisticRegression - Class Balancing             | 0.9781521813338006 |
| LogisticRegression - no Class Balancing          | 1.0067362213864928 |
| Linear SVM                                       | 1.021296887329438  |
| Random Forest                                    | 1.1125836846681483 |
| StackingClassifier                               | 1.0409334855214087 |
| Maximum VotingClassifier                         | 1.0962373174694713 |
| LR CountVectorizer Features - class balancing    | 1.1163326974683052 |
| LR CountVectorizer Features - no class balancing | 1.1416945419013773 |
+--------------------------------------------------+--------------------+
- The first eight models in the table above use TF-IDF vectorization; the last two use bag-of-words (CountVectorizer) vectorization.
- Logistic regression with class balancing on TF-IDF features performs best, with a multiclass log loss of 0.9781. It is the only model in the table that brings the log loss below 1.
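A minimal sketch of a class-balanced logistic regression on TF-IDF features is shown below. It assumes scikit-learn's `LogisticRegression` with `class_weight="balanced"`; the exact estimator, hyperparameters, and calibration used to obtain the score above are not stated here, and the texts and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, deliberately imbalanced toy data (class 1 dominates).
texts = [
    "brca1 loss of function in breast cancer",
    "brca1 truncating mutation in ovarian cancer",
    "brca2 loss of function in breast cancer",
    "brca1 deletion reduces repair activity",
    "egfr gain of function in lung cancer",
    "egfr amplification in lung tumors",
    "kras neutral variant with no functional effect",
]
labels = [1, 1, 1, 1, 2, 2, 3]

# class_weight="balanced" reweights samples inversely to class frequency,
# so rare classes contribute as much to the loss as common ones.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(texts, labels)

# One probability per class, in the order given by the classifier's classes_.
probs = model.predict_proba(["egfr mutation in lung cancer"])
classes = list(model[-1].classes_)
```

A linear model like this also supports the interpretability requirement, since each word's learned coefficient per class can be inspected directly.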
Code: see the GitHub repository.