Classifying Genetic Mutations to Redefine Cancer Treatment

 

Classify the given genetic variations/mutations based on evidence from text-based clinical literature so that the personalized treatment can be provided to the cancer patient based on predicted class probabilities.

Description

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/

Data: Memorial Sloan Kettering Cancer Center (MSKCC)

Download training_variants.zip and training_text.zip from Kaggle.

Context:

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

 
Problem statement :

Classify the given genetic variations/mutations based on evidence from text-based clinical literature. 

  1. Apply All the models with tf-idf features (Replace CountVectorizer with tfidfVectorizer and run the same cells)
  2. Instead of using all the words in the dataset, use only the top 1000 words based of tf-idf values
  3. Apply Logistic regression with CountVectorizer Features, including both unigrams and bigrams
  4. Try any of the feature engineering techniques discussed in the course to reduce the CV and test log-loss to a value less than 1.0

Citation: Assignment is given by www.appliedaicourse.com

Real-world/Business objectives and constraints.

  • No low-latency requirement.
  • Interpretability is important.
  • Errors can be very costly.
  • Probability of a data-point belonging to each class is needed.

 

Solution (Python Code Below)

Performance metrics used are below

  • Multi class log-loss
  • Confusion matrix  

Results observations:

 Code on Github Repository - Click here!

 

Previous Post Next Post