Data Analysis on Classifying the Severity of Genetic Mutations

By: Kelly Pleiman, Advisor: Dr. Ying-Ju Chen

[Image: computational biology illustration]

Overview

The goal of my Capstone was to find a computational model that classifies the severity of a genetic mutation using the gene the mutation occurs on, the type of variation, and the clinical evidence used to classify the mutation.

Data Obtained From a Kaggle.com Research Prediction Competition Included:

  1. Gene - the gene the genetic mutation is located on (e.g. CBL, RUNX1)
  2. Variation - the specific amino acid change that caused the mutation (e.g. Q249E means the glutamine (Q) at position 249 was replaced by glutamic acid (E); a small parsing sketch follows this list)
  3. Text - clinical evidence in the form of text data
  4. Class - the severity of the mutation, from 1 to 9
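To make the notation concrete, here is one way the variation codes could be parsed. This parse_variation helper is purely illustrative and was not part of the original analysis; entries such as "Truncating Mutations" or "Fusions" do not follow the single-letter pattern and simply return None.

import re

def parse_variation(variation):
    #match: original amino acid, position, new amino acid (or * for a stop codon)
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z*])", variation)
    if match is None:
        return None
    original_aa, position, new_aa = match.groups()
    return original_aa, int(position), new_aa

print(parse_variation("Q249E"))                 # ('Q', 249, 'E')
print(parse_variation("W802*"))                 # ('W', 802, '*')
print(parse_variation("Truncating Mutations"))  # None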

Original Data

The original data came in two files:

  1. Variable data on the gene, variation, and class for each mutation ID
  2. Text data that corresponds to each mutation ID

Variable Data [Image: preview of the variable data file]

Text Data [Image: preview of the text data file]

Importing Data

Step one was using the pandas package to import the original variable data and text data into Jupyter Notebook.

Variable Data Importation Code

In [45]:
import pandas as pd 

#import the variable train data
trainVariants=pd.read_csv("training_variants.csv") 

#display trainVariants
trainVariants.head()
Out[45]:
ID Gene Variation Class
0 0 FAM58A Truncating Mutations 1
1 1 CBL W802* 2
2 2 CBL Q249E 2
3 3 CBL N454D 3
4 4 CBL L399V 4

Text Data Importation Code

In [46]:
#import the train text data file 
trainText=pd.read_csv("training_text.txt", engine="python", sep=r'\|\|', skiprows=1, names=["ID", "Text"]) 

#display trainText
trainText
Out[46]:
ID Text
0 0 Cyclin-dependent kinases (CDKs) regulate a var...
1 1 Abstract Background Non-small cell lung canc...
2 2 Abstract Background Non-small cell lung canc...
3 3 Recent evidence has demonstrated that acquired...
4 4 Oncogenic mutations in the monomeric Casitas B...
... ... ...
3316 3316 Introduction Myelodysplastic syndromes (MDS) ...
3317 3317 Introduction Myelodysplastic syndromes (MDS) ...
3318 3318 The Runt-related transcription factor 1 gene (...
3319 3319 The RUNX1/AML1 gene is the most frequent targe...
3320 3320 The most frequent mutations associated with le...

3321 rows × 2 columns

Cleaning Text Data

Next, the text data was cleaned, which is necessary for the later encoding step. The cleaning function:

  • gets rid of any special characters (e.g. *, /, -)
  • collapses any double spaces into single spaces
  • gets rid of frequently occurring "stop words" to simplify the text (e.g. the, and, or)

The function was then applied to every text cell.

Function to clean each cell of text:

In [48]:
import re
from nltk.corpus import stopwords  #requires the NLTK stopword list: nltk.download('stopwords')

#create a function that takes a cell string from the dataframe and cleans it (lowercase, no special characters, no extra spacing, no stop words)
def clean(cellString):   
    
    #convert to lower case 
    cellString= cellString.lower() 
    
    #convert any special characters to a space 
    cellString = re.sub(r'[^a-zA-Z0-9\n\.]', ' ', cellString) 
    
    #convert any runs of whitespace to a single space  
    cellString=re.sub(r'\s+', ' ', cellString)   
    
    #get rid of stop words (i.e. frequently occurring words such as "the", "and", "or")   
    stop_words= set(stopwords.words('english'))
    cleanedSentence=""
    for word in cellString.split():
        if word not in stop_words: 
            cleanedSentence= cleanedSentence+ str(word)+ " "
        
    return cleanedSentence

Applied Function to Every Cell

In [49]:
%%capture output
#run the cleaning function on each row of text data 
i=0  
while i < len(trainText): 
    
    #get the cell value as a string 
    value=str(trainText["Text"][i])  
    
    #if the cell is not empty, clean the string and write it back 
    if len(value)!=0:
        value=clean(value)
        trainText.loc[i, "Text"]=value  

    #update i 
    i=i+1

Final Cleaned Text Data

In [50]:
trainText
Out[50]:
ID Text
0 0 cyclin dependent kinases cdks regulate variety...
1 1 abstract background non small cell lung cancer...
2 2 abstract background non small cell lung cancer...
3 3 recent evidence demonstrated acquired uniparen...
4 4 oncogenic mutations monomeric casitas b lineag...
... ... ...
3316 3316 introduction myelodysplastic syndromes mds het...
3317 3317 introduction myelodysplastic syndromes mds het...
3318 3318 runt related transcription factor 1 gene runx1...
3319 3319 runx1 aml1 gene frequent target chromosomal tr...
3320 3320 frequent mutations associated leukemia recurre...

3321 rows × 2 columns

Merging and Encoding Data

The next step was merging the variable data and the cleaned text data on their corresponding ID, so that all of the features can be processed together.

Merging Data Code

In [51]:
#merge the trainVariant and cleaned trainText data 
data=pd.merge(trainVariants, trainText, how="outer", on=["ID"]) 

#display the merged data 
data.head()
Out[51]:
ID Gene Variation Class Text
0 0 FAM58A Truncating Mutations 1 cyclin dependent kinases cdks regulate variety...
1 1 CBL W802* 2 abstract background non small cell lung cancer...
2 2 CBL Q249E 2 abstract background non small cell lung cancer...
3 3 CBL N454D 3 recent evidence demonstrated acquired uniparen...
4 4 CBL L399V 4 oncogenic mutations monomeric casitas b lineag...

Encoding Data

After the files were merged, the categorical data had to be encoded into numerical data using scikit-learn's LabelEncoder. This step is necessary because the models used below can only process numerical inputs.

In [52]:
#encode our data to numerical values so we are able to run models on it 
from sklearn.preprocessing import LabelEncoder 

#encode gene, variation, and text 
geneEncoder=LabelEncoder() 
variationEncoder= LabelEncoder() 
textEncoder=LabelEncoder()  

#create a column for each of these new encodings in the data  
data["geneEn"]=geneEncoder.fit_transform(data["Gene"]) 
data['variationEn']=variationEncoder.fit_transform(data['Variation']) 
data['textEn']=textEncoder.fit_transform(data["Text"])

Merged and Encoded Data

In [53]:
data
Out[53]:
ID Gene Variation Class Text geneEn variationEn textEn
0 0 FAM58A Truncating Mutations 1 cyclin dependent kinases cdks regulate variety... 85 2629 532
1 1 CBL W802* 2 abstract background non small cell lung cancer... 39 2856 36
2 2 CBL Q249E 2 abstract background non small cell lung cancer... 39 1897 36
3 3 CBL N454D 3 recent evidence demonstrated acquired uniparen... 39 1667 1557
4 4 CBL L399V 4 oncogenic mutations monomeric casitas b lineag... 39 1447 1322
... ... ... ... ... ... ... ... ...
3316 3316 RUNX1 D171N 4 introduction myelodysplastic syndromes mds het... 221 306 970
3317 3317 RUNX1 A122* 1 introduction myelodysplastic syndromes mds het... 221 28 968
3318 3318 RUNX1 Fusions 1 runt related transcription factor 1 gene runx1... 221 807 1642
3319 3319 RUNX1 R80C 4 runx1 aml1 gene frequent target chromosomal tr... 221 2249 1646
3320 3320 RUNX1 K83E 4 frequent mutations associated leukemia recurre... 221 1333 702

3321 rows × 8 columns

Create Train and Test Sets

  • Train: the model learns from a random 80% of the data
  • Test: the fitted model is applied to the remaining 20% of the data to check its performance

Data Splits:

  1. Train Input- gene, variation, and text from 80% of data
  2. Train Output- corresponding class from same 80%
  3. Test Input- gene, variation, text from remaining 20% of data
  4. Test Output- corresponding class from remaining 20%

Input and Output Code

The input does not include Class, which is the desired output of the model.

In [56]:
#drop Class and non-encoded inputs 
Input=data.drop(["Gene", "Variation", "Text", "Class"], axis='columns')

#just numerical class data
Output=data["Class"]
In [57]:
display(Input.head()) 
display(Output.head())
ID geneEn variationEn textEn
0 0 85 2629 532
1 1 39 2856 36
2 2 39 1897 36
3 3 39 1667 1557
4 4 39 1447 1322
0    1
1    2
2    2
3    3
4    4
Name: Class, dtype: int64

Further Split Input and Output into Train and Test Sets

In [58]:
#split our data into train and test sets  
from sklearn.model_selection import train_test_split 
inputTrain, inputTest, outputTrain, outputTest= train_test_split(Input, Output, test_size=.2)
In [59]:
display(inputTrain.head()) 
display(outputTrain.head())
ID geneEn variationEn textEn
542 542 230 1256 1818
387 387 252 2041 1773
1617 1617 258 2809 643
2432 2432 31 1386 44
2087 2087 2 2202 325
542     1
387     1
1617    4
2432    1
2087    1
Name: Class, dtype: int64

Data Observations

Before applying any models, we first analyze the data to see if there are any noticeable patterns.

Basic Information (a quick verification sketch follows this list):

  • 3321 rows
  • 264 different genes, 2985 different variations
  • no missing values
  • every gene/variation combination is unique
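These facts could be checked directly on the merged dataframe; the following is a minimal sketch, not part of the original notebook, assuming the merged dataframe data from the previous section is in scope.

#quick sanity checks on the merged dataframe (illustrative sketch only)
print(len(data))                                            #number of rows
print(data["Gene"].nunique(), data["Variation"].nunique())  #distinct genes and variations
print(data.isnull().sum().sum())                            #total missing values
print(data.duplicated(subset=["Gene", "Variation"]).sum())  #duplicate gene/variation pairs (0 means every combination is unique)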

We mainly look at the encoded data:

In [61]:
dataEncoding=data.drop(["ID","Gene", "Variation", "Text"], axis='columns')  
dataEncoding
Out[61]:
Class geneEn variationEn textEn
0 1 85 2629 532
1 2 39 2856 36
2 2 39 1897 36
3 3 39 1667 1557
4 4 39 1447 1322
... ... ... ... ...
3316 4 221 306 970
3317 1 221 28 968
3318 1 221 807 1642
3319 4 221 2249 1646
3320 4 221 1333 702

3321 rows × 4 columns

Bar Chart
The following chart shows the count of the different classes of genetic mutations.

In [63]:
import matplotlib.pyplot as plt

#count how many mutations fall into each class (1-9) and plot the counts
classCounts= dataEncoding["Class"].value_counts().sort_index()
barchart= plt.bar(classCounts.index, classCounts.values) 
plt.title('Count of Classes') 
plt.xlabel("Class") 
plt.ylabel("Count")
Out[63]:
Text(0, 0.5, 'Count')

Heat Map

The following heat map shows the correlation between each pair of variables. Since the off-diagonal correlations are small and correlation only measures linear association, there appears to be little to no linear relationship between any two variables.

In [65]:
import seaborn as sns
#compute the pairwise correlations and plot them as a heat map
correlation= dataEncoding.corr() 
sns.heatmap(correlation, cmap='PuOr') 
Out[65]:
<AxesSubplot:>

Pivot Table

A pivot table can also be used to display different relationships between the variables; one possible construction is sketched below.
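The original pivot table output is not reproduced here. The following is a minimal sketch, assuming the merged dataframe data from above; the choice of index, columns, and aggregation is illustrative only.

import pandas as pd

#one possible pivot table: for each gene, count how many mutations fall into each class
pivot= pd.pivot_table(data, index="Gene", columns="Class", values="ID", aggfunc="count", fill_value=0)
pivot.head()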

Model 1: Decision Tree

A decision tree splits the data into nodes and sub-nodes based on similarities in gene, variation, and text in order to predict the class of the genetic mutation.
[Image: decision tree example]

https://venngage.com/blog/what-is-a-decision-tree/

The decision tree model for this data yields roughly 52-55% accuracy in predicting the class of genetic mutation (1-9); the exact value varies with the random train/test split.

Decision Tree Model Code:

In [68]:
#MODEL 1 IS THE DECISION TREE MODEL 
from sklearn.tree import DecisionTreeClassifier 
model1= DecisionTreeClassifier()  

#have the model fit our train data 
model1.fit(inputTrain, outputTrain)  

#have the model make predictions based off our test input 
predictions1=model1.predict(inputTest)

#see how accurate our predictions of test are by comparing with the test output   
from sklearn.metrics import accuracy_score 
score1= accuracy_score(outputTest, predictions1) #relationship between predictions and outputTest  

#display the score 
score1
Out[68]:
0.5218045112781955

Model 2: Singular Value Decomposition (SVD)

SVD factors the data-frame matrix into three smaller matrices (U, Sigma, V^T) that, when multiplied together, reconstruct the original matrix. These matrices capture the data's latent features and the strength of those features, which makes SVD useful for dimensionality reduction.
[Image: SVD factorization diagram]
https://towardsdatascience.com/understanding-singular-value-decomposition-and-its-application-in-data-science-388a54be95d
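As a minimal illustration of the factorization itself (separate from the surprise-based model used below), NumPy's linalg.svd can be used to check that the three factors multiply back to the original matrix; this sketch is not part of the original notebook.

import numpy as np

#factor a small example matrix A into U, Sigma, and V^T
A= np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
U, sigma, Vt= np.linalg.svd(A, full_matrices=False)

#multiplying the three factors back together recovers A (up to rounding)
print(np.allclose(A, U @ np.diag(sigma) @ Vt))  #True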

SVD Model Code

In [73]:
from surprise import SVD, Dataset, Reader
#surprise has its own train/test splitter; this intentionally shadows the sklearn version imported earlier
from surprise.model_selection import train_test_split

#create a reader with the range of classes, 1-9  
reader= Reader(rating_scale= (1,9)) 

#surprise expects exactly three columns with the rating (here, Class) listed last  
svdData=dataEncoding[["geneEn", 'variationEn', 'Class']]  

#create the data for the model using the reader  
svd=Dataset.load_from_df(svdData, reader)  

#split the data into train and test sets  
svdTrain, svdTest= train_test_split(svd, test_size=.2)  

#create the SVD model with 50 latent features 
model= SVD(n_factors=50)  

#fit the model to the svd training data 
model.fit(svdTrain)
Out[73]:
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f99d25cea00>

SVD Code Continued

In [75]:
#determine the length of the test set 
length= len(svdTest) 

#create an array of the true class answers from the test set 
#(each svdTest entry is a (geneEn, variationEn, Class) tuple, so index 2 is the class)
answers=[] 
i=0 
while i<=length-1: 
    answers.append(svdTest[i][2]) 
    i=i+1  
    
In [78]:
#generate predictions for every (geneEn, variationEn) pair in the test set
testing= model.test(svdTest)

#create an array of the class predictions from the svd test set
predictions=[] 
i=0 
while i<=length-1:  
    #each prediction's estimate is its fourth field; round it to the nearest class 
    predictions.append(round(testing[i][3]))
    i=i+1 
    

The SVD model for this data yields roughly 10-20% accuracy in predicting the class of genetic mutation (about 20% in the run shown below).

SVD is most useful for dimensionality reduction when there are many columns relative to the number of rows. Since this data has many rows and only a few columns, that may explain the model's low accuracy.

In [80]:
#determine the accuracy between the class answers and class predictions from the svd model 
score=accuracy_score(predictions, answers) 
print(score)
0.20150375939849624

Model 3: Random Forest

Similar to a decision tree, but a random forest builds multiple decision trees and then merges their predictions.
[Image: random forest diagram]

https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/

Random forest yields roughly 60-63% accuracy in predicting the class of genetic mutation.

Random Forest Model Code:

In [82]:
#MODEL 3 IS THE RANDOM FOREST MODEL; it reuses inputTrain, outputTrain, inputTest, and outputTest from model 1 
from sklearn.ensemble import RandomForestClassifier
model3=RandomForestClassifier(n_estimators=100) 

#have the model fit the train data
model3.fit(inputTrain, outputTrain) 

#have the model make predictions based off our test inputs
predictions3= model3.predict(inputTest) 

#determine the accuracy of the predictions 
score3= accuracy_score(outputTest, predictions3) 

#print the score  
score3
Out[82]:
0.6345864661654136

Model 4: Logistic Regression

Logistic regression is a statistical model that relates the inputs to the probability of each class through the log-odds (logit) function.
[Image: logistic regression illustration]

https://scipython.com/blog/logistic-regression-for-image-classification/

Logistic regression yields roughly 33-35% accuracy in predicting the class (1-9) of genetic mutations.

Logistic Regression Model Code:

In [85]:
#MODEL 4 IS THE LOGISTIC REGRESSION MODEL
from sklearn.linear_model import LogisticRegression
model4=LogisticRegression(solver='liblinear', random_state=0) 

#have the model fit our train data
model4.fit(inputTrain, outputTrain) 

#inspect the fitted intercepts and coefficients
intercept= model4.intercept_
b=model4.coef_ 

#have the model make predictions based off our test input
predictions4= model4.predict(inputTest) 


score4=accuracy_score(predictions4, outputTest) 

score4
Out[85]:
0.35037593984962406

Model 5: K Nearest Neighbor

K nearest neighbor predicts the class of a point by looking at the category of its nearest "neighbor", i.e. the closest data point (here k = 1).
[Image: k nearest neighbor illustration]

https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/

K nearest neighbor yields roughly 44-47% accuracy in predicting the class (1-9) of genetic mutation.

K Nearest Neighbor Model Code:

In [88]:
#MODEL 5 IS THE K NEAREST NEIGHBOR MODEL 
from sklearn.neighbors import KNeighborsClassifier  

#classify each test point by the class of its single closest training point (k=1)
model5= KNeighborsClassifier(n_neighbors=1)  
model5.fit(inputTrain, outputTrain)
predictions5= model5.predict(inputTest) 

#determine the accuracy of the predictions
score5=accuracy_score(predictions5, outputTest) 
score5
Out[88]:
0.47218045112781953

Summary of Results

Random Forest provided the best model, while SVD and Logistic Regression performed the worst.

[Image: summary of model accuracy results]
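The original summary image is not reproduced here. As a rough substitute, the accuracy scores computed in the cells above could be collected into one table; this is a sketch that assumes score1, score (SVD), score3, score4, and score5 are still in scope.

import pandas as pd

#collect each model's test accuracy into one dataframe for comparison
summary= pd.DataFrame({
    "Model": ["Decision Tree", "SVD", "Random Forest", "Logistic Regression", "K Nearest Neighbor"],
    "Accuracy": [score1, score, score3, score4, score5]
}).sort_values("Accuracy", ascending=False)
summary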

How this Study Can be Furthered

  • finding more models: whether by applying a neural network model or by separating the data even further and applying different models to subsets of the data
  • simplifying the text data even further: finding other ways to capture its usefulness instead of just label encoding it (one possibility is sketched below)
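One such possibility, offered purely as an illustration and not something done in this study, would be to represent each cleaned clinical text with TF-IDF features (reduced to a manageable number of columns) instead of a single label-encoded value.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

#turn each cleaned clinical text into a TF-IDF vector, then reduce it to 50 columns
#that could stand in for the single textEn feature
tfidf= TfidfVectorizer(max_features=5000)
textMatrix= tfidf.fit_transform(trainText["Text"].astype(str))
textFeatures= TruncatedSVD(n_components=50).fit_transform(textMatrix)
print(textFeatures.shape)  #e.g. (3321, 50)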

Challenges Throughout the Study

  • analyzing the text data: it had to be separated, cleaned, merged, and encoded before any of the models could be applied
  • separating into train and test sets: scikit-learn's train_test_split function was used because the classes occur with very different counts (a stratified variant is sketched after this list)
  • finding a model that worked for this data: since every gene/variation combination is unique, it is hard to train a model on one set and then apply it to a test set that differs in many ways
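For the splitting challenge, one common remedy, shown here only as an illustrative sketch rather than something used in this study, is to stratify the split so each class keeps roughly the same proportion of rows in the train and test sets.

from sklearn.model_selection import train_test_split

#stratified split: each of the nine classes is represented in the same proportion
#in the train and test sets despite the imbalanced class counts
stratInputTrain, stratInputTest, stratOutputTrain, stratOutputTest= train_test_split(
    Input, Output, test_size=.2, stratify=Output, random_state=0)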


Sources

Bagheri, Reza. (2020, January 9). Understanding Singular Value Decomposition and Its Application in Data Science. Towards Data Science. https://towardsdatascience.com/understanding-singular-value-decomposition-and-its-application-in-data-science-388a54be95d

Kaggle. (2017). Personalized Medicine: Redefining Cancer Treatment. https://www.kaggle.com/c/msk-redefining-cancer-treatment

Logistic Regression for Image Classification. (2020, September 3). https://scipython.com/blog/logistic-regression-for-image-classification/

Sharma, Abhishek. (2020, May 12). Decision Tree vs. Random Forest: Which Algorithm Should You Use? Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/

Sottoriva, Andrea. Computational Biology. Human Technopole. https://humantechnopole.it/en/research-centres/computational-biology/

Srivastava, Tavish. (2018, March 26). Introduction to k-Nearest Neighbors: A Powerful Machine Learning Algorithm. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/