Data Science Work Samples
Machine Learning
Support Vector Machine with Feature Selection | Python
Background:
We were provided a simulated dataset of single nucleotide polymorphism (SNP) genotype data containing 29,623 SNPs (total features). Fifteen of these SNPs are causal, meaning that they and their neighboring SNPs discriminate between cases and controls, while the remaining SNPs are noise. The training set contains 4,000 cases and 4,000 controls.
Goal:
My task was to predict the labels of 2,000 test individuals whose true labels were known only to our professor. The goal was not only to predict the test labels with the highest possible accuracy but also to do so using the fewest features. The project code is written in Python.
The dataset and label files are as follows:
- Training dataset: traindata.gz
- Training labels: trueclass
- Test dataset: testdata.gz
I wrote code to compute Pearson correlation for selecting the top 15 features and used the SVM module from Python's sklearn package for the final classification.
Achievement:
I implemented the Pearson correlation code in Python to obtain the 15 best features, trained a Support Vector Machine (SVM, linear kernel) classifier on them, and achieved a cross-validation accuracy of 64%.
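A minimal sketch of this approach is shown below. It is not the exact project code: the file layout (rows = individuals, columns = SNP genotypes), the label encoding, and the 5-fold cross-validation are assumptions.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Assumed layout: rows = individuals, columns = SNP genotype values.
X = np.loadtxt("traindata.gz")   # 8000 x 29623 genotype matrix (assumed)
y = np.loadtxt("trueclass")      # assumed: one 0/1 label per individual

# Pearson correlation of every SNP column with the class label.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12
)

# Keep the 15 SNPs with the largest absolute correlation.
top15 = np.argsort(np.abs(corr))[-15:]

# Linear-kernel SVM evaluated with 5-fold cross-validation.
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X[:, top15], y, cv=5)
print("Cross-validation accuracy: %.2f" % scores.mean())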
Link to my Code:
https://github.com/samtha83/Machine-Learning-Algorithms/tree/master/Project
Data Visualization
Dashboard/Analysis | TABLEAU
Background:
The idea of this project was to learn how to tell a story with our visualization skills. I learned how to extract insights from a given dataset by applying basic statistics, display those insights in meaningful graphs, and in the process develop an interesting story.
We chose the Rio Olympics data and focused on a few key questions:
- What makes the top athletes special?
- Is there a metric we can use to see which athletes are more likely to earn a medal?
We looked at trends related to age, country, and gender.
Goal:
To learn the TABLEAU tool to analyze data and build graphs.
Achievement:
I analyzed the Rio Olympics dataset of 11,500 records, each containing an athlete's data, and used Tableau to build a dashboard that shows the top winners by age, gender, and country.
Link to the dashboard:
https://public.tableau.com/views/DataVisualizationProject-Group6/Story1?:embed=y&:display_count=yes
Data Mining
Naive Bayes & Decission Tree Implementation| Rapid Miner
Background:
Employee Attrition: Causes of employee attrition can be as varied as human personalities, but some basic factors underlie most resignations. In this study, we focus on a fictional dataset created by IBM data scientists and hosted on Kaggle.com. The dataset contained 35 attributes, of which we cleaned and used only 23; we removed attributes that mostly had the same value across the entire dataset or whose values were irrelevant. The resulting dataset has 23 attributes and 1,470 records. We loaded the data into the RapidMiner application and applied pre-built models (algorithms) from the application's library.
Goal:
Apply two classification algorithms, Naive Bayes and Decision Trees, and determine which yields the more accurate model for detecting whether employees will leave or stay with the company. Attrition is our class variable and has two values: yes (the employee has left) and no (the employee is retained).
Achievement:
Our aim was to create a model that predicts whether or not an employee will attrite based on our known variables. There are two possible predicted classes: "yes" and "no".
The essence of a confusion matrix for the attrition scenario can be summarized as below, taking the class "No" as the positive class.
True Positive (TP): employees predicted as No who, in reality, did not leave the company.
False Positive (FP): employees predicted as No who, in reality, did leave the company.
True Negative (TN): employees predicted as Yes who, in reality, did leave the company.
False Negative (FN): employees predicted as Yes who, in reality, did not leave the company.
             No (Pred)             Yes (Pred)
No (True)    TP                    FN (Type II error)
Yes (True)   FP (Type I error)     TN
Here,
Recall: TP/(TP+FN)
Precision: TP/(TP+FP)
An organization would like to know which of its employees will continue to work at the company for a longer period. HR is more interested in identifying candidates whose likelihood of staying is high than in those whose chances of attrition are high. Thus we want a model that captures as many true positive cases as possible.
Also, a company would be concerned if the percentage of attrition among current employees is high. It makes more sense for a company to detect all employees who will actually leave, even if in the process some employees who do not leave are incorrectly predicted as "Yes". Thus we want a model with a high TN count, so that as few actual leavers as possible are missed (i.e., predicted as No).
Hence we need a model with better recall, even at the cost of precision. The recall of the Naive Bayes model is better for both employees who stay (label: No) and those who leave (label: Yes), whereas the Decision Tree model achieves good recall only for employees who stay (label: No).
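As a small illustration of how these metrics are computed, the Python snippet below derives the confusion matrix, recall, and precision with "No" as the positive class. The label vectors here are made up for illustration only; the actual models and evaluation were built in RapidMiner.

from sklearn.metrics import confusion_matrix, recall_score, precision_score

# Hypothetical true and predicted attrition labels, for illustration only.
y_true = ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No"]
y_pred = ["No", "No", "No",  "No", "Yes", "Yes", "Yes", "No"]

# Rows = true class, columns = predicted class, with "No" listed first.
cm = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tp, fn = cm[0, 0], cm[0, 1]
fp, tn = cm[1, 0], cm[1, 1]

# Recall and precision treating "No" (employee stays) as the positive class.
recall = recall_score(y_true, y_pred, pos_label="No")        # TP / (TP + FN)
precision = precision_score(y_true, y_pred, pos_label="No")  # TP / (TP + FP)
print(cm)
print("Recall:", recall, "Precision:", precision)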
Link to the complete report:
Data Mining Project - Group 6.pdf
Big Data
Hadoop Installation for Flight Data Analysis | Amazon Web Services
Background:
We had to analyze the Airline On-time Performance dataset (flight data) covering the period from October 1987 to April 2008, downloaded from http://statcomputing.org/dataexpo/2009/the-data.html. Since this dataset qualifies as big data, I had to use big data tools and techniques for the analysis.
Goal:
1. Install Hadoop/Oozie on our AWS VMs.
2. Design, implement, and run an Oozie workflow to find out:
a. the 3 airlines with the highest and lowest probability, respectively, of being on schedule;
b. the 3 airports with the longest and shortest average taxi time per flight (both in and out), respectively;
c. the most common reason for flight cancellations.
Achievement:
We set up a cluster of 8 nodes on AWS and installed Hadoop. We wrote 3 MapReduce programs that run in fully distributed mode, one for each problem, and set up an Oozie workflow to run the 3 programs in serial order (a streaming-style sketch of one such program is shown below).
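The sketch below shows, in Hadoop Streaming style, the kind of MapReduce logic used for the on-schedule probability per airline. It is an illustration rather than the project's actual code: the column positions follow the published flight-data schema (UniqueCarrier in column 9, ArrDelay in column 15), and "on schedule" is taken here as arrival delay <= 0; both are assumptions to verify.

#!/usr/bin/env python
# mapper.py -- emits (carrier, 1/0), where 1 means the flight arrived on schedule.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if fields[0] == "Year":          # skip header rows
        continue
    carrier, arr_delay = fields[8], fields[14]   # assumed column positions
    if arr_delay == "NA":
        continue
    on_time = 1 if int(arr_delay) <= 0 else 0    # assumed on-time definition
    print("%s\t%d" % (carrier, on_time))

#!/usr/bin/env python
# reducer.py -- aggregates the on-time fraction per carrier (input sorted by key).
import sys

current, on_time, total = None, 0, 0
for line in sys.stdin:
    carrier, flag = line.strip().split("\t")
    if carrier != current:
        if current is not None:
            print("%s\t%.4f" % (current, on_time / float(total)))
        current, on_time, total = carrier, 0, 0
    on_time += int(flag)
    total += 1
if current is not None:
    print("%s\t%.4f" % (current, on_time / float(total)))

The reducer output can then be sorted to pick the 3 airlines with the highest and lowest on-schedule probability; the taxi-time and cancellation-reason jobs follow the same mapper/reducer pattern with different columns and aggregations.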
We tested the efficiency of our process by recording execution times as follows:
- Ran the workflow on the entire dataset (all 22 years, 1987 to 2008) first on two VMs and then gradually increased the system scale to the maximum allowed number of VMs over at least 5 increments, measuring the workflow execution time at each step. We concluded that performance is directly proportional to the amount of resources used to process big data.
- Ran the workflow on the data in a progressive manner with an increment of 1 year, i.e. the first year (1987), the first 2 years (1987-1988), the first 3 years (1987-1989), ..., up to all 22 years (1987-2008), on the maximum allowed number of VMs, measuring the workflow execution time at each step. We concluded that performance is inversely proportional to the size of the input data.
Link to project code:
Neural Networks
Image Data Analysis | Convolutional Neural Network (CNN), Python
Background:
I had to implement the paper "Learning Physical Intuition of Block Towers by Example" (by Adam Lerer, Sam Gross, and Rob Fergus of Facebook AI Research). The objective of the project was to analyze a large dataset of images of 3 or 4 blocks stacked on top of each other and build an algorithm that, given a new image of blocks, determines whether the tower of blocks will fall over or not. This is a binary classification problem based on the block configuration.
Since the dataset used by the authors was not publicly available, I used a similar dataset provided by an MIT paper: http://blocks.csail.mit.edu.
The dataset includes both real and synthetic data and records human and machine responses for each. However, I used only the synthetic data and recorded the system's responses.
Goal:
To implement not only a CNN but also baseline models such as Support Vector Machine and Random Forest to detect whether a block tower will fall or not.
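For the baselines, a minimal sketch is shown below: the images are flattened into pixel vectors and fed to an SVM and a Random Forest. The array shapes, placeholder data, and train/test split are assumptions for illustration; the actual preprocessing may differ.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholders standing in for the real image array (assumed 32x32x3) and
# labels (assumed 0 = tower stays up, 1 = tower falls).
X = np.random.rand(200, 32, 32, 3)
y = np.random.randint(0, 2, size=200)

X_flat = X.reshape(len(X), -1)   # flatten pixels for the baseline models
X_tr, X_te, y_tr, y_te = train_test_split(X_flat, y, test_size=0.2, random_state=0)

svm = SVC(kernel="linear").fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("SVM accuracy:", svm.score(X_te, y_te))
print("Random Forest accuracy:", rf.score(X_te, y_te))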
Achievement:
I developed the following CNN architecture (a code sketch follows the list):
- Epochs = 50, batch size = 4
- Images resized to 32 x 32 x 3
- 64 convolutional filters of size 3x3, activation = relu
- 32 convolutional filters of size 3x3, activation = relu
- MaxPooling layer of size 2x2
- Flatten layer
- Dense layer of 256 neurons
- Dropout of 0.5
- Final Dense layer with one unit per class, activation = softmax
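A Keras-style sketch of this architecture is given below. The layer order and hyperparameters follow the list above; the optimizer, loss, dense-layer activation, and input pipeline are assumptions rather than the exact project settings.

from tensorflow.keras import layers, models

def build_model(num_classes=2):
    # CNN mirroring the list above: two conv layers, max pooling,
    # a dense layer with dropout, and a softmax output.
    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training settings from the list above (labels assumed one-hot encoded):
# model = build_model()
# model.fit(x_train, y_train, epochs=50, batch_size=4, validation_split=0.1)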
I drew the following conclusions:
- The CNN described in the paper achieves an accuracy of 86% on synthetic image data, and the paper states that accuracy increases as the training set grows.
- My CNN model's accuracy is lower than that reported in the paper.
- However, I was also able to confirm that a CNN improves its performance as the training data increases.
- The baseline models' performance does not change with the size of the training set.
Link to my presentation: Learning Physical Intuition of Block Towers by Example
Link to my code: https://github.com/samtha83/CS-698-/tree/master/Project-CNNs