The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation.
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
TITANIC_PATH = os.path.join("/Users/yanghehuo/Documents/coursra/ML/datasets","Titan")
import pandas as pd
def read_data(file_name, path):
    csv_path = os.path.join(path, file_name)
    return pd.read_csv(csv_path)
train_data = read_data("Titan.csv",TITANIC_PATH)
test_data = read_data("test.csv",TITANIC_PATH)
Let's take a peek at the top few rows of the dataset.
train_data.head()
The attributes have the following meanings:
Let's get more information to see how much data is missing.
train_data.info()
The Cabin, Age, and Embarked attributes have missing values. Cabin is the worst case: about 77% of its values are null, so we will ignore it for now and focus on the others. Age has about 19% null values; replacing them with the median seems promising.
The Name and Ticket attributes are hard to convert into useful numbers that an algorithm can consume, so we may ignore them for now as well.
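The missing-value percentages quoted above can be computed directly rather than read off `info()`. A minimal sketch, using a small made-up DataFrame (the `toy` frame and its values are illustrative, not the Titanic data), that also shows median imputation for a numerical column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Median imputation for a numerical column such as Age.
age_filled = toy["Age"].fillna(toy["Age"].median())
print(age_filled)
```

On the real data, the same two expressions would be applied to `train_data` instead of `toy`.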
Let's look at the numerical variables:
train_data.describe()
About 38% of the passengers survived. The classes are therefore only moderately imbalanced, so accuracy is a reasonable performance measure.
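One way to put accuracy in context is the majority-class baseline: a model that always predicts the most common class. The sketch below uses made-up labels with a 38% positive rate (not the real `Survived` column) to show the computation.

```python
import pandas as pd

# Hypothetical labels with 38% positives, mimicking the survival rate quoted above.
y = pd.Series([1] * 38 + [0] * 62, name="Survived")

# Share of each class, and the accuracy of always predicting the majority class.
class_share = y.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any classifier worth keeping should beat this baseline by a clear margin.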
Let's check that the Survived variable only takes the values 0 and 1.
train_data['Survived'].value_counts()
train_data['Sex'].value_counts()
train_data['Embarked'].value_counts()
# Definition of the CategoricalEncoder class, copied from PR #9151.
# Just run this cell, or copy it to your code, no need to try to
# understand every line.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features as a numeric array.

    The input to this transformer should be a matrix of integers or strings,
    denoting the values taken on by categorical (discrete) features.
    The features can be encoded using a one-hot aka one-of-K scheme
    (``encoding='onehot'``, the default) or converted to ordinal integers
    (``encoding='ordinal'``).

    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.

    Read more in the :ref:`User Guide <preprocessing_categorical_features>`.

    Parameters
    ----------
    encoding : str, 'onehot', 'onehot-dense' or 'ordinal'
        The type of encoding to use (default is 'onehot'):

        - 'onehot': encode the features using a one-hot aka one-of-K scheme
          (or also called 'dummy' encoding). This creates a binary column for
          each category and returns a sparse matrix.
        - 'onehot-dense': the same as 'onehot' but returns a dense array
          instead of a sparse matrix.
        - 'ordinal': encode the features as ordinal integers. This results in
          a single column of integers (0 to n_categories - 1) per feature.

    categories : 'auto' or a list of lists/arrays of values.
        Categories (unique values) per feature:

        - 'auto' : Determine categories automatically from the training data.
        - list : ``categories[i]`` holds the categories expected in the ith
          column. The passed categories are sorted before encoding the data
          (used categories can be found in the ``categories_`` attribute).

    dtype : number type, default np.float64
        Desired dtype of output.

    handle_unknown : 'error' (default) or 'ignore'
        Whether to raise an error or ignore if an unknown categorical feature
        is present during transform (default is to raise). When this parameter
        is set to 'ignore' and an unknown category is encountered during
        transform, the resulting one-hot encoded columns for this feature
        will be all zeros.
        Ignoring unknown categories is not supported for
        ``encoding='ordinal'``.

    Attributes
    ----------
    categories_ : list of arrays
        The categories of each feature determined during fitting. When
        categories were specified manually, this holds the sorted categories
        (in order corresponding with output of `transform`).

    Examples
    --------
    Given a dataset with three features and two samples, we let the encoder
    find the maximum value per feature and transform the data to a binary
    one-hot encoding.

    >>> from sklearn.preprocessing import CategoricalEncoder
    >>> enc = CategoricalEncoder(handle_unknown='ignore')
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    ... # doctest: +ELLIPSIS
    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
              encoding='onehot', handle_unknown='ignore')
    >>> enc.transform([[0, 1, 1], [1, 0, 4]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])

    See also
    --------
    sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
      integer ordinal features. The ``OneHotEncoder`` assumes that input
      features take on values in the range ``[0, max(feature)]`` instead of
      using the unique values.
    sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of
      dictionary items (also handles string-valued features).
    sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
      encoding of dictionary items or strings.
    """

    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown
    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.

        Parameters
        ----------
        X : array-like, shape [n_samples, n_feature]
            The data to determine the categories of each feature.

        Returns
        -------
        self
        """
        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            # Report the offending encoding value (not handle_unknown)
            raise ValueError(template % self.encoding)
        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)
        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")
        # The np.object alias was removed in NumPy 1.24; use the builtin object
        X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape
        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]
        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))
        self.categories_ = [le.classes_ for le in self._label_encoders_]
        return self
    def transform(self, X):
        """Transform X using one-hot encoding.

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.

        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        # The np.object, np.int and np.bool aliases were removed in NumPy 1.24;
        # the builtin types are equivalent
        X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=int)
        X_mask = np.ones_like(X, dtype=bool)
        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])
            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])
        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)
        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)
        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]
        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
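For readers on scikit-learn 0.20 or later, the built-in `OneHotEncoder` accepts string categories directly and may be a drop-in alternative to the class above. The sketch below is an illustration on made-up data, not the notebook's own pipeline; the arrays `X_cat` and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (made up): the columns play the role of Sex and Embarked.
X_cat = np.array([["male", "S"], ["female", "C"], ["female", "S"]], dtype=object)

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(X_cat).toarray()  # sparse by default, so densify

print(encoder.categories_)
print(X_onehot)

# "Q" was not seen during fit, so its block of columns stays at zero.
X_new = encoder.transform(np.array([["male", "Q"]], dtype=object)).toarray()
print(X_new)
```

Calling `.toarray()` on the default sparse output plays the same role as `encoding='onehot-dense'` in the class above.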
# AgeBucket
train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()
# relatives onboard
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()
# extract the title from the Name attribute
train_data['Title'] = train_data['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
train_data['Title'].value_counts()
train_data["Title"] = train_data["Title"].replace(['Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Others')
train_data["Title"] = train_data["Title"].replace(["Miss","Ms", "Mme", "Mlle","Mrs", 'Countess','Lady', 'the Countess'],'Mme')
train_data['Title'].value_counts()
# create a new feature called "IsAlone"
train_data['IsAlone'] = 0
train_data.loc[train_data['RelativesOnboard'] == 0, 'IsAlone'] = 1
train_data[["IsAlone", "Survived"]].groupby(['IsAlone']).mean()
train_data.head()
# extract the deck letter from Cabin (stored in the "Desk" column)
train_data["Cabin"][train_data["Cabin"].notnull()].head()
# Replace the cabin number with its first letter (the deck), or 'X' if the cabin is missing
train_data["Desk"] = pd.Series([i[0] if not pd.isnull(i) else 'X' for i in train_data['Cabin'] ])
train_data.head()
Build the data preparation pipeline.
# Data Frame Selector
from sklearn.base import BaseEstimator, TransformerMixin
# A class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]
# A class to impute the categorical variables, by replacing missing values with the most frequent value
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent = pd.Series([X[c].value_counts().index[0] for c in X.columns],
                                       index=X.columns)
        return self
    def transform(self, X):
        return X.fillna(self.most_frequent)
from sklearn.pipeline import Pipeline
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer (available since 0.20) is its replacement
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
num_attributes = ['RelativesOnboard', 'Fare']
cat_attributes = ['Pclass', 'Sex', 'Embarked', 'AgeBucket', 'Title', 'IsAlone', 'Desk'] # 3, 2, 3, 6, 5, 2, 9
num_pipeline = Pipeline([
    ("DataFrameSelector", DataFrameSelector(num_attributes)),
    ("Imputer", SimpleImputer(strategy='median')),
    ("std_scaler", StandardScaler())
])
cat_pipeline = Pipeline([
    ("DataFrameSelector", DataFrameSelector(cat_attributes)),
    ("MostFrequentImputer", MostFrequentImputer()),
    ("cat_encoder", CategoricalEncoder(encoding='onehot-dense'))
])
Combine the two pipelines
from sklearn.pipeline import FeatureUnion
preprocess_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
X_train = preprocess_pipeline.fit_transform(train_data)
X_train.shape
# get the label (target)
y_train = train_data["Survived"]
I compared nine popular classifiers and evaluated the mean accuracy of each using stratified k-fold cross-validation.
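Stratification means each fold keeps roughly the same class proportions as the full training set, which matters when the classes are not perfectly balanced. A minimal sketch on made-up labels (the arrays `X` and `y` below are illustrative, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 40 positives and 60 negatives.
y = np.array([1] * 40 + [0] * 60)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)

# Share of positives in the held-out part of each fold.
fold_positive_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_positive_rates)
```

Every held-out fold here should contain the same 40% share of positives as the full label vector.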
#Common Model Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
#Common Model Helpers
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve, cross_validate
# Cross-validate each model with stratified k-fold
kfold = StratifiedKFold(n_splits=5)
# Initialize the algorithms with default parameters
random_state = 2
classifiers = [
    LogisticRegression(random_state=random_state),
    LinearDiscriminantAnalysis(),
    KNeighborsClassifier(),
    SVC(random_state=random_state),
    DecisionTreeClassifier(random_state=random_state),
    RandomForestClassifier(random_state=random_state),
    AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state, learning_rate=0.1),
    GradientBoostingClassifier(random_state=random_state),
    MLPClassifier(random_state=random_state)
]
# Save the cross-validation results in a table to compare the classifiers
res_columns = ['CLF Name', 'CLF Parameters', 'CLF Test Accuracy Mean', 'CLF Test Accuracy STD']
res_compare = pd.DataFrame(columns=res_columns)
res_predict = y_train.copy()
row_index = 0
for CLF in classifiers:
    CLF_name = CLF.__class__.__name__
    res_compare.loc[row_index, 'CLF Name'] = CLF_name
    res_compare.loc[row_index, 'CLF Parameters'] = str(CLF.get_params())
    cv_result = cross_val_score(CLF, X_train, y=y_train, scoring="accuracy", cv=kfold, n_jobs=2)
    res_compare.loc[row_index, 'CLF Test Accuracy Mean'] = cv_result.mean()
    res_compare.loc[row_index, 'CLF Test Accuracy STD'] = cv_result.std()
    # CLF.fit(X_train, y_train)
    # res_predict[CLF_name] = CLF.predict(X_train)
    row_index += 1
res_compare.sort_values(by=['CLF Test Accuracy Mean'], ascending=False, inplace=True)
res_compare
sns.barplot(x='CLF Test Accuracy Mean', y = 'CLF Name', data = res_compare, palette="Set3")
Let's perform a grid search optimization for AdaBoost, RandomForest, GradientBoosting and SVC classifiers.
# AdaBoost
DTC = DecisionTreeClassifier()
adaDTC = AdaBoostClassifier(DTC, random_state=2)
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; on recent
# releases the first two keys become "estimator__criterion" and "estimator__splitter"
ada_param = [{
    "base_estimator__criterion": ["gini", "entropy"],
    "base_estimator__splitter": ["best", "random"],
    "n_estimators": [30, 100, 300],
    "learning_rate": [0.01, 0.03, 0.1, 0.3]
}]
gs_adaDTC = GridSearchCV(adaDTC, param_grid=ada_param, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gs_adaDTC.fit(X_train, y_train)
gs_adaDTC.best_params_
# Random Forest
RFC = RandomForestClassifier()
rf_param = [
    {"bootstrap": [False], 'n_estimators': [30, 100, 300], 'max_features': [3, 10, 30]}
]
gs_RFC = GridSearchCV(RFC, param_grid=rf_param, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gs_RFC.fit(X_train, y_train)
gs_RFC.best_params_
# Gradient boosting
GBC = GradientBoostingClassifier()
gb_param = [{
    'n_estimators': [100, 300],
    'learning_rate': [0.01, 0.03, 0.1],
    'max_depth': [3, 10],
}]
gs_GBC = GridSearchCV(GBC, param_grid=gb_param, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gs_GBC.fit(X_train, y_train)
gs_GBC.best_params_
# SVC
SVMC = SVC()
SVM_param = [{
    'kernel': ['rbf'],
    'gamma': [0.01, 0.03, 0.1],
    'C': [1, 3, 10, 30, 100],
    'probability': [True]
}]
gs_SVM = GridSearchCV(SVMC, param_grid=SVM_param, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gs_SVM.fit(X_train, y_train)
gs_SVM.best_params_
Let's plot the learning curve to see accuracy score against training size. It's a good way to see the overfitting effect.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot training and cross-validation scores against training set size."""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training set size")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    # Average the scores over the cross-validation folds
    train_scores_mean = np.mean(train_scores, axis=1)
    # train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    # test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # plt.fill_between(train_sizes, train_scores_mean - train_scores_std,train_scores_mean + train_scores_std, alpha=0.1,color="r")
    # plt.fill_between(train_sizes, test_scores_mean - test_scores_std,test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt
g = plot_learning_curve(gs_RFC.best_estimator_, "RF learning curves", X_train, y_train, cv=kfold)
g = plot_learning_curve(gs_SVM.best_estimator_, "SVC learning curves", X_train, y_train, cv=kfold)
g = plot_learning_curve(gs_adaDTC.best_estimator_, "AdaBoost learning curves", X_train, y_train, cv=kfold)
g = plot_learning_curve(gs_GBC.best_estimator_, "GradientBoosting learning curves", X_train, y_train, cv=kfold)
The SVM classifier seems to generalize best, since its training and cross-validation learning curves are close to each other.
AdaBoost, gradient boosting, and random forest seem to overfit the training set, since there is a big gap between the two curves. The cross-validation curve keeps rising as the training set grows, so these three algorithms would likely perform better with more training data.
Let's use a voting classifier to combine the predictions of the four tuned classifiers.
I passed "soft" to the voting parameter so that the predicted probability behind each vote is taken into account.
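The difference between hard and soft voting is easiest to see on a single prediction. The sketch below uses three made-up class-1 probabilities (hypothetical numbers, not output from the classifiers in this notebook) and a 0.5 decision threshold.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 from three classifiers
# for a single passenger.
probas = np.array([0.45, 0.40, 0.90])

# Hard voting: each classifier casts one vote for its own predicted class,
# and the majority wins.
hard_votes = (probas >= 0.5).astype(int)
hard_prediction = int(hard_votes.sum() > len(probas) / 2)

# Soft voting: average the probabilities first, then apply the threshold.
soft_prediction = int(probas.mean() >= 0.5)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Here two lukewarm "no" votes outvote one confident "yes" under hard voting, while soft voting lets the confident classifier tip the result. Soft voting needs every estimator to expose `predict_proba`, which is why the SVC grid above sets `probability=True`.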
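SVM_best = gs_SVM.best_estimator_
RFC_best = gs_RFC.best_estimator_
ada_best = gs_adaDTC.best_estimator_
GBC_best = gs_GBC.best_estimator_
votingC = VotingClassifier(estimators=[('rfc', RFC_best), ('svc', SVM_best),
                                       ('adac', ada_best), ('gbc', GBC_best)],
                           voting='soft', n_jobs=2)
votingC = votingC.fit(X_train, y_train)
# Assumptions: test_data carries the same engineered columns as train_data
# (AgeBucket, RelativesOnboard, Title, IsAlone, Desk), and the passenger IDs
# are stored in a "PassengerId" column.
X_test = preprocess_pipeline.transform(test_data)
IDtest = test_data["PassengerId"]
test_Survived = pd.Series(votingC.predict(X_test), name="Survived")
results = pd.concat([IDtest, test_Survived], axis=1)
results.to_csv("ensemble_python_voting.csv", index=False)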
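The test set needs the same engineered columns as the training set before it can go through `preprocess_pipeline`. One way to avoid repeating the steps by hand is to wrap them in a function. The helper name `add_features` and the two-row `toy_test` frame are hypothetical; the transformations mirror the feature-engineering cells above.

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Return a copy of df with the engineered columns used in this notebook."""
    out = df.copy()
    out["AgeBucket"] = out["Age"] // 15 * 15
    out["RelativesOnboard"] = out["SibSp"] + out["Parch"]
    # Title is the text between ", " and "." in the Name attribute.
    title = out["Name"].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    title = title.replace(['Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir',
                           'Jonkheer', 'Dona'], 'Others')
    title = title.replace(['Miss', 'Ms', 'Mme', 'Mlle', 'Mrs', 'Countess', 'Lady',
                           'the Countess'], 'Mme')
    out["Title"] = title
    out["IsAlone"] = (out["RelativesOnboard"] == 0).astype(int)
    # First letter of the cabin (the deck), or 'X' when the cabin is missing.
    out["Desk"] = out["Cabin"].str[0].fillna("X")
    return out

# Made-up rows in the shape of the raw data.
toy_test = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, np.nan],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Cabin": [np.nan, "C85"],
})
engineered = add_features(toy_test)
print(engineered[["AgeBucket", "RelativesOnboard", "Title", "IsAlone", "Desk"]])
```

With such a helper, the test features could be obtained with `preprocess_pipeline.transform(add_features(test_data))`. Note that the encoder in the pipeline raises on categories it did not see during fitting unless `handle_unknown='ignore'` is set.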