Preliminary analysis and Feature engineering

The main goal is to predict whether a passenger survived, based on attributes such as age, sex, ticket class, and port of embarkation.

In [38]:
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

TITANIC_PATH = os.path.join("/Users/yanghehuo/Documents/coursra/ML/datasets","Titan")
In [2]:
import pandas as pd
def read_data(file_name, path):
    csv_path = os.path.join(path, file_name)
    return pd.read_csv(csv_path)
train_data = read_data("Titan.csv",TITANIC_PATH)
test_data = read_data("test.csv",TITANIC_PATH)

Let's take a peek at the top few rows of the dataset:

In [3]:
train_data.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The attributes have the following meanings:

  • PassengerId: unique identifier of a passenger.
  • Survived: target variable. It takes two values: 0 means the passenger didn't survive, 1 means the passenger survived.
  • Pclass: indicates the ticket's class. 1 = 1st, 2 = 2nd, 3 = 3rd
  • Name, Sex, Age: self-explanatory
  • SibSp: number of siblings / spouses aboard the Titanic
  • Parch: number of parents / children aboard the Titanic
  • Ticket: ticket number
  • Fare: price paid
  • Cabin: cabin number
  • Embarked: where the passenger embarked on the Titanic. C = Cherbourg, Q = Queenstown, S = Southampton

Let's get more information to see how much data is missing:

In [4]:
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The Cabin, Age, and Embarked attributes have missing values. Cabin is the worst case: about 77% of its values are null, so we will ignore it for now and focus on the others. The Age attribute has about 20% null values (177 of 891); replacing them with the median seems promising. Embarked is missing only two values.

Name and Ticket variables are hard to convert to useful numbers that the algorithm can consume. So we may ignore them for now as well.
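
The missing-value shares follow directly from the non-null counts printed by `train_data.info()`. As a quick check, the sketch below recomputes them and then shows median imputation on a small made-up Age column (the toy ages are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Non-null counts copied from the train_data.info() output; 891 rows in total.
n_rows = 891
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})
missing_share = 1 - non_null / n_rows
print(missing_share.round(3))

# Median imputation on a toy column (values are made up for illustration).
toy_age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])
toy_filled = toy_age.fillna(toy_age.median())
print(toy_filled.tolist())
```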

Let's look at the numerical variables:

In [5]:
train_data.describe()
Out[5]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

About 38% of the passengers survived, so the two classes are only mildly imbalanced and accuracy is a reasonable performance measure.
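
One way to put an accuracy score in context is to compare it against a trivial baseline that always predicts the majority class. A minimal sketch, using only figures already shown in the `describe()` output (the variable names are chosen for this example):

```python
# Figures copied from the describe() output: 891 rows, mean of Survived = 0.383838.
n_total = 891
survival_rate = 0.383838
n_survived = round(n_total * survival_rate)

# A classifier that always predicts "did not survive" is right for every non-survivor.
baseline_accuracy = 1 - survival_rate
print(n_survived, f"{baseline_accuracy:.3f}")
```

Any model worth keeping should score clearly above this baseline of roughly 0.62.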

Let's check that the Survived variable takes only the values 0 and 1, and look at the categories of Sex and Embarked.

In [6]:
train_data['Survived'].value_counts()
Out[6]:
0    549
1    342
Name: Survived, dtype: int64
In [7]:
train_data['Sex'].value_counts()
Out[7]:
male      577
female    314
Name: Sex, dtype: int64
In [8]:
train_data['Embarked'].value_counts()
Out[8]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64
In [9]:
# Definition of the CategoricalEncoder class, copied from PR #9151.
# Just run this cell, or copy it to your code, no need to try to
# understand every line.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features as a numeric array.
    The input to this transformer should be a matrix of integers or strings,
    denoting the values taken on by categorical (discrete) features.
    The features can be encoded using a one-hot aka one-of-K scheme
    (``encoding='onehot'``, the default) or converted to ordinal integers
    (``encoding='ordinal'``).
    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.
    Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
    Parameters
    ----------
    encoding : str, 'onehot', 'onehot-dense' or 'ordinal'
        The type of encoding to use (default is 'onehot'):
        - 'onehot': encode the features using a one-hot aka one-of-K scheme
          (or also called 'dummy' encoding). This creates a binary column for
          each category and returns a sparse matrix.
        - 'onehot-dense': the same as 'onehot' but returns a dense array
          instead of a sparse matrix.
        - 'ordinal': encode the features as ordinal integers. This results in
          a single column of integers (0 to n_categories - 1) per feature.
    categories : 'auto' or a list of lists/arrays of values.
        Categories (unique values) per feature:
        - 'auto' : Determine categories automatically from the training data.
        - list : ``categories[i]`` holds the categories expected in the ith
          column. The passed categories are sorted before encoding the data
          (used categories can be found in the ``categories_`` attribute).
    dtype : number type, default np.float64
        Desired dtype of output.
    handle_unknown : 'error' (default) or 'ignore'
        Whether to raise an error or ignore if an unknown categorical feature is
        present during transform (default is to raise). When this parameter
        is set to 'ignore' and an unknown category is encountered during
        transform, the resulting one-hot encoded columns for this feature
        will be all zeros.
        Ignoring unknown categories is not supported for
        ``encoding='ordinal'``.
    Attributes
    ----------
    categories_ : list of arrays
        The categories of each feature determined during fitting. When
        categories were specified manually, this holds the sorted categories
        (in order corresponding with output of `transform`).
    Examples
    --------
    Given a dataset with three features and two samples, we let the encoder
    find the maximum value per feature and transform the data to a binary
    one-hot encoding.
    >>> from sklearn.preprocessing import CategoricalEncoder
    >>> enc = CategoricalEncoder(handle_unknown='ignore')
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    ... # doctest: +ELLIPSIS
    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
              encoding='onehot', handle_unknown='ignore')
    >>> enc.transform([[0, 1, 1], [1, 0, 4]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
    See also
    --------
    sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
      integer ordinal features. The ``OneHotEncoder`` assumes that input
      features take on values in the range ``[0, max(feature)]`` instead of
      using the unique values.
    sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of
      dictionary items (also handles string-valued features).
    sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
      encoding of dictionary items or strings.
    """

    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_feature]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """

        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.encoding)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        """Transform X using one-hot encoding.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=int)
        X_mask = np.ones_like(X, dtype=bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
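
To see what the 'onehot-dense' encoding produces without reading the whole class, here is a minimal NumPy-only sketch for a single column. It illustrates the idea and is not the class's actual implementation; the helper name `one_hot_dense` is made up for this example.

```python
import numpy as np

def one_hot_dense(column):
    """One-hot encode a 1-D array of labels; returns (categories, matrix)."""
    # np.unique returns the sorted categories and, for each entry,
    # the index of its category in that sorted array.
    categories, codes = np.unique(column, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i.
    return categories, np.eye(len(categories))[codes]

cats, encoded = one_hot_dense(np.array(["S", "C", "S", "Q"]))
print(cats)
print(encoded)
```

Each row has exactly one 1, placed in the column of that row's category, and the columns follow the sorted order of the categories.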

Feature Engineering

  • Try adding more promising features to the dataset, for example:
    • replace SibSp and Parch with their sum,
    • try to identify parts of names that correlate well with the Survived attribute (e.g. if the name contains "Countess", then survival seems more likely),
    • extract the first letter of Cabin (the deck) as the Desk feature; if Cabin is missing, use 'X'
  • Try to convert numerical attributes to categorical attributes: for example, different age groups had very different survival rates (see below), so it may help to create an age bucket category and use it instead of the age.
  • Similarly, it may be useful to have a special category for people traveling alone since only 30% of them survived (see below).
In [10]:
# AgeBucket
train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()
Out[10]:
Survived
AgeBucket
0.0 0.576923
15.0 0.362745
30.0 0.423256
45.0 0.404494
60.0 0.240000
75.0 1.000000
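
The expression `Age // 15 * 15` floors each age to the lower edge of its 15-year bucket, and missing ages stay missing. A small sketch on a few ages (0.42, 22, 38 and 80 all appear in the outputs above; the NaN stands for a passenger with unknown age):

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 22.0, 38.0, 80.0, np.nan])
buckets = ages // 15 * 15  # floor-divide by the bucket width, then scale back
print(buckets.tolist())
```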
In [11]:
# relatives onboard
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()
Out[11]:
Survived
RelativesOnboard
0 0.303538
1 0.552795
2 0.578431
3 0.724138
4 0.200000
5 0.136364
6 0.333333
7 0.000000
10 0.000000
In [12]:
# extract title from Name attribute
train_data['Title'] = train_data['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
train_data['Title'].value_counts()
Out[12]:
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Sir               1
Lady              1
Don               1
Ms                1
the Countess      1
Mme               1
Jonkheer          1
Capt              1
Name: Title, dtype: int64
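
The same extraction can also be written with a single regular expression. This is an alternative sketch, not the code used in this notebook: `Series.str.extract` captures the text between the comma and the first period. The sample names are copied from the `head()` output above.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])
# Capture everything after the comma up to (not including) the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```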
In [13]:
train_data["Title"] = train_data["Title"].replace(['Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Others')
train_data["Title"] = train_data["Title"].replace(["Miss","Ms", "Mme", "Mlle","Mrs", 'Countess','Lady', 'the Countess'],'Mme')
train_data['Title'].value_counts()
Out[13]:
Mr        517
Mme       313
Master     40
Others     21
Name: Title, dtype: int64
In [14]:
# create a new feature called "IsAlone"
train_data['IsAlone'] = 0
train_data.loc[train_data['RelativesOnboard'] == 0, 'IsAlone'] = 1

train_data[["IsAlone", "Survived"]].groupby(['IsAlone']).mean()
Out[14]:
Survived
IsAlone
0 0.505650
1 0.303538
In [15]:
train_data.head()
Out[15]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeBucket RelativesOnboard Title IsAlone
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 15.0 1 Mr 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 30.0 1 Mme 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 15.0 0 Mme 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 30.0 1 Mme 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 30.0 0 Mr 1
In [16]:
# extract the Desk from the Cabin
train_data["Cabin"][train_data["Cabin"].notnull()].head()
Out[16]:
1      C85
3     C123
6      E46
10      G6
11    C103
Name: Cabin, dtype: object
In [17]:
# Replace the cabin number with its first letter (the deck), or 'X' if Cabin is missing
train_data["Desk"] = pd.Series([i[0] if not pd.isnull(i) else 'X' for i in train_data['Cabin'] ])
In [18]:
train_data.head()
Out[18]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeBucket RelativesOnboard Title IsAlone Desk
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 15.0 1 Mr 0 X
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 30.0 1 Mme 0 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 15.0 0 Mme 1 X
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 30.0 1 Mme 0 C
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 30.0 0 Mr 1 X
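
The steps above modify `train_data` cell by cell. If the same features are needed on another frame with the same raw columns (for example `test_data`), they could be collected into one function. The sketch below is one possible way to do that; the function name `add_engineered_features` and the two constants are hypothetical, and the sample rows are copied from the `head()` output.

```python
import numpy as np
import pandas as pd

RARE_TITLES = ['Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
FEMALE_TITLES = ['Miss', 'Ms', 'Mme', 'Mlle', 'Mrs', 'Countess', 'Lady', 'the Countess']

def add_engineered_features(df):
    """Return a copy of df with the engineered columns built in this section."""
    out = df.copy()
    out["AgeBucket"] = out["Age"] // 15 * 15
    out["RelativesOnboard"] = out["SibSp"] + out["Parch"]
    title = out["Name"].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    out["Title"] = title.replace(RARE_TITLES, "Others").replace(FEMALE_TITLES, "Mme")
    out["IsAlone"] = (out["RelativesOnboard"] == 0).astype(int)
    # First letter of the cabin, or 'X' when the cabin is unknown.
    out["Desk"] = out["Cabin"].str[0].fillna("X")
    return out

sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Th...",
             "Heikkinen, Miss. Laina"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Cabin": [np.nan, "C85", np.nan],
})
engineered = add_engineered_features(sample)
print(engineered[["AgeBucket", "RelativesOnboard", "Title", "IsAlone", "Desk"]])
```

Working on a copy leaves the input frame untouched, so the function can be applied to the training and test frames independently.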

Data preparation pipeline

Build the data preparation pipeline.

  • DataFrameSelector: select specific attributes from data frame
  • Imputer for categorical variables (replaces each missing value with the most frequent value)
  • Imputer for numerical variables (available in sklearn)
  • Scaler (available in sklearn)
In [19]:
# Data Frame Selector

from sklearn.base import BaseEstimator, TransformerMixin

# A class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]
In [20]:
# A class to impute the categorical variable, by replacing the missing value with the most frequent value

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent = pd.Series([X[c].value_counts().index[0] for c in X.columns]
                                       ,index = X.columns)
        return self
    def transform(self, X):
        return X.fillna(self.most_frequent)
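
A small sketch of what this imputer computes, written with plain pandas on a made-up frame (the column values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S"],
    "Sex": ["male", None, "female", "male"],
})
# value_counts() sorts by frequency and skips missing values,
# so index[0] is the most frequent observed value of each column.
most_frequent = pd.Series([toy[c].value_counts().index[0] for c in toy.columns],
                          index=toy.columns)
# Passing a Series to fillna fills each column with the value stored under its name.
filled = toy.fillna(most_frequent)
print(filled)
```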
In [21]:
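from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
try:  # scikit-learn >= 0.20 (Imputer was removed in 0.22)
    from sklearn.impute import SimpleImputer as Imputer
except ImportError:
    from sklearn.preprocessing import Imputer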
In [22]:
num_attributes = ['RelativesOnboard','Fare']
cat_attributes = ['Pclass','Sex','Embarked','AgeBucket','Title','IsAlone','Desk'] # number of categories: 3, 2, 3, 6, 4, 2, 9
In [23]:
num_pipeline = Pipeline([
    ("DataFrameSelector",DataFrameSelector(num_attributes)),
    ("Imputer",Imputer(strategy='median')),
    ("std_scaler",StandardScaler())
])

cat_pipeline = Pipeline([
    ("DataFrameSelector",DataFrameSelector(cat_attributes)),
    ("MostFrequentImputer",MostFrequentImputer()),
    ("cat_encoder", CategoricalEncoder(encoding='onehot-dense'))
])

Combine the two pipelines

In [24]:
from sklearn.pipeline import FeatureUnion
preprocess_pipeline = FeatureUnion(transformer_list = [
    ('num_pipeline',num_pipeline),
    ('cat_pipeline',cat_pipeline)
])
In [25]:
X_train = preprocess_pipeline.fit_transform(train_data)
X_train.shape
Out[25]:
(891, 31)
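
The 31 columns can be accounted for by hand: one-hot encoding produces one column per category, and each numerical attribute contributes one column. A quick arithmetic check, with category counts read from the outputs above (the count of 9 for Desk comes from the comment next to `cat_attributes`):

```python
# Number of distinct categories per categorical attribute after imputation.
# Title has four levels after the grouping step (Mr, Mme, Master, Others).
n_categories = {"Pclass": 3, "Sex": 2, "Embarked": 3, "AgeBucket": 6,
                "Title": 4, "IsAlone": 2, "Desk": 9}
n_numeric = 2  # RelativesOnboard and Fare

n_columns = n_numeric + sum(n_categories.values())
print(n_columns)  # 31, matching X_train.shape[1]
```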
In [26]:
# get the label (target)
y_train = train_data["Survived"]

Modeling

Select and train a model

I compared nine popular classifiers and evaluated the mean accuracy of each using stratified k-fold cross-validation.

  • Logistic regression
  • Linear Discriminant Analysis
  • KNN
  • SVM
  • Decision Tree
  • Random Forest
  • AdaBoost
  • Gradient Boosting
  • Neural network (multilayer perceptron)
In [27]:
#Common Model Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

#Common Model Helpers
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve, cross_validate

# 
In [28]:
# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=5)
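
Stratification means that every fold keeps roughly the same share of survivors as the full training set. The sketch below checks this on labels with the same class counts as `y_train` (549 zeros and 342 ones); the feature matrix is a placeholder, since only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels with the same class counts as the training set.
y = np.array([0] * 549 + [1] * 342)
X = np.zeros((len(y), 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=5)
fold_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(np.round(fold_rates, 3))
```

Every printed rate should sit close to the overall survival rate of about 0.38.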
In [29]:
# Initialize different algorithms with default parameters
random_state = 2
classifiers = [
              LogisticRegression(random_state = random_state),
              LinearDiscriminantAnalysis(),
              KNeighborsClassifier(),
              SVC(random_state=random_state),
              DecisionTreeClassifier(random_state=random_state),
              RandomForestClassifier(random_state=random_state),
              AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1),
              GradientBoostingClassifier(random_state=random_state),
              MLPClassifier(random_state=random_state)
              ]

# save the cross-validation results
#create table to compare MLA metrics
res_columns = ['CLF Name', 'CLF Parameters','CLF Test Accuracy Mean', 'CLF Test Accuracy STD']
res_compare = pd.DataFrame(columns = res_columns)
res_predict = y_train.copy()

row_index = 0
for CLF in classifiers:
    CLF_name = CLF.__class__.__name__
    res_compare.loc[row_index, 'CLF Name'] = CLF_name
    res_compare.loc[row_index, 'CLF Parameters'] = str(CLF.get_params())
    cv_result = cross_val_score(CLF, X_train, y = y_train, scoring = "accuracy", cv = kfold, n_jobs=2)
    
    res_compare.loc[row_index, 'CLF Test Accuracy Mean'] = cv_result.mean() 
    res_compare.loc[row_index, 'CLF Test Accuracy STD'] = cv_result.std()
    
    # CLF.fit(X_train, y_train)
    # res_predict[CLF_name] = CLF.predict(X_train)
    
    row_index+=1

res_compare.sort_values(by = ['CLF Test Accuracy Mean'], ascending = False, inplace = True)
res_compare
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:388: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/neural_network/multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
Out[29]:
CLF Name CLF Parameters CLF Test Accuracy Mean CLF Test Accuracy STD
3 SVC {'C': 1.0, 'cache_size': 200, 'class_weight': ... 0.835031 0.0188776
1 LinearDiscriminantAnalysis {'n_components': None, 'priors': None, 'shrink... 0.832834 0.0187845
0 LogisticRegression {'C': 1.0, 'class_weight': None, 'dual': False... 0.828346 0.0184794
7 GradientBoostingClassifier {'criterion': 'friedman_mse', 'init': None, 'l... 0.827197 0.0189454
8 MLPClassifier {'activation': 'relu', 'alpha': 0.0001, 'batch... 0.81935 0.0174372
2 KNeighborsClassifier {'algorithm': 'auto', 'leaf_size': 30, 'metric... 0.808171 0.0280628
6 AdaBoostClassifier {'algorithm': 'SAMME.R', 'base_estimator__clas... 0.804769 0.0180252
4 DecisionTreeClassifier {'class_weight': None, 'criterion': 'gini', 'm... 0.800293 0.0196009
5 RandomForestClassifier {'bootstrap': True, 'class_weight': None, 'cri... 0.798046 0.0287012
In [32]:
sns.barplot(x='CLF Test Accuracy Mean', y = 'CLF Name', data = res_compare, palette="Set3")
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x10be8fb70>

Fine-tune the selected models

Let's perform a grid search optimization for AdaBoost, RandomForest, GradientBoosting and SVC classifiers.

In [33]:
# Adaboost
DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=2)

ada_param = [{
    "base_estimator__criterion" : ["gini", "entropy"],
    "base_estimator__splitter" :   ["best", "random"],
    "n_estimators" :[30,100,300],
    "learning_rate":  [0.01, 0.03, 0.1, 0.3]
}]

gs_adaDTC = GridSearchCV(adaDTC,param_grid = ada_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_adaDTC.fit(X_train,y_train)
gs_adaDTC.best_params_
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=2)]: Done  61 tasks      | elapsed:   16.3s
[Parallel(n_jobs=2)]: Done 211 tasks      | elapsed:  1.0min
[Parallel(n_jobs=2)]: Done 240 out of 240 | elapsed:  1.2min finished
Out[33]:
{'base_estimator__criterion': 'entropy',
 'base_estimator__splitter': 'best',
 'learning_rate': 0.3,
 'n_estimators': 100}
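
The log line "Fitting 5 folds for each of 48 candidates, totalling 240 fits" is simply the size of the Cartesian product of the grid, multiplied by the number of folds. A quick check in plain Python, without scikit-learn:

```python
from itertools import product

ada_param = {
    "base_estimator__criterion": ["gini", "entropy"],
    "base_estimator__splitter": ["best", "random"],
    "n_estimators": [30, 100, 300],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
}
n_folds = 5

# One candidate per combination of parameter values.
candidates = list(product(*ada_param.values()))
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

The same arithmetic explains the other searches in this section, and it is a useful way to estimate run time before adding values to a grid.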
In [35]:
# Random Forest
RFC = RandomForestClassifier()

rf_param = [
    {"bootstrap": [False],'n_estimators':[30,100,300],'max_features':[3,10,30]}
]

gs_RFC = GridSearchCV(RFC,param_grid = rf_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_RFC.fit(X_train,y_train)
gs_RFC.best_params_
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=2)]: Done  42 out of  45 | elapsed:    8.2s remaining:    0.6s
[Parallel(n_jobs=2)]: Done  45 out of  45 | elapsed:    9.3s finished
Out[35]:
{'bootstrap': False, 'max_features': 10, 'n_estimators': 30}
In [36]:
# Gradient boosting

GBC = GradientBoostingClassifier()
gb_param = [{
              'n_estimators' : [100,300],
              'learning_rate': [0.01, 0.03, 0.1],
              'max_depth': [3, 10],
              }]

gs_GBC = GridSearchCV(GBC,param_grid = gb_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_GBC.fit(X_train,y_train)
gs_GBC.best_params_
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Done  55 tasks      | elapsed:   26.2s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:   31.5s finished
Out[36]:
{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 300}
In [37]:
# SVC
SVMC = SVC()
SVM_param = [{
    'kernel': ['rbf'], 
    'gamma': [ 0.01, 0.03, 0.1],
    'C': [1, 3, 10, 30,100],
    'probability': [True]
                 }]

gs_SVM = GridSearchCV(SVMC,param_grid = SVM_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_SVM.fit(X_train,y_train)

gs_SVM.best_params_
Fitting 5 folds for each of 15 candidates, totalling 75 fits
[Parallel(n_jobs=2)]: Done  72 out of  75 | elapsed:    4.2s remaining:    0.2s
[Parallel(n_jobs=2)]: Done  75 out of  75 | elapsed:    4.4s finished
Out[37]:
{'C': 100, 'gamma': 0.01, 'kernel': 'rbf', 'probability': True}

Plot the learning curve

Let's plot the learning curves to see how the accuracy score changes with the training set size. The gap between the training curve and the cross-validation curve is a good indicator of overfitting.

In [41]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot mean training and cross-validation scores against training set size."""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training set size")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    # train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    # test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # plt.fill_between(train_sizes, train_scores_mean - train_scores_std,train_scores_mean + train_scores_std, alpha=0.1,color="r")
    # plt.fill_between(train_sizes, test_scores_mean - test_scores_std,test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",label="Cross-validation score")
    
    
    plt.legend(loc="best")
    return
In [43]:
g = plot_learning_curve(gs_RFC.best_estimator_,"RF learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gs_SVM.best_estimator_,"SVC learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gs_adaDTC.best_estimator_,"AdaBoost learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gs_GBC.best_estimator_,"GradientBoosting learning curves",X_train,y_train,cv=kfold)

The SVM classifier seems to generalize best, since its training and cross-validation learning curves are close to each other.

AdaBoost, Gradient Boosting, and Random Forest seem to overfit the training set, since there is a large gap between the two curves. Their cross-validation curves rise as the training set grows, so these three algorithms would likely perform better with more training data.

Ensemble or Stack the models

Let's use a voting classifier to combine the predictions of the four tuned classifiers (Random Forest, SVC, AdaBoost, and Gradient Boosting).

I passed "soft" to the voting parameter so that each classifier's predicted probability is taken into account, not just its vote.
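
To illustrate the difference between the two voting modes, here is a NumPy-only sketch for a single passenger and a binary target. The three probabilities are made-up numbers, and the thresholding is a simplified stand-in for what a voting classifier does, not scikit-learn's implementation.

```python
import numpy as np

# Hypothetical P(survived) from three classifiers for one passenger (made-up numbers).
proba_survived = np.array([0.45, 0.45, 0.95])

# Hard voting: each classifier casts one vote for its own predicted class.
votes = (proba_survived >= 0.5).astype(int)
hard_prediction = int(votes.sum() > len(votes) / 2)

# Soft voting: average the probabilities, then threshold the mean.
soft_prediction = int(proba_survived.mean() >= 0.5)

print(hard_prediction, soft_prediction)
```

Two lukewarm "no" votes outnumber one confident "yes" under hard voting, while soft voting lets the confident classifier tip the result.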

In [44]:
SVM_best = gs_SVM.best_estimator_
RFC_best = gs_RFC.best_estimator_
ada_best = gs_adaDTC.best_estimator_
GBC_best = gs_GBC.best_estimator_
In [45]:
votingC = VotingClassifier(estimators=[('rfc', RFC_best), 
('svc', SVM_best), ('adac',ada_best),('gbc',GBC_best)], voting='soft', n_jobs=2)

votingC = votingC.fit(X_train, y_train)

Predict

In [ ]:
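# Assumes test_data already carries the engineered columns built above for
# train_data (AgeBucket, RelativesOnboard, Title, IsAlone, Desk).
X_test = preprocess_pipeline.transform(test_data)
test_Survived = pd.Series(votingC.predict(X_test), name="Survived")

results = pd.concat([test_data["PassengerId"], test_Survived], axis=1)

results.to_csv("ensemble_python_voting.csv", index=False)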
In [ ]: