Preliminary analysis and Feature engineering¶

The goal is to predict whether a passenger survived, based on attributes such as age, sex, ticket class, and port of embarkation.

import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

TITANIC_PATH = os.path.join("/Users/yanghehuo/Documents/coursra/ML/datasets","Titan")

import pandas as pd
def read_data(file_name, path):
    csv_path = os.path.join(path, file_name)
    return pd.read_csv(csv_path)
train_data = read_data("Titan.csv",TITANIC_PATH)
test_data = read_data("test.csv",TITANIC_PATH)

Let's take a peek at the top few rows of the dataset:

train_data.head()

The attributes have the following meanings:

PassengerId: unique identifier of a passenger.
Survived: target variable. It takes two values: 0 means the passenger didn't survive, 1 means the passenger survived.
Pclass: indicates the ticket's class. 1 = 1st, 2 = 2nd, 3 = 3rd
Name, Sex, Age: self-explanatory
SibSp: number of siblings / spouses aboard the Titanic
Parch: number of parents / children aboard the Titanic
Ticket: ticket number
Fare: Price paid
Cabin: Cabin number
Embarked: where the passenger boarded the Titanic. C = Cherbourg, Q = Queenstown, S = Southampton

Let's get more information to see how much data is missing:

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The Cabin, Age, and Embarked attributes have missing values. Cabin is the worst case: about 77% of its values are null, so we will ignore it for now and focus on the others. Age is null for about 20% of passengers; replacing those nulls with the median seems promising.

The Name and Ticket attributes are hard to convert into numbers that an algorithm can consume, so we may ignore them for now as well.
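
As a quick check of the percentages quoted above, the missing share of each column can be derived from the non-null counts that `train_data.info()` printed. The sketch below hard-codes those counts rather than reading the CSV, and the small `age` series used to illustrate median imputation is made up for the example.

```python
import numpy as np
import pandas as pd

# Non-null counts copied from the train_data.info() output (891 rows in total).
n_rows = 891
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})

# Percentage of missing values per column.
missing_pct = (1 - non_null / n_rows) * 100
print(missing_pct.round(1))

# Median imputation on a tiny illustrative column (values are invented).
age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])
age_filled = age.fillna(age.median())
print(age_filled.tolist())
```

On the real data the same numbers come from `train_data.isnull().mean() * 100`.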

Let's look at the numerical variables:

train_data.describe()

About 38% of passengers survived. The classes are only moderately imbalanced, so accuracy is a reasonable performance measure.

Let's check that the Survived variable takes only the values 0 and 1.

train_data['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
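
To put later accuracy figures in context, it may help to know what a trivial classifier would score. The sketch below uses only the two counts printed above and computes the accuracy of always predicting the majority class; any model worth keeping should beat this number.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 549, 1: 342}
total = sum(counts.values())

survival_rate = counts[1] / total
# Accuracy obtained by always predicting the most common class (0 = did not survive).
baseline_accuracy = max(counts.values()) / total

print(f"survival rate:     {survival_rate:.3f}")
print(f"majority baseline: {baseline_accuracy:.3f}")
```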

train_data['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

train_data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

# Definition of the CategoricalEncoder class, copied from PR #9151.
# Just run this cell, or copy it to your code, no need to try to
# understand every line.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """Encode categorical features as a numeric array.
    The input to this transformer should be a matrix of integers or strings,
    denoting the values taken on by categorical (discrete) features.
    The features can be encoded using a one-hot aka one-of-K scheme
    (``encoding='onehot'``, the default) or converted to ordinal integers
    (``encoding='ordinal'``).
    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.
    Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
    Parameters
    ----------
    encoding : str, 'onehot', 'onehot-dense' or 'ordinal'
        The type of encoding to use (default is 'onehot'):
        - 'onehot': encode the features using a one-hot aka one-of-K scheme
          (or also called 'dummy' encoding). This creates a binary column for
          each category and returns a sparse matrix.
        - 'onehot-dense': the same as 'onehot' but returns a dense array
          instead of a sparse matrix.
        - 'ordinal': encode the features as ordinal integers. This results in
          a single column of integers (0 to n_categories - 1) per feature.
    categories : 'auto' or a list of lists/arrays of values.
        Categories (unique values) per feature:
        - 'auto' : Determine categories automatically from the training data.
        - list : ``categories[i]`` holds the categories expected in the ith
          column. The passed categories are sorted before encoding the data
          (used categories can be found in the ``categories_`` attribute).
    dtype : number type, default np.float64
        Desired dtype of output.
    handle_unknown : 'error' (default) or 'ignore'
        Whether to raise an error or ignore if a unknown categorical feature is
        present during transform (default is to raise). When this is parameter
        is set to 'ignore' and an unknown category is encountered during
        transform, the resulting one-hot encoded columns for this feature
        will be all zeros.
        Ignoring unknown categories is not supported for
        ``encoding='ordinal'``.
    Attributes
    ----------
    categories_ : list of arrays
        The categories of each feature determined during fitting. When
        categories were specified manually, this holds the sorted categories
        (in order corresponding with output of `transform`).
    Examples
    --------
    Given a dataset with three features and two samples, we let the encoder
    find the maximum value per feature and transform the data to a binary
    one-hot encoding.
    >>> from sklearn.preprocessing import CategoricalEncoder
    >>> enc = CategoricalEncoder(handle_unknown='ignore')
    >>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    ... # doctest: +ELLIPSIS
    CategoricalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
              encoding='onehot', handle_unknown='ignore')
    >>> enc.transform([[0, 1, 1], [1, 0, 4]]).toarray()
    array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
    See also
    --------
    sklearn.preprocessing.OneHotEncoder : performs a one-hot encoding of
      integer ordinal features. The ``OneHotEncoder assumes`` that input
      features take on values in the range ``[0, max(feature)]`` instead of
      using the unique values.
    sklearn.feature_extraction.DictVectorizer : performs a one-hot encoding of
      dictionary items (also handles string-valued features).
    sklearn.feature_extraction.FeatureHasher : performs an approximate one-hot
      encoding of dictionary items or strings.
    """

    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_feature]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """

        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.encoding)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        """Transform X using one-hot encoding.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=int)
        X_mask = np.ones_like(X, dtype=bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
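
The class above was a stop-gap copied from a pull request. In scikit-learn 0.20 and later, the built-in `OneHotEncoder` accepts string categories directly, so a sketch like the following may be enough on a recent installation. The tiny input is invented for illustration, and swapping this encoder into the notebook is a suggestion, not what the original run used.

```python
from sklearn.preprocessing import OneHotEncoder

# Two categorical features per row: Sex and Embarked (illustrative values).
X = [["male", "S"], ["female", "C"], ["female", "Q"]]

# handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X).toarray()  # sparse by default, so densify

print(encoder.categories_)  # categories are sorted per feature
print(encoded)

# "X" was never seen for the second feature, so its three columns stay zero.
unknown = encoder.transform([["male", "X"]]).toarray()
print(unknown)
```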

Feature Engineering¶

Try adding more promising features to the dataset, for example:
- replace SibSp and Parch with their sum,
- try to identify parts of names that correlate well with the Survived attribute (e.g. if the name contains "Countess", then survival seems more likely),
- extract the first letter of Cabin as the deck (stored below in a column named Desk); if Cabin is missing, use 'X'.

Try converting numerical attributes to categorical attributes: for example, different age groups had very different survival rates (see below), so it may help to create an age bucket category and use it instead of the age.
Similarly, it may be useful to have a special category for people traveling alone, since only 30% of them survived (see below).

# AgeBucket
train_data["AgeBucket"] = train_data["Age"] // 15 * 15
train_data[["AgeBucket", "Survived"]].groupby(['AgeBucket']).mean()
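
The bucket expression relies on floor division: `Age // 15` counts how many whole 15-year spans fit into the age, and multiplying by 15 turns that count back into the lower edge of the bucket. A small worked example, using the minimum (0.42) and maximum (80) ages from the `describe()` output plus two arbitrary ages and one missing value:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 22.0, 38.0, 80.0, np.nan])

# Floor-divide by the bucket width, then scale back to the bucket's lower edge.
buckets = ages // 15 * 15

print(pd.DataFrame({"Age": ages, "AgeBucket": buckets}))
```

Missing ages stay missing, so they still need imputing later. When reading the grouped means, it can also help to look at the group sizes (for example with `.agg(["mean", "count"])`), since a bucket with very few passengers gives an unreliable rate.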

# relatives onboard
train_data["RelativesOnboard"] = train_data["SibSp"] + train_data["Parch"]
train_data[["RelativesOnboard", "Survived"]].groupby(['RelativesOnboard']).mean()

# extract title from Name attribute
train_data['Title'] = train_data['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
train_data['Title'].value_counts()

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Sir               1
Lady              1
Don               1
Ms                1
the Countess      1
Mme               1
Jonkheer          1
Capt              1
Name: Title, dtype: int64

train_data["Title"] = train_data["Title"].replace(['Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Others')
train_data["Title"] = train_data["Title"].replace(["Miss","Ms", "Mme", "Mlle","Mrs", 'Countess','Lady', 'the Countess'],'Mme')
train_data['Title'].value_counts()

Mr        517
Mme       313
Master     40
Others     21
Name: Title, dtype: int64
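
The chained `split` calls work because every name follows the pattern "Surname, Title. Given names". An alternative that states this pattern explicitly is a single regular expression with `str.extract`. This is a sketch of that alternative, not the method used above, and the three names are taken from the first rows of the dataset.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture everything between the first ", " and the next "." as the title.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())

# Same grouping idea as in the notebook: female titles are merged into "Mme".
grouped = titles.replace(["Miss", "Ms", "Mlle", "Mrs", "Lady", "the Countess"], "Mme")
print(grouped.tolist())
```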

# create a new feature called "IsAlone"
train_data['IsAlone'] = 0
train_data.loc[train_data['RelativesOnboard'] == 0, 'IsAlone'] = 1

train_data[["IsAlone", "Survived"]].groupby(['IsAlone']).mean()

train_data.head()

# extract the Desk from the Cabin
train_data["Cabin"][train_data["Cabin"].notnull()].head()

1      C85
3     C123
6      E46
10      G6
11    C103
Name: Cabin, dtype: object

# Replace the cabin number with its first letter (the deck); use 'X' when Cabin is missing
train_data["Desk"] = pd.Series([i[0] if not pd.isnull(i) else 'X' for i in train_data['Cabin'] ])

train_data.head()
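
The steps above modify `train_data` cell by cell, but the test set will need exactly the same columns before it can go through the pipeline. One way to keep the two in sync is to gather the steps into a single function. The sketch below is an assumption about how that could look: `add_features` is a hypothetical helper name, and the three-row frame is illustrative (its SibSp, Parch and Cabin values are for demonstration only).

```python
import numpy as np
import pandas as pd

RARE_TITLES = ['Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
FEMALE_TITLES = ['Miss', 'Ms', 'Mme', 'Mlle', 'Mrs', 'Lady', 'the Countess']

def add_features(df):
    """Return a copy of df with the engineered columns used in this notebook."""
    out = df.copy()
    out["AgeBucket"] = out["Age"] // 15 * 15
    out["RelativesOnboard"] = out["SibSp"] + out["Parch"]
    # "Surname, Title. Given names" -> "Title"
    title = out["Name"].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    out["Title"] = title.replace(RARE_TITLES, "Others").replace(FEMALE_TITLES, "Mme")
    out["IsAlone"] = (out["RelativesOnboard"] == 0).astype(int)
    # First letter of the cabin number, or 'X' when the cabin is unknown.
    out["Desk"] = out["Cabin"].str[0].fillna("X")
    return out

sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
             "Allen, Mr. William Henry"],
    "Age": [22.0, 35.0, 35.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Cabin": [np.nan, "C123", np.nan],
})
features = add_features(sample)
print(features[["AgeBucket", "RelativesOnboard", "Title", "IsAlone", "Desk"]])
```

With such a helper, the same call could be applied to both `train_data` and `test_data`.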

Data preparation pipeline¶

Build the data preparation pipeline.

DataFrameSelector: selects specific attributes from a DataFrame
Imputer for categorical variables (replaces missing values with the most frequent value)
Imputer for numerical variables (available in sklearn)
Scaler (available in sklearn)

# Data Frame Selector

from sklearn.base import BaseEstimator, TransformerMixin

# A class to select numerical or categorical columns 
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

# A class to impute the categorical variable, by replacing the missing value with the most frequent value

class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent = pd.Series([X[c].value_counts().index[0] for c in X.columns]
                                       ,index = X.columns)
        return self
    def transform(self, X):
        return X.fillna(self.most_frequent)

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.preprocessing import StandardScaler

num_attributes = ['RelativesOnboard','Fare']
cat_attributes = ['Pclass','Sex','Embarked','AgeBucket','Title','IsAlone','Desk'] # categories per attribute: 3, 2, 3, 6, 4, 2, 9

num_pipeline = Pipeline([
    ("DataFrameSelector",DataFrameSelector(num_attributes)),
    ("Imputer",SimpleImputer(strategy='median')),
    ("std_scaler",StandardScaler())
])

cat_pipeline = Pipeline([
    ("DataFrameSelector",DataFrameSelector(cat_attributes)),
    ("MostFrequentImputer",MostFrequentImputer()),
    ("cat_encoder", CategoricalEncoder(encoding='onehot-dense'))
])

Combine the two pipelines

from sklearn.pipeline import FeatureUnion
preprocess_pipeline = FeatureUnion(transformer_list = [
    ('num_pipeline',num_pipeline),
    ('cat_pipeline',cat_pipeline)
])

X_train = preprocess_pipeline.fit_transform(train_data)
X_train.shape

(891, 31)
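
The 31 columns are the 2 scaled numerical attributes plus one column per category of each categorical attribute. On scikit-learn 0.20 or later, the same selector, imputer and encoder arrangement can also be expressed with `ColumnTransformer`, which selects DataFrame columns by name and makes the custom `DataFrameSelector` and `FeatureUnion` unnecessary. The following is a sketch of that alternative on an invented five-row frame, not the pipeline that produced the shape above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data with one missing value in each kind of column.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, np.nan, 53.10, 8.05],
    "RelativesOnboard": [1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q", "S"],
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array, like encoding='onehot-dense' above.
preprocess = ColumnTransformer(
    [("num", numeric, ["Fare", "RelativesOnboard"]),
     ("cat", categorical, ["Sex", "Embarked"])],
    sparse_threshold=0,
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # 2 numeric + 2 Sex + 3 Embarked columns
```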

# get the label (target)
y_train = train_data["Survived"]

Modeling¶

Select and train a model¶

I compared 9 popular classifiers, evaluating the mean accuracy of each with a stratified k-fold cross-validation procedure:

Logistic regression
Linear Discriminant Analysis
KNN
SVM
Decision Tree
Random Forest
AdaBoost
Gradient Boosting
Neural network (multi-layer perceptron)
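
"Stratified" means that each fold keeps roughly the same share of survivors as the full training set, so no fold is scored on an unrepresentative sample. The sketch below illustrates this with a synthetic label vector that has the same class counts as `Survived` (549 zeros and 342 ones); the feature matrix is a placeholder because only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the same class counts as the Survived column.
y = np.array([0] * 549 + [1] * 342)
X = np.zeros((len(y), 1))  # placeholder features, ignored by the splitter

kfold = StratifiedKFold(n_splits=5)
fold_rates = [y[test_idx].mean() for _, test_idx in kfold.split(X, y)]

print("overall survival rate:", round(y.mean(), 3))
print("per-fold survival rate:", [round(r, 3) for r in fold_rates])
```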

#Common Model Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

#Common Model Helpers
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve, cross_validate

#

# Cross validate model with Kfold stratified cross val
kfold = StratifiedKFold(n_splits=5)

# Initialize different algorithms with default parameters
random_state = 2
classifiers = [
              LogisticRegression(random_state = random_state),
              LinearDiscriminantAnalysis(),
              KNeighborsClassifier(),
              SVC(random_state=random_state),
              DecisionTreeClassifier(random_state=random_state),
              RandomForestClassifier(random_state=random_state),
              AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1),
              GradientBoostingClassifier(random_state=random_state),
              MLPClassifier(random_state=random_state)
              ]

# save the cross-validation results
#create table to compare MLA metrics
res_columns = ['CLF Name', 'CLF Parameters','CLF Test Accuracy Mean', 'CLF Test Accuracy STD']
res_compare = pd.DataFrame(columns = res_columns)
res_predict = y_train.copy()

row_index = 0
for CLF in classifiers:
    CLF_name = CLF.__class__.__name__
    res_compare.loc[row_index, 'CLF Name'] = CLF_name
    res_compare.loc[row_index, 'CLF Parameters'] = str(CLF.get_params())
    cv_result = cross_val_score(CLF, X_train, y = y_train, scoring = "accuracy", cv = kfold, n_jobs=2)
    
    res_compare.loc[row_index, 'CLF Test Accuracy Mean'] = cv_result.mean() 
    res_compare.loc[row_index, 'CLF Test Accuracy STD'] = cv_result.std()
    
    # CLF.fit(X_train, y_train)
    # res_predict[CLF_name] = CLF.predict(X_train)
    
    row_index+=1

res_compare.sort_values(by = ['CLF Test Accuracy Mean'], ascending = False, inplace = True)
res_compare

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/discriminant_analysis.py:388: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/neural_network/multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)

sns.barplot(x='CLF Test Accuracy Mean', y = 'CLF Name', data = res_compare, palette="Set3")

<matplotlib.axes._subplots.AxesSubplot at 0x10be8fb70>

Fine-tune the selected models¶

Let's perform a grid search optimization for AdaBoost, RandomForest, GradientBoosting and SVC classifiers.

# Adaboost
DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=2)

ada_param = [{
    "base_estimator__criterion" : ["gini", "entropy"],
    "base_estimator__splitter" :   ["best", "random"],
    "n_estimators" :[30,100,300],
    "learning_rate":  [0.01, 0.03, 0.1, 0.3]
}]

gs_adaDTC = GridSearchCV(adaDTC,param_grid = ada_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_adaDTC.fit(X_train,y_train)
gs_adaDTC.best_params_

Fitting 5 folds for each of 48 candidates, totalling 240 fits

[Parallel(n_jobs=2)]: Done  61 tasks      | elapsed:   16.3s
[Parallel(n_jobs=2)]: Done 211 tasks      | elapsed:  1.0min
[Parallel(n_jobs=2)]: Done 240 out of 240 | elapsed:  1.2min finished

{'base_estimator__criterion': 'entropy',
 'base_estimator__splitter': 'best',
 'learning_rate': 0.3,
 'n_estimators': 100}
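
The "48 candidates, totalling 240 fits" line in the log follows directly from the grid: the search tries every combination of the listed values and fits each one once per fold. A short check with `ParameterGrid`, which is the same expansion that `GridSearchCV` uses for a `param_grid`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV above.
ada_param = [{
    "base_estimator__criterion": ["gini", "entropy"],
    "base_estimator__splitter": ["best", "random"],
    "n_estimators": [30, 100, 300],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
}]

n_candidates = len(ParameterGrid(ada_param))  # 2 * 2 * 3 * 4
n_fits = n_candidates * 5                     # one fit per fold of the 5-fold CV

print(n_candidates, n_fits)
```

Because the cost is multiplicative, adding one more value to any list increases the total number of fits proportionally.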

# Random Forest
RFC = RandomForestClassifier()

rf_param = [
    {"bootstrap": [False],'n_estimators':[30,100,300],'max_features':[3,10,30]}
]

gs_RFC = GridSearchCV(RFC,param_grid = rf_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_RFC.fit(X_train,y_train)
gs_RFC.best_params_

Fitting 5 folds for each of 9 candidates, totalling 45 fits

[Parallel(n_jobs=2)]: Done  42 out of  45 | elapsed:    8.2s remaining:    0.6s
[Parallel(n_jobs=2)]: Done  45 out of  45 | elapsed:    9.3s finished

{'bootstrap': False, 'max_features': 10, 'n_estimators': 30}

# Gradient boosting

GBC = GradientBoostingClassifier()
gb_param = [{
              'n_estimators' : [100,300],
              'learning_rate': [0.01, 0.03, 0.1],
              'max_depth': [3, 10],
              }]

gs_GBC = GridSearchCV(GBC,param_grid = gb_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_GBC.fit(X_train,y_train)
gs_GBC.best_params_

Fitting 5 folds for each of 12 candidates, totalling 60 fits

[Parallel(n_jobs=2)]: Done  55 tasks      | elapsed:   26.2s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:   31.5s finished

{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 300}

# SVC
SVMC = SVC()
SVM_param = [{
    'kernel': ['rbf'], 
    'gamma': [ 0.01, 0.03, 0.1],
    'C': [1, 3, 10, 30,100],
    'probability': [True]
                 }]

gs_SVM = GridSearchCV(SVMC,param_grid = SVM_param, cv=kfold, scoring="accuracy", n_jobs= 2, verbose = 1)

gs_SVM.fit(X_train,y_train)

gs_SVM.best_params_

Fitting 5 folds for each of 15 candidates, totalling 75 fits

[Parallel(n_jobs=2)]: Done  72 out of  75 | elapsed:    4.2s remaining:    0.2s
[Parallel(n_jobs=2)]: Done  75 out of  75 | elapsed:    4.4s finished

{'C': 100, 'gamma': 0.01, 'kernel': 'rbf', 'probability': True}

Plot the learning curve¶

Let's plot the learning curve to see accuracy score against training size. It's a good way to see the overfitting effect.

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot training and cross-validation scores against training set size."""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training set size")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    # train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    # test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # plt.fill_between(train_sizes, train_scores_mean - train_scores_std,train_scores_mean + train_scores_std, alpha=0.1,color="r")
    # plt.fill_between(train_sizes, test_scores_mean - test_scores_std,test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",label="Cross-validation score")
    
    
    plt.legend(loc="best")
    return

g = plot_learning_curve(gs_RFC.best_estimator_,"RF learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gs_SVM.best_estimator_,"SVC learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gs_adaDTC.best_estimator_,"AdaBoost learning curves",X_train,y_train,cv=kfold)
g = plot_learning_curve(gs_GBC.best_estimator_,"GradientBoosting learning curves",X_train,y_train,cv=kfold)

The SVM classifier seems to generalize best, since its training and cross-validation curves are close to each other.

AdaBoost, Gradient Boosting, and Random Forest seem to overfit the training set, since there is a large gap between the two curves. The cross-validation curve rises as the training set grows, so these three algorithms would likely perform better with more training data.
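
The gap between the two curves can also be read off numerically rather than by eye. The sketch below does this for an unpruned decision tree on a synthetic dataset from `make_classification`; both the data and the choice of model are stand-ins for illustration, not the Titanic features or the tuned estimators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

sizes, train_scores, cv_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

# Mean over folds, then the train-minus-validation gap at each training size.
gap = train_scores.mean(axis=1) - cv_scores.mean(axis=1)
for n, g in zip(sizes, gap):
    print(f"{n:4d} training samples: gap = {g:.3f}")
```

An unpruned tree memorizes its training data, so the training score stays at 1.0 and the whole gap comes from the validation score. A gap that shrinks as the training size grows is the pattern described above for the tree-based ensembles.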

Ensemble or Stack the models¶

Let's use a voting classifier to combine the predictions of the four tuned classifiers.

I passed "soft" as the voting argument so that the predicted probability behind each vote is taken into account.
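
The difference between the two voting modes shows up when one classifier is confident and the others are lukewarm. The following NumPy sketch uses invented probabilities for a single passenger: hard voting counts class labels, while soft voting averages the predicted probabilities before thresholding.

```python
import numpy as np

# P(survived) for one passenger from three hypothetical classifiers.
proba_survived = np.array([0.9, 0.4, 0.4])

# Hard voting: each classifier casts a 0/1 vote and the majority wins.
hard_votes = (proba_survived > 0.5).astype(int)
hard_prediction = int(hard_votes.sum() > len(hard_votes) / 2)

# Soft voting: average the probabilities, then threshold once.
soft_prediction = int(proba_survived.mean() > 0.5)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Soft voting needs every estimator to provide `predict_proba`, which is why the SVC grid above sets `probability=True`.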

SVM_best = gs_SVM.best_estimator_
RFC_best = gs_RFC.best_estimator_
ada_best = gs_adaDTC.best_estimator_
GBC_best = gs_GBC.best_estimator_

votingC = VotingClassifier(estimators=[('rfc', RFC_best), 
('svc', SVM_best), ('adac',ada_best),('gbc',GBC_best)], voting='soft', n_jobs=2)

votingC = votingC.fit(X_train, y_train)

Predict¶

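# Assumes test_data already has the same engineered columns as train_data
# (AgeBucket, RelativesOnboard, Title, IsAlone, Desk).
X_test = preprocess_pipeline.transform(test_data)
test_Survived = pd.Series(votingC.predict(X_test), name="Survived")

results = pd.concat([test_data["PassengerId"], test_Survived],axis=1)

results.to_csv("ensemble_python_voting.csv",index=False)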

Output of train_data.head():
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Output of train_data.describe():
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Survival rate by AgeBucket:
AgeBucket	Survived
0.0	0.576923
15.0	0.362745
30.0	0.423256
45.0	0.404494
60.0	0.240000
75.0	1.000000

Survival rate by RelativesOnboard:
RelativesOnboard	Survived
0	0.303538
1	0.552795
2	0.578431
3	0.724138
4	0.200000
5	0.136364
6	0.333333
7	0.000000
10	0.000000

Cross-validation results (res_compare), sorted by mean accuracy:
	CLF Name	CLF Parameters	CLF Test Accuracy Mean	CLF Test Accuracy STD
3	SVC	{'C': 1.0, 'cache_size': 200, 'class_weight': ...	0.835031	0.0188776
1	LinearDiscriminantAnalysis	{'n_components': None, 'priors': None, 'shrink...	0.832834	0.0187845
0	LogisticRegression	{'C': 1.0, 'class_weight': None, 'dual': False...	0.828346	0.0184794
7	GradientBoostingClassifier	{'criterion': 'friedman_mse', 'init': None, 'l...	0.827197	0.0189454
8	MLPClassifier	{'activation': 'relu', 'alpha': 0.0001, 'batch...	0.81935	0.0174372
2	KNeighborsClassifier	{'algorithm': 'auto', 'leaf_size': 30, 'metric...	0.808171	0.0280628
6	AdaBoostClassifier	{'algorithm': 'SAMME.R', 'base_estimator__clas...	0.804769	0.0180252
4	DecisionTreeClassifier	{'class_weight': None, 'criterion': 'gini', 'm...	0.800293	0.0196009
5	RandomForestClassifier	{'bootstrap': True, 'class_weight': None, 'cri...	0.798046	0.0287012

Survival rate by IsAlone:
IsAlone	Survived
0	0.505650
1	0.303538