Titanic Analysis using Stacking & accuracy comparison with plain scikit-learn models
References
- Minsuk Heo's blog (https://github.com/minsuk-heo/kaggle-titanic/tree/master)
- Minsuk Heo's YouTube (https://www.youtube.com/watch?v=aqp_9HV58Ls)
- Machine learning stacking ensemble (https://lsjsj92.tistory.com/559?category=853217)
The basic preprocessing before stacking follows the references above.
import pandas as pd
train = pd.read_csv('../dataset/titanic_train.csv')
test = pd.read_csv('../dataset/titanic_test.csv')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() # setting seaborn default for plots
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
test.isnull().sum()
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 86
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 327
Embarked 0
dtype: int64
For the purpose of this post, the EDA part is omitted; the originals above cover it far more thoroughly.
1. Adding features
train_test_data = [train, test] # combining train and test dataset
for dataset in train_test_data:
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
train['Title'].value_counts()
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Col 2
Mlle 2
Major 2
Sir 1
Countess 1
Capt 1
Lady 1
Don 1
Mme 1
Ms 1
Jonkheer 1
Name: Title, dtype: int64
Title map
Mr : 0
Miss : 1
Mrs: 2
Others: 3
Map the Title values to categorical codes.
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2,
                 "Master": 3, "Dr": 3, "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3, "Countess": 3,
                 "Ms": 3, "Lady": 3, "Jonkheer": 3, "Don": 3, "Dona": 3, "Mme": 3, "Capt": 3, "Sir": 3}
for dataset in train_test_data:
    dataset['Title'] = dataset['Title'].map(title_mapping)
# delete unnecessary feature from dataset
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)
sex_mapping = {"male": 0, "female": 1}
for dataset in train_test_data:
    dataset['Sex'] = dataset['Sex'].map(sex_mapping)
2. Handling missing values (nulls)
# fill missing age with median age for each title (Mr, Mrs, Miss, Others)
train["Age"].fillna(train.groupby("Title")["Age"].transform("median"), inplace=True)
test["Age"].fillna(test.groupby("Title")["Age"].transform("median"), inplace=True)
# train.head(30)
print(train.groupby("Title")["Age"].agg("median"))
train.groupby("Title")["Age"].transform("median")[0:10]  ## shows, for each of the first 10 rows, the median Age of that row's Title group
Title
0 30.0
1 21.0
2 35.0
3 9.0
Name: Age, dtype: float64
0 30.0
1 35.0
2 21.0
3 35.0
4 30.0
5 30.0
6 30.0
7 9.0
8 35.0
9 35.0
Name: Age, dtype: float64
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
[KDE plot: Age distribution by Survived]
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.xlim(0,20) ## This is an enlarged picture of the above.
(0, 20)
Pclass1 = train[train['Pclass']==1]['Embarked'].value_counts()
Pclass2 = train[train['Pclass']==2]['Embarked'].value_counts()
Pclass3 = train[train['Pclass']==3]['Embarked'].value_counts()
df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class','2nd class', '3rd class']
df.plot(kind='bar',stacked=True, figsize=(10,5))
[Stacked bar chart: Embarked counts per Pclass]
More than 50% of 1st class passengers embarked at S
More than 50% of 2nd class passengers embarked at S
More than 50% of 3rd class passengers embarked at S
so fill the missing Embarked values with S
for dataset in train_test_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 2 | 1 | 1 | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 2 |
2 | 3 | 1 | 3 | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
3 | 4 | 1 | 1 | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 2 |
4 | 5 | 0 | 3 | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
embarked_mapping = {"S": 0, "C": 1, "Q": 2}
for dataset in train_test_data:
    dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)
3. Feature engineering
for dataset in train_test_data:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 26), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 36), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 36) & (dataset['Age'] <= 62), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 62, 'Age'] = 4
# fill missing Fare with median fare for each Pclass
train["Fare"].fillna(train.groupby("Pclass")["Fare"].transform("median"), inplace=True)
test["Fare"].fillna(test.groupby("Pclass")["Fare"].transform("median"), inplace=True)
train.head(5)
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 0 |
1 | 2 | 1 | 1 | 1 | 3.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1 | 2 |
2 | 3 | 1 | 3 | 1 | 1.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 1 |
3 | 4 | 1 | 1 | 1 | 2.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 2 |
4 | 5 | 0 | 3 | 0 | 2.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 0 |
for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].str[:1]
Pclass1 = train[train['Pclass']==1]['Cabin'].value_counts()
Pclass2 = train[train['Pclass']==2]['Cabin'].value_counts()
Pclass3 = train[train['Pclass']==3]['Cabin'].value_counts()
df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
df.index = ['1st class','2nd class', '3rd class']
df.plot(kind='bar',stacked=True, figsize=(10,5))
[Stacked bar chart: Cabin deck counts per Pclass]
cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2, "E": 1.6, "F": 2, "G": 2.4, "T": 2.8}
for dataset in train_test_data:
    dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)
# fill missing Cabin with the median Cabin value for each Pclass
train["Cabin"].fillna(train.groupby("Pclass")["Cabin"].transform("median"), inplace=True)
test["Cabin"].fillna(test.groupby("Pclass")["Cabin"].transform("median"), inplace=True)
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
test["FamilySize"] = test["SibSp"] + test["Parch"] + 1
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'FamilySize',shade= True)
facet.set(xlim=(0, train['FamilySize'].max()))
facet.add_legend()
plt.xlim(0)
(0, 11.0)
family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
for dataset in train_test_data:
    dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)
train.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | FamilySize | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 0 | 1.0 | 1 | 0 | A/5 21171 | 7.2500 | 2.0 | 0 | 0 | 0.4 |
1 | 2 | 1 | 1 | 1 | 3.0 | 1 | 0 | PC 17599 | 71.2833 | 0.8 | 1 | 2 | 0.4 |
2 | 3 | 1 | 3 | 1 | 1.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 2.0 | 0 | 1 | 0.0 |
3 | 4 | 1 | 1 | 1 | 2.0 | 1 | 0 | 113803 | 53.1000 | 0.8 | 0 | 2 | 0.4 |
4 | 5 | 0 | 3 | 0 | 2.0 | 0 | 0 | 373450 | 8.0500 | 2.0 | 0 | 0 | 0.0 |
features_drop = ['Ticket', 'SibSp', 'Parch']
train = train.drop(features_drop, axis=1)
test = test.drop(features_drop, axis=1)
train = train.drop(['PassengerId'], axis=1)
Find Outliers && Remove Them
Using IQR
: Interquartile Range (https://lsjsj92.tistory.com/556?category=853217)
Percentile
: a division of the data into 100 equal parts
Quartile
: a division of the data into 4 equal parts
Median
: the value in the exact middle of the data (half of the observations are greater than or equal to it, the other half less than or equal to it)
Third quartile (Q3)
: the median of the upper 50%, i.e. the value marking the top 25% of the data
First quartile (Q1)
: the median of the lower 50%, i.e. the value marking the bottom 25% of the data
Interquartile range (IQR)
: the middle 50% of the data (Q3 - Q1)
- Q3 is the top-25% cut and Q1 the bottom-25% cut, so IQR = Q3 - Q1
- Upper limit = Q3 + IQR*1.5 (1.5 is the usual multiplier)
- Lower limit = Q1 - IQR*1.5 (a short numeric sketch of these formulas follows below)
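As a quick numeric illustration of the IQR rule (a minimal sketch on the Fare column; with a multiplier of 1.5 it simply restates the formulas above, while the actual filtering is done later by the outlier_index function):
# quick IQR illustration on Fare
q1 = train['Fare'].quantile(0.25)
q3 = train['Fare'].quantile(0.75)
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr
lower_limit = max(q1 - 1.5 * iqr, 0)   # Fare cannot be negative, so clamp the lower limit at 0
print(lower_limit, upper_limit, ((train['Fare'] < lower_limit) | (train['Fare'] > upper_limit)).sum())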
## In train, only Fare can really be treated as numeric. FamilySize is an engineered variable that has already been turned into an ordered categorical value, and the same goes for the other columns.
# train.Fare
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Fare',shade= True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
# plt.xlim(0)
[KDE plot: Fare distribution by Survived]
facet = sns.FacetGrid(train, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Fare',shade= True)
facet.set(xlim=(0, train['Fare'].max()))
facet.add_legend()
plt.xlim(0,60)
(0, 60)
The Fare feature looks correlated with Survived, so check for outliers!!
train.Fare.describe().apply(lambda x : "{:.4f}".format(x))
count 891.0000
mean 32.2042
std 49.6934
min 0.0000
25% 7.9104
50% 14.4542
75% 31.0000
max 512.3292
Name: Fare, dtype: object
def outlier_index(df, column, p):
    q3 = df[column].quantile(0.75)
    q1 = df[column].quantile(0.25)
    iqr = q3 - q1
    max_limit_val = q3 + (iqr * p)
    min_limit_val = 0 if q1 - (iqr * p) < 0 else q1 - (iqr * p)
    a = (min_limit_val, max_limit_val)
    print("min_limit_val {}".format(a[0]), "\t", "max_limit_val {}".format(a[1]))
    ix = df.loc[df[column] < a[0]].index.union(df.loc[df[column] > a[1]].index)  # rows flagged as outliers
    left_ix = set(df.index) - set(ix)  # indices that survive the outlier filter
    return left_ix
train.groupby(by='Survived').Fare.describe().stack().apply(lambda x : "{:.4f}".format(x))
Survived
0 count 549.0000
mean 22.1179
std 31.3882
min 0.0000
25% 7.8542
50% 10.5000
75% 26.0000
max 263.0000
1 count 342.0000
mean 48.3954
std 66.5970
min 0.0000
25% 12.4750
50% 26.0000
75% 57.0000
max 512.3292
dtype: object
Since the outliers may differ between Survived = 1 and Survived = 0, handle them separately for each class.
## run this cell if you want to detect and remove the outliers
left_survive_ix = outlier_index(train[train.loc[:, "Survived"] == 1], "Fare", 2.5)
left_die_ix = outlier_index(train[train.loc[:, "Survived"] == 0], "Fare", 2.5)
print(len(left_survive_ix), len(left_die_ix))
left_survive_ix.intersection(left_die_ix)  ## no overlapping indices
train_01 = pd.concat([train.iloc[list(left_survive_ix)],train.iloc[list(left_die_ix)]],axis=0)
train_01.reset_index(drop=True,inplace=True)
print(train_01.shape)
train_01.head()
## run this cell instead if you do not want to detect and remove the outliers
train_01 = train.copy()
train_data = train_01.drop('Survived', axis=1)
target = train_01['Survived']
train_data.shape, target.shape
((891, 8), (891,))
print(train_data.shape)
train_data.head(5)
(891, 8)
Pclass | Sex | Age | Fare | Cabin | Embarked | Title | FamilySize | |
---|---|---|---|---|---|---|---|---|
0 | 3 | 0 | 1.0 | 7.2500 | 2.0 | 0 | 0 | 0.4 |
1 | 1 | 1 | 3.0 | 71.2833 | 0.8 | 1 | 2 | 0.4 |
2 | 3 | 1 | 1.0 | 7.9250 | 2.0 | 0 | 1 | 0.0 |
3 | 1 | 1 | 2.0 | 53.1000 | 0.8 | 0 | 2 | 0.4 |
4 | 3 | 0 | 2.0 | 8.0500 | 2.0 | 0 | 0 | 0.0 |
total_y_sum = target.value_counts().sort_values(axis=0,ascending=False).sum()
target.value_counts().sort_values(axis=0,ascending=False)/total_y_sum
0 0.616162
1 0.383838
Name: Survived, dtype: float64
train_data = train_data.astype('float64')
Scaling
from sklearn.preprocessing import StandardScaler,MinMaxScaler
scaler = StandardScaler()
scaler.fit(train_data)
x_scaled = scaler.transform(train_data)
Modeling
# Importing Classifier Modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import numpy as np
Modeling - Cross Validation (K-fold)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
Modeling - kNN
clf = KNeighborsClassifier(n_neighbors = 13)
scoring = 'accuracy'
score = cross_val_score(clf, x_scaled, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.83333333 0.7752809 0.80898876 0.82022472 0.83146067 0.83146067
0.76404494 0.84269663 0.82022472 0.82022472]
# kNN Score
round(np.mean(score)*100, 2)
81.48
Modeling - Decision Tree
clf = DecisionTreeClassifier()
scoring = 'accuracy'
score = cross_val_score(clf, x_scaled, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.81111111 0.78651685 0.83146067 0.7752809 0.79775281 0.80898876
0.80898876 0.7752809 0.7752809 0.80898876]
# decision tree Score
round(np.mean(score)*100, 2)
79.8
Modeling - Random Forest
clf = RandomForestClassifier(n_estimators=13)
scoring = 'accuracy'
score = cross_val_score(clf, x_scaled, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.76666667 0.83146067 0.84269663 0.76404494 0.83146067 0.78651685
0.83146067 0.79775281 0.75280899 0.82022472]
# Random Forest Score
round(np.mean(score)*100, 2)
80.25
Modeling - Naive Bayes
clf = GaussianNB()
scoring = 'accuracy'
score = cross_val_score(clf, x_scaled, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.82222222 0.74157303 0.7752809 0.75280899 0.74157303 0.83146067
0.80898876 0.82022472 0.82022472 0.83146067]
# Naive Bayes Score
round(np.mean(score)*100, 2)
79.46
Modeling - SVM
clf = SVC(gamma='auto')
scoring = 'accuracy'
score = cross_val_score(clf, x_scaled, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
[0.82222222 0.82022472 0.78651685 0.82022472 0.84269663 0.83146067
0.83146067 0.82022472 0.79775281 0.80898876]
round(np.mean(score)*100,2)
81.82
Stacking-Ensemble using K-Fold
A few warm-up exercises to help understand the code below.
test_fold = np.random.randint(0,100,size=(10,3))
print(test_fold.shape)
(10, 3)
test_fold
array([[38, 79, 89],
[ 2, 66, 25],
[95, 56, 95],
[75, 2, 87],
[10, 34, 63],
[56, 46, 78],
[33, 38, 15],
[14, 82, 24],
[10, 63, 85],
[20, 23, 59]])
kk = KFold(n_splits=5, shuffle=False)  # random_state has no effect (and recent scikit-learn rejects it) when shuffle=False
print("kk.get_n_splits :", kk.get_n_splits(test_fold))
for cnt, (train_idx, valid_idx) in enumerate(kk.split(test_fold)):
    print("fold {}".format(cnt), "train indices {}".format(train_idx), "validation indices {}".format(valid_idx))
kk.get_n_splits : 5
fold 0 train indices [2 3 4 5 6 7 8 9] validation indices [0 1]
fold 1 train indices [0 1 4 5 6 7 8 9] validation indices [2 3]
fold 2 train indices [0 1 2 3 6 7 8 9] validation indices [4 5]
fold 3 train indices [0 1 2 3 4 5 8 9] validation indices [6 7]
fold 4 train indices [0 1 2 3 4 5 6 7] validation indices [8 9]
When training several models and then building the final meta-model, the training set has to be reorganized several times based on K-Fold. To keep this convenient, a separate helper function, get_stacking_data(), manages it.
def get_stacking_data(model, x_train, y_train, x_test, n_folds=5):
    kfold = KFold(n_splits=n_folds)  # random_state removed: it is meaningless without shuffle=True
    train_fold_predict = np.zeros((x_train.shape[0], 1))  # one column collecting each fold's validation predictions
    test_predict = np.zeros((x_test.shape[0], n_folds))
    print("model : {}".format(model.__class__.__name__))

    for cnt, (train_idx, valid_idx) in enumerate(kfold.split(x_train)):
        x_train_ = x_train[train_idx]
        y_train_ = y_train[train_idx]
        x_val = x_train[valid_idx]

        ## train on this fold
        model.fit(x_train_, y_train_)

        # predict the validation data with the model trained on this fold and store it;
        # stacked together, these predictions later become new_x_train
        train_fold_predict[valid_idx, :] = model.predict(x_val).reshape(-1, 1)

        # predict the original test data (x_test) with this fold's model and store it
        test_predict[:, cnt] = model.predict(x_test)

    # after the loop, average the per-fold test predictions into a single column
    test_predict_mean = np.mean(test_predict, axis=1).reshape(-1, 1)

    return train_fold_predict, test_predict_mean
clf_knn = KNeighborsClassifier(n_neighbors = 13)
clf_dt = DecisionTreeClassifier()
clf_rf = RandomForestClassifier(n_estimators=13)
clf_nb = GaussianNB()
clf_svm = SVC(gamma='auto')
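Note: the calls below use xtest_scaled, which in this post is only created later in the Testing section. When running the cells top to bottom, the scaled test features have to be prepared first; a minimal sketch mirroring the Testing-section preprocessing:
test_data = test.drop("PassengerId", axis=1).copy()
test_data = test_data.astype('float64')
xtest_scaled = scaler.transform(test_data)  # scaler was fitted on train_data above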
staking_knn_train,staking_knn_test = get_stacking_data(clf_knn,x_scaled,target,xtest_scaled,n_folds=10)
staking_dt_train,staking_dt_test = get_stacking_data(clf_dt,x_scaled,target,xtest_scaled,n_folds=10)
staking_rf_train,staking_rf_test = get_stacking_data(clf_rf,x_scaled,target,xtest_scaled,n_folds=10)
staking_nb_train,staking_nb_test = get_stacking_data(clf_nb,x_scaled,target,xtest_scaled,n_folds=10)
staking_svm_train,staking_svm_test = get_stacking_data(clf_svm,x_scaled,target,xtest_scaled,n_folds=10)
model : KNeighborsClassifier
model : DecisionTreeClassifier
model : RandomForestClassifier
model : GaussianNB
model : SVC
print(staking_knn_train.shape,staking_knn_test.shape)
print(staking_dt_train.shape,staking_dt_test.shape)
print(staking_rf_train.shape,staking_rf_test.shape)
print(staking_nb_train.shape,staking_nb_test.shape)
print(staking_svm_train.shape,staking_svm_test.shape)
(891, 1) (418, 1)
(891, 1) (418, 1)
(891, 1) (418, 1)
(891, 1) (418, 1)
(891, 1) (418, 1)
new_x_train = np.concatenate([staking_knn_train,staking_dt_train,staking_rf_train,staking_nb_train,staking_svm_train],axis=1)
new_x_test = np.concatenate([staking_knn_test,staking_dt_test,staking_rf_test,staking_nb_test,staking_svm_test],axis=1)
print("origin x_train.shape {}".format(x_scaled.shape))
print("for stacking shape new_x_train.shape {}".format(new_x_train.shape)) ## total count of each clf is 5
print("for stackgin shape new_x_test.shape {}".format(new_x_test.shape))
origin x_train.shape (891, 8)
for stacking shape new_x_train.shape (891, 5)
for stackgin shape new_x_test.shape (418, 5)
The structure of stacking is easiest to understand through these shapes.
The stacking (meta) model uses the outputs of the individual models as its features; when stacking is built with K-Fold,
- the out-of-fold predictions of one base classifier on the K-Fold validation sets become one feature of the stacking training set (new_x_train). Here 10 folds were used, so the training data was split into 10 pieces (x_train.shape[0]/10 rows each), each piece received validation predictions, and concatenating them gives predictions for every training row (x_train.shape[0]).
- Collecting these columns forms the new stacking training set; with 5 base classifiers, the result is a training set with 5 features -> (891, 5).
- For the stacking new_x_test, each base classifier predicts x_test once with each of its 10 fold models, these predictions are averaged into a single column, and the 5 columns together form new_x_test with 5 features -> (418, 5).
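For comparison, scikit-learn (0.22+) ships a ready-made StackingClassifier built on the same out-of-fold idea. A minimal sketch with the same five base models and a logistic-regression meta-model (not part of the original post; its score was not measured here):
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

estimators = [('knn', KNeighborsClassifier(n_neighbors=13)),
              ('dt', DecisionTreeClassifier()),
              ('rf', RandomForestClassifier(n_estimators=13)),
              ('nb', GaussianNB()),
              ('svm', SVC(gamma='auto'))]
stack_clf = StackingClassifier(estimators=estimators,
                               final_estimator=LogisticRegression(),
                               cv=10)  # 10 internal folds, matching get_stacking_data above
score = cross_val_score(stack_clf, x_scaled, target, cv=k_fold, n_jobs=1, scoring='accuracy')
print(round(np.mean(score) * 100, 2))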
Testing
Prediction from the best-scoring model (SVM), without stacking
clf = SVC(gamma='auto')
clf.fit(train_data, target)
test_data = test.drop("PassengerId", axis=1).copy()
test_data = test_data.astype('float64')
xtest_scaled = scaler.transform(test_data)
prediction = clf.predict(xtest_scaled)  ## Kaggle submission score: 0.65071. The score before the outlier handling seems to have been better; this looks like overfitting.
For reference, compared with the original blog above, the extra outlier removal and scaling that were ambitiously added here actually give a worse Kaggle score than the original:
- original blog : 0.78947
- outlier removal + scaling : 0.65071
- outlier removal (larger IQR multiplier) + scaling : 0.62679
Prediction from a LightGBM model, without stacking
from lightgbm import LGBMClassifier,plot_importance
lgb = LGBMClassifier(n_estimators = 400)
clf_lgb = lgb
clf_lgb.fit(train_data, target)
scoring = 'accuracy'
score = cross_val_score(clf_lgb, x_scaled, target, cv=k_fold, n_jobs=1, scoring=scoring)
print(score)
print(round(np.mean(score)*100,2))
[0.78888889 0.84269663 0.80898876 0.76404494 0.87640449 0.82022472
0.79775281 0.82022472 0.76404494 0.82022472]
81.03
test_data = test.drop("PassengerId", axis=1).copy()
test_data = test_data.astype('float64')
xtest_scaled = scaler.transform(test_data)
prediction = clf_lgb.predict(xtest_scaled)  ## Kaggle submission score: 0.78468
submission = pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": prediction
})
submission.to_csv('../dataset/submission.csv', index=False)
Prediction using the stacking results
stacking_lgb = LGBMClassifier(n_estimators = 400)
stacking_lgb.fit(new_x_train,target)
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=400, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
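As an optional sanity check (not in the original post), the meta-model can also be scored with the same 10-fold CV, this time on the stacked features. Note that the out-of-fold columns were built from the full training set, so treat the number as indicative only:
score = cross_val_score(LGBMClassifier(n_estimators=400), new_x_train, target,
                        cv=k_fold, n_jobs=1, scoring='accuracy')
print(round(np.mean(score) * 100, 2))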
prediction = stacking_lgb.predict(new_x_test)
submission = pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": prediction
})
submission.to_csv('../dataset/submission.csv', index=False)  ## achieved 0.79425!!
# submission = pd.read_csv('submission.csv')
# submission.head()