Handling Missing Values Using KNN

8 minute read

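How to handle missing values is a perennial piece of homework for anyone working in data science.
They can be handled with simple methods, but many better approaches have been tried; today we look at one of them, the KNN method.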

import numpy as np
import pandas as pd
from collections import defaultdict
from scipy.stats import hmean
from scipy.spatial.distance import cdist
from scipy import stats
import numbers
import warnings
warnings.filterwarnings("ignore")
train = pd.read_csv('../dataset/titanic_train.csv')
test = pd.read_csv('../dataset/titanic_test.csv')
# train = pd.read_csv('../ML_Area/data_source/titanic_train.csv')
# test = pd.read_csv('../ML_Area/data_source/titanic_test.csv')
# ML_Area\data_source

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() # setting seaborn default for plots

pd.concat([train.isnull().sum(),train.dtypes],axis=1,keys=['null','dtype'])

	null	dtype
PassengerId	0	int64
Survived	0	int64
Pclass	0	int64
Name	0	object
Sex	0	object
Age	177	float64
SibSp	0	int64
Parch	0	int64
Ticket	0	object
Fare	0	float64
Cabin	687	object
Embarked	2	object

'Age': numeric variable
'Cabin', 'Embarked': categorical variables

# !python -m pip install scikit-learn==0.23.1
# !pip show version scikit-learn

1. Scikit-Learn KNNImputer

import sklearn
sklearn.show_versions()

System:
    python: 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
executable: C:\ProgramData\Anaconda3\envs\test\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
          pip: 19.0.3
   setuptools: 41.0.0
      sklearn: 0.23.1
        numpy: 1.16.4
        scipy: 1.2.1
       Cython: None
       pandas: 0.24.2
   matplotlib: 3.0.3
       joblib: 0.15.1
threadpoolctl: 2.0.0

Built with OpenMP: True

## KNNImputer was added in scikit-learn 0.22; this post uses 0.23.1
from sklearn.impute import KNNImputer

## 'Age','Cabin'
train.iloc[10:20]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
16	17	0	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	0	244373	13.0000	NaN	S
18	19	0	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31.0	1	0	345763	18.0000	NaN	S
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	0	0	2649	7.2250	NaN	C

KNNImputer can only be used on numeric variables.
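Here is a small sketch of one way to respect that restriction. It assumes a made-up three-column frame (`toy` and its values are illustrative, not the Titanic data): the numeric columns are picked with `select_dtypes`, only those are imputed, and the string column is left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame: two numeric columns and one string column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# Select only the numeric columns; "Sex" would make fit_transform fail.
numeric_cols = toy.select_dtypes(include="number").columns

imputer = KNNImputer(n_neighbors=2)
toy_filled = toy.copy()
toy_filled[numeric_cols] = imputer.fit_transform(toy[numeric_cols])

print(toy_filled)
```

The missing Age is filled from the two rows whose Fare is closest, and the frame keeps its column names, which a bare `fit_transform` call (returning a NumPy array) does not.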

not_cat_df = train[[col for col in train.columns if col not in ['Cabin','Embarked']]].copy() ## DataFrame without 'Cabin' and 'Embarked'

int_col = train.columns[train.dtypes != 'object'].to_list()
cat_col = train.columns[train.dtypes == 'object'].to_list()

imputer = KNNImputer(n_neighbors=5)
# df_filled = imputer.fit_transform(not_cat_df) ## Raises an error whenever the DataFrame contains string (categorical) columns, i.e. KNNImputer cannot impute categorical variables directly.
df_filled = imputer.fit_transform(train[int_col])

df_filled[10:20,:]

array([[11.    ,  1.    ,  3.    ,  4.    ,  1.    ,  1.    , 16.7   ],
       [12.    ,  1.    ,  1.    , 58.    ,  0.    ,  0.    , 26.55  ],
       [13.    ,  0.    ,  3.    , 20.    ,  0.    ,  0.    ,  8.05  ],
       [14.    ,  0.    ,  3.    , 39.    ,  1.    ,  5.    , 31.275 ],
       [15.    ,  0.    ,  3.    , 14.    ,  0.    ,  0.    ,  7.8542],
       [16.    ,  1.    ,  2.    , 55.    ,  0.    ,  0.    , 16.    ],
       [17.    ,  0.    ,  3.    ,  2.    ,  4.    ,  1.    , 29.125 ],
       [18.    ,  1.    ,  2.    , 29.8   ,  0.    ,  0.    , 13.    ],
       [19.    ,  0.    ,  3.    , 31.    ,  1.    ,  0.    , 18.    ],
       [20.    ,  1.    ,  3.    , 27.6   ,  0.    ,  0.    ,  7.225 ]])

Comparing the two outputs, passengers 18 and 20 had a missing Age, and the values changed as follows:
18 : NaN -> 29.8
20 : NaN -> 27.6
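To get a feel for where filled values like these come from, here is a rough pure-Python sketch of the idea behind KNN imputation. As I understand it, KNNImputer by default measures distance only over the coordinates both rows have (rescaled for the ones skipped) and averages the k nearest rows that have the missing feature. The names `nan_euclidean` and `knn_impute` and the tiny matrix are hypothetical; this is a simplification, not scikit-learn's implementation.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over coordinates present in both rows,
    rescaled by (total coordinates / coordinates actually used)."""
    pairs = [(x, y) for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.nan
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(len(a) / len(pairs) * squared)

def knn_impute(rows, k=2):
    """Fill each NaN with the mean of that column over the k nearest
    rows that have an observed value in that column."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            donors = []
            for m, other in enumerate(rows):
                if m == i or math.isnan(other[j]):
                    continue  # skip the row itself and rows missing column j
                d = nan_euclidean(row, other)
                if not math.isnan(d):
                    donors.append((d, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

nan = math.nan
X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
X_filled = knn_impute(X, k=2)
print(X_filled)
```

Because the distance uses every numeric column as-is, columns with large ranges (PassengerId in the run above, for instance) weigh heavily in deciding which rows count as neighbors.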

2. fancyimpute KNNImputer

Medium post: "Preprocessing: Encode and KNN Impute All Categorical Features Fast"

# !python -m pip install fancyimpute

from fancyimpute import KNN

Using TensorFlow backend.

################`small talk about OrdinalEncoder()` Start################

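The two encoders differ in the input they accept and in what they expose after fitting.

1) OrdinalEncoder: takes 2D input of shape (n_samples, n_features) and stores the learned categories in categories_, one array per feature. This is the same convention as other encoders such as OneHotEncoder.
2) LabelEncoder: takes 1D input of shape (n_samples,) and stores the learned classes in classes_. Because it handles one column at a time, encoding several columns means looping over them, which takes longer.
Do not mistake OrdinalEncoder for a class meant only for ordered features. It converts categorical features that have no inherent order to integers (the codes simply follow the sorted order of the categories).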
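The shape difference can be sketched as follows. The three-row array is made up for illustration; the two classes and their `categories_` / `classes_` attributes are the scikit-learn ones discussed above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical 2D input: one row per sample, one column per feature.
X = np.array([["S", "male"], ["C", "female"], ["Q", "female"]], dtype=object)

# OrdinalEncoder: 2D in, 2D out, one category array per column.
ord_enc = OrdinalEncoder()
X_encoded = ord_enc.fit_transform(X)
print(X_encoded.shape, ord_enc.categories_)

# LabelEncoder: 1D in, 1D out, a single classes_ array.
lab_enc = LabelEncoder()
embarked_codes = lab_enc.fit_transform(X[:, 0])
print(embarked_codes.shape, lab_enc.classes_)
```

With LabelEncoder, a second column would need a second `fit_transform` call (or a loop), which is the overhead mentioned in point 2.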

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 10]]
enc.fit(X)

OrdinalEncoder()

enc.categories_

[array(['Female', 'Male'], dtype=object), array([1, 3, 10], dtype=object)]

enc.transform([['Female', 1],['Female', 3],['Female', 10],['Male', 10],['Male', 3],['Male', 1]])

array([[0., 0.],
       [0., 1.],
       [0., 2.],
       [1., 2.],
       [1., 1.],
       [1., 0.]])

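As the output shows, both columns were encoded: Male/Female, and also 1, 3, 10. Interestingly, the numbers 1, 3, and 10 were themselves treated as categories.
Each column is encoded on its own, so any of the 2 (sex) * 3 (the numbers 1, 3, 10) = 6 combinations can be transformed.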
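What the encoder does per column can be mimicked in a few lines of plain Python. This is only a sketch of the idea (sorted unique values per column, code = position in that list), with hypothetical names `lookups` and `transform`; it is not how scikit-learn implements it.

```python
# The same rows the example above fits on.
X = [["Male", 1], ["Female", 3], ["Female", 10]]

# One sorted category list per column, like enc.categories_.
categories = [sorted(set(column)) for column in zip(*X)]

# One value -> code lookup per column; codes are positions in the sorted list.
lookups = [{value: float(code) for code, value in enumerate(cats)} for cats in categories]

def transform(rows):
    """Encode each cell using the lookup that belongs to its own column."""
    return [[lookups[j][value] for j, value in enumerate(row)] for row in rows]

encoded = transform([["Female", 1], ["Male", 10], ["Male", 3]])
print(categories)
print(encoded)
```

Since the lookups are independent, a combination that never appeared during fitting (such as Male with 10) still encodes fine as long as each value was seen in its own column.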

################ small talk about OrdinalEncoder() End################

from sklearn.preprocessing import OrdinalEncoder 
pd.options.display.max_columns = None

## Amusingly, the example also uses Titanic data, but its layout differs from the Kaggle version, so for now we just follow along.
impute_data = sns.load_dataset('titanic')
impute_data.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

pd.concat([impute_data.isnull().sum(),impute_data.dtypes],axis=1,keys=['null','dtype'])

	null	dtype
survived	0	int64
pclass	0	int64
sex	0	object
age	177	float64
sibsp	0	int64
parch	0	int64
fare	0	float64
embarked	2	object
class	0	category
who	0	object
adult_male	0	bool
deck	688	category
embark_town	2	object
alive	0	object
alone	0	bool

Having these columns as the pandas category dtype is of little use here, so we convert them to object and stay focused on the goal.

impute_data['deck1'] = impute_data['deck'].astype(object)
impute_data['class1'] = impute_data['class'].astype(object)
impute_data = impute_data.drop(columns=['deck','class'])
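One practical reason the conversion helps, as far as I can tell: a categorical column only accepts values that are already among its categories, so writing numeric codes into it is rejected in the pandas versions I am aware of, while an object column takes them. A minimal sketch with a made-up `deck` Series:

```python
import pandas as pd

# Hypothetical stand-in for the 'deck' column.
deck = pd.Series(["B", None, "C"], dtype="category")

# Writing codes that are not existing categories is expected to be rejected.
try:
    deck.loc[deck.notnull()] = [1.0, 2.0]
    categorical_accepts_codes = True
except (TypeError, ValueError):
    categorical_accepts_codes = False

# After converting to object, the same assignment goes through.
deck_obj = deck.astype(object)
deck_obj.loc[deck_obj.notnull()] = [1.0, 2.0]

print(categorical_accepts_codes)
print(deck_obj.tolist())
```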

print(impute_data.shape)
impute_data.isnull().sum()

(891, 15)
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
who              0
adult_male       0
embark_town      2
alive            0
alone            0
deck1          688
class1           0
dtype: int64

embarked, embark_town, and deck1 are categorical variables that also contain nulls. Let's see how they change through the steps below.

impute_data[impute_data['embarked'].isnull()]

	survived	pclass	sex	age	sibsp	parch	fare	embarked	who	adult_male	embark_town	alive	alone	deck1	class1
61	1	1	female	38.0	0	0	80.0	NaN	woman	False	NaN	yes	True	B	First
829	1	1	female	62.0	0	0	80.0	NaN	woman	False	NaN	yes	True	B	First

impute_data[impute_data['embark_town'].isnull()]

	survived	pclass	sex	age	sibsp	parch	fare	embarked	who	adult_male	embark_town	alive	alone	deck1	class1
61	1	1	female	38.0	0	0	80.0	NaN	woman	False	NaN	yes	True	B	First
829	1	1	female	62.0	0	0	80.0	NaN	woman	False	NaN	yes	True	B	First

impute_data[['embarked','embark_town','deck1']].iloc[60:65]

	embarked	embark_town	deck1
60	C	Cherbourg	NaN
61	NaN	NaN	B
62	S	Southampton	C
63	S	Southampton	NaN
64	C	Cherbourg	NaN

#instantiate both packages to use
encoder = OrdinalEncoder()
imputer = KNN()
# create a list of categorical columns to iterate over
cat_cols = ['embarked','class1','deck1','who','embark_town','sex','adult_male','alive','alone']

def encode(data):
    '''function to encode non-null data and replace it in the original data'''
    #retains only non-null values
    nonulls = np.array(data.dropna())
    #reshapes the data to 2D, the (n_samples, n_features) shape OrdinalEncoder expects
    impute_reshape = nonulls.reshape(-1,1)
    #encode data
    impute_ordinal = encoder.fit_transform(impute_reshape)
    #Assign back encoded values to non-null values
    data.loc[data.notnull()] = np.squeeze(impute_ordinal)
    return data

#loop over the categorical columns, writing each encoded column back explicitly
for columns in cat_cols:
    impute_data[columns] = encode(impute_data[columns].copy())
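For comparison, here is an alternative sketch of the same "encode but keep the nulls" step using only pandas. It relies on category codes, where missing values get the code -1; `encode_keep_nan` and the short Series are hypothetical and are not part of the referenced post.

```python
import numpy as np
import pandas as pd

def encode_keep_nan(col):
    """Map each distinct non-null value to 0..n-1 (sorted order); keep NaN as NaN."""
    codes = col.astype("category").cat.codes.astype(float)  # missing values get -1
    return codes.where(codes >= 0, np.nan)

# Hypothetical column with one missing entry.
embarked = pd.Series(["S", "C", np.nan, "Q", "S"], name="embarked")
encoded = encode_keep_nan(embarked)
print(encoded.tolist())
```

The codes follow sorted order (C, Q, S), which matches what OrdinalEncoder produces for the same values.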


Let's walk through the code above one step at a time.

encoder = OrdinalEncoder()
imputer = KNN()
cat_cols = ['embarked','class1','deck1','who','embark_town','sex','adult_male','alive','alone']

Taking embarked as an example, it consists of the values S, C, and Q.

impute_data.embarked.value_counts()

S    644
C    168
Q     77
Name: embarked, dtype: int64

## Drop the missing entries,
nonulls = np.array(impute_data.embarked.dropna())
print(len(nonulls),nonulls.shape,impute_data.shape)
## then reshape to 2D so that OrdinalEncoder can be applied.
impute_reshape = nonulls.reshape(-1,1)
print(impute_reshape.shape,'\n',impute_reshape[0:5])

889 (889,) (891, 15)
(889, 1) 
 [['S']
 ['C']
 ['S']
 ['S']
 ['S']]

## With the nulls dropped and the data reshaped to 2D, encode it (each category becomes a numeric code).
impute_ordinal = encoder.fit_transform(impute_reshape)
print(impute_ordinal.shape,'\n',impute_ordinal[0:5])

(889, 1) 
 [[2.]
 [0.]
 [2.]
 [2.]
 [2.]]

# Assign the encoded values back to the non-null positions, replacing the original values.
impute_data.loc[impute_data.embarked.notnull(), 'embarked'] = np.squeeze(impute_ordinal) ## squeeze drops the length-1 dimension

impute_data.iloc[60:65] ## The embarked values have been replaced by numeric codes.

	survived	pclass	sex	age	sibsp	parch	fare	embarked	who	adult_male	embark_town	alive	alone	deck1	class1
60	0	3	1.0	22.0	0	0	7.2292	0	1.0	1.0	0	0.0	1.0	NaN	2.0
61	1	1	0.0	38.0	0	0	80.0000	NaN	2.0	0.0	NaN	1.0	1.0	1	0.0
62	0	1	1.0	45.0	1	0	83.4750	2	1.0	1.0	2	0.0	0.0	2	0.0
63	0	3	1.0	4.0	3	2	27.9000	2	0.0	0.0	2	0.0	0.0	NaN	2.0
64	0	1	1.0	NaN	0	0	27.7208	0	1.0	1.0	0	0.0	1.0	NaN	0.0

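In short, the steps above are the first stage: for each categorical column, the nulls are set aside and the remaining values are encoded and written back. This is what the loop over cat_cols that calls encode does.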

## Here the values that are actually null get imputed with KNN.
# impute data and convert 
encode_data = pd.DataFrame(np.round(imputer.fit_transform(impute_data)),columns = impute_data.columns)

Imputing row 1/891 with 1 missing, elapsed time: 0.128
Imputing row 101/891 with 1 missing, elapsed time: 0.130
Imputing row 201/891 with 1 missing, elapsed time: 0.131
Imputing row 301/891 with 2 missing, elapsed time: 0.133
Imputing row 401/891 with 1 missing, elapsed time: 0.135
Imputing row 501/891 with 1 missing, elapsed time: 0.136
Imputing row 601/891 with 1 missing, elapsed time: 0.138
Imputing row 701/891 with 0 missing, elapsed time: 0.140
Imputing row 801/891 with 1 missing, elapsed time: 0.141

encode_data.iloc[60:65]

	survived	pclass	sex	age	sibsp	parch	fare	embarked	who	adult_male	embark_town	alive	alone	deck1	class1
60	0.0	3.0	1.0	22.0	0.0	0.0	7.0	0.0	1.0	1.0	0.0	0.0	1.0	5.0	2.0
61	1.0	1.0	0.0	38.0	0.0	0.0	80.0	0.0	2.0	0.0	0.0	1.0	1.0	1.0	0.0
62	0.0	1.0	1.0	45.0	1.0	0.0	83.0	2.0	1.0	1.0	2.0	0.0	0.0	2.0	0.0
63	0.0	3.0	1.0	4.0	3.0	2.0	28.0	2.0	0.0	0.0	2.0	0.0	0.0	4.0	2.0
64	0.0	1.0	1.0	40.0	0.0	0.0	28.0	0.0	1.0	1.0	0.0	0.0	1.0	2.0	0.0
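The imputed frame still holds codes rather than labels. Here is a sketch of how codes could be mapped back, assuming a separate encoder is kept per column (the single `encoder` above is refit on every column, so after the loop it only remembers the last one). The values in `imputed_codes` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# A dedicated encoder for one column, fitted on its observed labels.
enc = OrdinalEncoder()
enc.fit(np.array([["S"], ["C"], ["Q"]], dtype=object))

# Hypothetical imputer output for that column: averages need not be whole numbers.
imputed_codes = np.array([[0.4], [1.6], [2.0]])

# Round to the nearest code and clip so it stays a valid category index.
n_categories = len(enc.categories_[0])
rounded = np.clip(np.round(imputed_codes), 0, n_categories - 1)

labels = enc.inverse_transform(rounded).ravel()
print(labels.tolist())
```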

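To summarize the process:

Set the nulls aside and apply the ordinal encoder (the categorical columns become numeric at this point, so they unavoidably acquire an order).
Then fill the nulls with the KNN imputer.
Categorical columns ending up with an ordering like this does not look particularly desirable.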
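A tiny worked example of why the imposed order can hurt. It assumes a plain average over neighbors for simplicity (the actual imputer may weight neighbors differently), and the neighbor list is made up.

```python
from collections import Counter

# Codes as an ordinal encoding in sorted order would assign them for embarked.
codes = {"C": 0, "Q": 1, "S": 2}
labels = {code: name for name, code in codes.items()}

# Hypothetical nearest neighbours of a passenger whose embarked is missing.
neighbours = ["C", "S", "S"]

# Averaging the codes and rounding, similar in spirit to np.round(imputer output).
mean_code = sum(codes[n] for n in neighbours) / len(neighbours)
by_mean = labels[round(mean_code)]

# A majority vote over the same neighbours ignores the artificial order.
by_vote = Counter(neighbours).most_common(1)[0][0]

print(by_mean, by_vote)
```

The average lands on Q, a port none of the neighbors embarked from, only because Q happens to sit between C and S in the encoding; the vote picks S.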
