Handling_missingval_using_knn_02


title: "Handling Missing Categorical Values Using Hamming Distance"
excerpt: "Testing with scikit-learn or a handmade function"
categories:
  • Useful_Fuction
tags:
  • KNN
  • OrdinalEncoder
  • Categorical Missing Value
last_modified_at: 2020-05-23T08:06:00-05:00

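The previous post was interesting in that it used the KNNImputer available in scikit-learn 0.23.1, but it fell a little short of the original goal, which was handling categorical variables.
This time I dug up material on a different approach that applies a notion of distance to categorical variables, and I am posting it here.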

Reference

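Three types of missing values

1. MCAR (missing completely at random): the missing values of a variable occur completely at random.

MCAR holds when the probability of missing data on a variable X is unrelated to other measured variables and to the values of X itself.

2. MAR (missing at random): whether a variable is missing is related to other variables in the data.

For example, whether an academic achievement score is missing is related to income level (children from low-income households tend not to report their score). Another example: men often skip survey questions about depression, but those missing values are unrelated to the actual severity of depression.

3. MNAR (missing not at random, non-ignorable): whether a value is missing is determined by the value of that variable itself.

For example, children with low academic achievement do not report their achievement, or people decline a depression survey depending on how depressed they actually are.

Most analyses and missing-data methods (FIML in SEM, EM, multiple imputation) assume MCAR or MAR.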
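The three mechanisms can be made concrete with a small simulation. This is only a sketch on synthetic data: the variable names (`income`, `score`), the thresholds, and the missingness probabilities are all assumptions chosen for illustration and do not come from the Titanic data used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50, 10, n)   # fully observed covariate (hypothetical)
score = rng.normal(70, 15, n)    # variable that will receive missing values

# MCAR: every row has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed covariate (low income)
mar_mask = rng.random(n) < np.where(income < 45, 0.5, 0.05)

# MNAR: missingness depends on the value that goes missing (low score)
mnar_mask = rng.random(n) < np.where(score < 60, 0.5, 0.05)

df = pd.DataFrame({
    "income": income,
    "score_mcar": np.where(mcar_mask, np.nan, score),
    "score_mar": np.where(mar_mask, np.nan, score),
    "score_mnar": np.where(mnar_mask, np.nan, score),
})

print(df.isnull().mean())                        # share of missing values per column
print(score.mean(), df["score_mnar"].mean())     # MNAR biases the observed mean
```

Under MNAR the observed mean of `score_mnar` drifts upward because low scores are removed more often, which is why methods that assume MCAR or MAR can mislead in that case.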

Distance of Categorical Variable

Unless the categories are labeled as numbers, applying a notion of distance to categorical variables comes down to frequency and similarity.
This post covers the Hamming distance and the weighted Hamming distance.

Hamming distance

  • Takes all categorical attributes and counts 1 for each attribute whose value differs between two points (rows). The Hamming distance is the number of attributes whose values differ.
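As a minimal sketch of that definition, the helper below counts differing attributes between two rows. The toy rows and the name `hamming_count` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the column names mimic the Titanic data
rows = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["C", "S", "S"],
    "Pclass": ["3", "1", "3"],
})

def hamming_count(a, b):
    """Number of attributes whose values differ between two rows."""
    return int((a != b).sum())

print(hamming_count(rows.iloc[0], rows.iloc[1]))  # all three attributes differ
print(hamming_count(rows.iloc[0], rows.iloc[2]))  # only Embarked differs
```

Note that `scipy.spatial.distance.hamming` and `cdist(..., metric="hamming")` return the proportion of differing attributes rather than the count, so their values lie between 0 and 1.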

Weighted Hamming distance

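  • Returns 1 when the values differ; when they match, it returns the relative frequency of that value, so the distance is larger when the shared category is more frequent.
  • When two or more attributes are categorical, the harmonic mean is applied.
  • The result stays between 0 and 1, but the mean shifts toward lower values than the arithmetic mean would give.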
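The three points above can be checked by hand on a tiny example. The sketch below is not the function used later in this post; it is an independent toy version with made-up data (`toy`) that applies the same rule: 1 for a mismatch, the relative frequency for a match, and the harmonic mean across attributes.

```python
import numpy as np
import pandas as pd
from scipy.stats import hmean

# Hypothetical toy data with two categorical attributes
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C"],
    "Sex": ["male", "female", "male", "male"],
})

per_attribute = []
for col in toy:
    values = toy[col].to_numpy()
    freq = toy[col].value_counts(normalize=True)     # relative frequency per category
    same = values[:, None] == values[None, :]        # pairwise equality, shape (n, n)
    row_freq = freq.loc[values].to_numpy()[:, None]  # frequency of each row's category
    per_attribute.append(np.where(same, row_freq, 1.0))

# Harmonic mean across attributes gives one (n, n) distance matrix
dist = hmean(np.array(per_attribute), axis=0)
print(np.round(dist, 3))
```

Rows 0 and 2 match on both attributes with frequency 0.75, so their distance is 0.75. Rows 0 and 1 match on only one attribute, so the harmonic mean of 0.75 and 1 gives 6/7, about 0.857. Rows 1 and 3 differ everywhere and get 1.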
import numpy as np
import pandas as pd
from collections import defaultdict
from scipy.stats import hmean
from scipy.spatial.distance import cdist
from scipy import stats
import numbers
import warnings
warnings.filterwarnings("ignore")
train = pd.read_csv('../dataset/titanic_train.csv')
test = pd.read_csv('../dataset/titanic_test.csv')
# train = pd.read_csv('../ML_Area/data_source/titanic_train.csv')
# test = pd.read_csv('../ML_Area/data_source/titanic_test.csv')
# ML_Area\data_source
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set() # setting seaborn default for plots
pd.concat([train.isnull().sum(),train.dtypes],axis=1,keys=['null','dtype'])
null dtype
PassengerId 0 int64
Survived 0 int64
Pclass 0 int64
Name 0 object
Sex 0 object
Age 177 float64
SibSp 0 int64
Parch 0 int64
Ticket 0 object
Fare 0 float64
Cabin 687 object
Embarked 2 object

Computing distances between categorical values

not_cat_df = train[[col for col in train.columns if col not in ['Cabin','Embarked']]].copy() ## df without Cabin and Embarked
int_col = train.columns[train.dtypes != 'object'].to_list()
cat_col = train.columns[train.dtypes == 'object'].to_list()
train.loc[train["Embarked"].isnull(),:]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN
## rows to check again later
train.iloc[60:65]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
60 61 0 3 Sirayanian, Mr. Orsen male 22.0 0 0 2669 7.2292 NaN C
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0000 B28 NaN
62 63 0 1 Harris, Mr. Henry Birkhardt male 45.0 1 0 36973 83.4750 C83 S
63 64 0 3 Skoog, Master. Harald male 4.0 3 2 347088 27.9000 NaN S
64 65 0 1 Stewart, Mr. Albert A male NaN 0 0 PC 17605 27.7208 NaN C
def weighted_hamming(data):
    """ Compute weighted hamming distance on categorical variables. For one variable, it is equal to 1 if
        the values between point A and point B are different, else it is equal the relative frequency of the
        distribution of the value across the variable. For multiple variables, the harmonic mean is computed
        up to a constant factor.
        @params:
            - data = a pandas data frame of categorical variables
        @returns:
            - distance_matrix = a distance matrix with pairwise distance for all attributes
    """
    categories_dist = []
    
    for category in data:
        X = pd.get_dummies(data[category])
        X_mean = X * X.mean()
        X_dot = X_mean.dot(X.transpose())
        X_np = np.asarray(X_dot.replace(0,1,inplace=False))
        categories_dist.append(X_np)
    categories_dist = np.array(categories_dist)
    distances = hmean(categories_dist, axis=0)
    return distances

Weighted Hamming distance

data_categorical = ['Embarked']
train[data_categorical].shape
(891, 1)
X = pd.get_dummies(train['Embarked'])
X[60:65] ## Row 61 (NaN) is simply left as an all-zero row. That happens only because get_dummies is called on its own here; in the real flow the missing value is filled with another value first.
C Q S
60 1 0 0
61 0 0 0
62 0 0 1
63 0 0 1
64 1 0 0
X = pd.get_dummies(train['Embarked'])
print(X.shape,'\n',X.iloc[60:65]) ## Three category values, so three columns.
X_mean = X * X.mean() ## DataFrame * Series: each column (C, Q, S) is multiplied by its own column mean.
X_dot = X_mean.dot(X.transpose()) ## Taking the dot product with its own transpose gives an n-by-n square matrix.
print(X_dot.iloc[60:65,0:10])
(891, 3) 
     C  Q  S
60  1  0  0
61  0  0  0
62  0  0  1
63  0  0  1
64  1  0  0
           0         1         2         3         4    5         6         7  \
60  0.000000  0.188552  0.000000  0.000000  0.000000  0.0  0.000000  0.000000   
61  0.000000  0.000000  0.000000  0.000000  0.000000  0.0  0.000000  0.000000   
62  0.722783  0.000000  0.722783  0.722783  0.722783  0.0  0.722783  0.722783   
63  0.722783  0.000000  0.722783  0.722783  0.722783  0.0  0.722783  0.722783   
64  0.000000  0.188552  0.000000  0.000000  0.000000  0.0  0.000000  0.000000   

           8         9  
60  0.000000  0.188552  
61  0.000000  0.000000  
62  0.722783  0.000000  
63  0.722783  0.000000  
64  0.000000  0.188552  
X_np = np.asarray(X_dot.replace(0,1,inplace=False)) ## replace every 0 with 1
print(X_np[60:65,0:10])
[[1.         0.18855219 1.         1.         1.         1.
  1.         1.         1.         0.18855219]
 [1.         1.         1.         1.         1.         1.
  1.         1.         1.         1.        ]
 [0.72278339 1.         0.72278339 0.72278339 0.72278339 1.
  0.72278339 0.72278339 0.72278339 1.        ]
 [0.72278339 1.         0.72278339 0.72278339 0.72278339 1.
  0.72278339 0.72278339 0.72278339 1.        ]
 [1.         0.18855219 1.         1.         1.         1.
  1.         1.         1.         0.18855219]]
categories_dist = []
categories_dist.append(X_np)
categories_dist = np.array(categories_dist)
distances = hmean(categories_dist, axis=0) ## Calculates the harmonic mean along the specified axis.

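Reading the output: the entries closest to row 60 are columns 1 and 9, both with a value of 0.18855.
Looking at those rows below, they share the Embarked value C with row 60. That is expected, since every pair with distance 0.18855 is a pair of rows that both have Embarked = C.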

train.iloc[[1,9,60],:]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
60 61 0 3 Sirayanian, Mr. Orsen male 22.0 0 0 2669 7.2292 NaN C

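The value_counts() result shows that S has the largest share, followed by C and then Q. These frequencies also affect the distance: rows that share a relatively frequent category such as S get a larger distance (0.7228) than rows that share a rarer one such as C (0.1886).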

X.mean()
C    0.188552
Q    0.086420
S    0.722783
dtype: float64
train.Embarked.value_counts() ## 
S    644
C    168
Q     77
Name: Embarked, dtype: int64
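The weights printed by X.mean() can be reproduced from the counts by simple arithmetic. The sketch below hard-codes the numbers shown above; the denominator is all 891 rows, because the two rows with a missing Embarked become all-zero rows in the one-hot frame and still count toward the mean.

```python
counts = {"S": 644, "C": 168, "Q": 77}  # taken from value_counts() above
n_rows = 891                            # includes the two rows with missing Embarked

freq = {k: v / n_rows for k, v in counts.items()}
for k, v in freq.items():
    print(k, round(v, 6))
```

These are the values used as distances between rows that share the same category.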

The distance_matrix function below is actually the main function for computing distances; the weighted_hamming function above was split out for the weighted Hamming case.
distance_matrix also returns a distance matrix.

def distance_matrix(data, numeric_distance = "euclidean", categorical_distance = "jaccard"): ## the Jaccard distance is discussed separately below.
    """ Compute the pairwise distance attribute by attribute in order to account for different variables type:
        - Continuous
        - Categorical
        For ordinal values, provide a numerical representation taking the order into account.
        Categorical variables are transformed into a set of binary ones.
        If both continuous and categorical distance are provided, a Gower-like distance is computed and the numeric
        variables are all normalized in the process.
        If there are missing values, the mean is computed for numerical attributes and the mode for categorical ones.
        
        Note: If weighted-hamming distance is chosen, the computation time increases a lot since it is not coded in C 
        like other distance metrics provided by scipy.
        @params:
            - data                  = pandas dataframe to compute distances on.
            - numeric_distances     = the metric to apply to continuous attributes.
                                      "euclidean" and "cityblock" available.
                                      Default = "euclidean"
            - categorical_distances = the metric to apply to binary attributes.
                                      "jaccard", "hamming", "weighted-hamming" and "euclidean"
                                      available. Default = "jaccard"
        @returns:
            - the distance matrix
    """
    possible_continuous_distances = ["euclidean", "cityblock"]
    possible_binary_distances = ["euclidean", "jaccard", "hamming", "weighted-hamming"]
    number_of_variables = data.shape[1]
    number_of_observations = data.shape[0]

    # Get the type of each attribute (Numeric or categorical)
    is_numeric = [all(isinstance(n, numbers.Number) for n in data.iloc[:, i]) for i, x in enumerate(data)]
    is_all_numeric = sum(is_numeric) == len(is_numeric)
    is_all_categorical = sum(is_numeric) == 0
    is_mixed_type = not is_all_categorical and not is_all_numeric

    # Check the content of the distances parameter
    if numeric_distance not in possible_continuous_distances:
        print("The continuous distance " + numeric_distance + " is not supported.")
        return None
    elif categorical_distance not in possible_binary_distances:
        print("The binary distance " + categorical_distance + " is not supported.")
        return None

    # Separate the data frame into categorical and numeric attributes and normalize numeric data
    if is_mixed_type:
        number_of_numeric_var = sum(is_numeric)
        number_of_categorical_var = number_of_variables - number_of_numeric_var
        data_numeric = data.iloc[:, is_numeric]
        data_numeric = (data_numeric - data_numeric.mean()) / (data_numeric.max() - data_numeric.min())
        data_categorical = data.iloc[:, [not x for x in is_numeric]]

    # Replace missing values with column mean for numeric values and mode for categorical ones. With the mode, it
    # triggers a warning: "SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame"
    # but the value are properly replaced
    ## Missing categorical values are first replaced with the first mode() value.
    if is_mixed_type:
        data_numeric.fillna(data_numeric.mean(), inplace=True) ## Missing numeric values are first replaced with the column mean.
        for x in data_categorical:
            data_categorical[x].fillna(data_categorical[x].mode()[0], inplace=True)
    elif is_all_numeric:
        data.fillna(data.mean(), inplace=True)
    else:
        for x in data:
            data[x].fillna(data[x].mode()[0], inplace=True)

    # "Dummifies" categorical variables in place
    if not is_all_numeric and not (categorical_distance == 'hamming' or categorical_distance == 'weighted-hamming'):
        if is_mixed_type:
            data_categorical = pd.get_dummies(data_categorical)
        else:
            data = pd.get_dummies(data)
    elif not is_all_numeric and categorical_distance == 'hamming':
        if is_mixed_type:
            data_categorical = pd.DataFrame([pd.factorize(data_categorical[x])[0] for x in data_categorical]).transpose()
        else:
            data = pd.DataFrame([pd.factorize(data[x])[0] for x in data]).transpose()

    if is_all_numeric:
        result_matrix = cdist(data, data, metric=numeric_distance)
    elif is_all_categorical:
        if categorical_distance == "weighted-hamming":
            result_matrix = weighted_hamming(data)
        else:
            result_matrix = cdist(data, data, metric=categorical_distance)
    else:
        result_numeric = cdist(data_numeric, data_numeric, metric=numeric_distance)
        if categorical_distance == "weighted-hamming":
            result_categorical = weighted_hamming(data_categorical)
        else:
            result_categorical = cdist(data_categorical, data_categorical, metric=categorical_distance)
        result_matrix = np.array([[1.0*(result_numeric[i, j] * number_of_numeric_var + result_categorical[i, j] *
                               number_of_categorical_var) / number_of_variables for j in range(number_of_observations)] for i in range(number_of_observations)])

    # Fill the diagonal with NaN values
    np.fill_diagonal(result_matrix, np.nan)

    return pd.DataFrame(result_matrix)
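The mixed-type branch of distance_matrix weights the numeric and categorical distance matrices by the number of variables of each kind. A vectorized sketch of that step on made-up data is shown below; the arrays `numeric` and `onehot` are hypothetical and assumed to be already scaled and one-hot encoded.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy data: two scaled numeric columns, one categorical
# variable that has been one-hot encoded into three binary columns
numeric = np.array([[0.0, 0.1], [0.5, 0.4], [1.0, 0.9]])
onehot = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1]], dtype=bool)

d_num = cdist(numeric, numeric, metric="euclidean")
d_cat = cdist(onehot, onehot, metric="jaccard")  # 0 if same category, 1 if different

n_num, n_cat = 2, 1  # counts of the original numeric / categorical variables
combined = (d_num * n_num + d_cat * n_cat) / (n_num + n_cat)
print(np.round(combined, 3))
```

With a single one-hot encoded variable, the Jaccard distance between two rows is 0 when they share the category and 1 otherwise. The element-wise NumPy expression computes the same weighted average as the nested list comprehension in the function, without the Python-level double loop.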


print(train.shape)
train.iloc[60:65]
(891, 12)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
60 61 0 3 Sirayanian, Mr. Orsen male 22.0 0 0 2669 7.2292 NaN C
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0000 B28 NaN
62 63 0 1 Harris, Mr. Henry Birkhardt male 45.0 1 0 36973 83.4750 C83 S
63 64 0 3 Skoog, Master. Harald male 4.0 3 2 347088 27.9000 NaN S
64 65 0 1 Stewart, Mr. Albert A male NaN 0 0 PC 17605 27.7208 NaN C

In the train set, ['Age', 'Cabin', 'Embarked'] contain null values.

################ Exploring the distance_matrix function: Start ################

tmp01 = [(i,x) for i, x in enumerate(train)] ## enumerating a DataFrame yields its column names in order.
print(tmp01)
[(0, 'PassengerId'), (1, 'Survived'), (2, 'Pclass'), (3, 'Name'), (4, 'Sex'), (5, 'Age'), (6, 'SibSp'), (7, 'Parch'), (8, 'Ticket'), (9, 'Fare'), (10, 'Cabin'), (11, 'Embarked')]
[all(isinstance(n, numbers.Number) for n in train.iloc[:, i]) for i, x in enumerate(train)]
[True, True, True, False, False, True, True, True, False, True, False, False]
train['Embarked'].mode()[0] ## mode() always returns a Series; indexing it with [0] gives the first value.
'S'
print(train['Embarked'].value_counts())
# print(train['Cabin'].value_counts())
print(len(pd.factorize(train['Embarked'])[0]))
pd.factorize(train['Embarked'])[0][0:10] ## Same idea as converting to a factor in R: the labels are encoded as integers.
S    644
C    168
Q     77
Name: Embarked, dtype: int64
891
array([0, 1, 0, 0, 0, 2, 0, 0, 0, 1], dtype=int64)
data_categorical = ['Cabin','Embarked']
pd.DataFrame([pd.factorize(train[x])[0] for x in data_categorical])
0 1 2 3 4 5 6 7 8 9 ... 881 882 883 884 885 886 887 888 889 890
0 -1 0 -1 1 -1 -1 2 -1 -1 -1 ... -1 -1 -1 -1 -1 -1 145 -1 146 -1
1 0 1 0 0 0 2 0 0 0 1 ... 0 0 0 0 2 0 0 0 1 2

2 rows × 891 columns
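The -1 entries in the output above are how pd.factorize marks missing values by default. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(["S", "C", np.nan, "S", "Q"])  # hypothetical sample with one NaN
codes, uniques = pd.factorize(s)

print(codes)          # integer code per row, -1 for the missing value
print(list(uniques))  # categories in order of first appearance
```

Inside distance_matrix the missing values are filled with the mode before factorize is called, so the -1 code should only show up when factorize is applied to the raw column, as in this exploration.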

a
C Q S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
a * a.mean()
C Q S
0 0.0 0.0 0.8
1 0.2 0.0 0.0
2 0.0 0.0 0.8
3 0.0 0.0 0.8
4 0.0 0.0 0.8
categories_dist.shape
(1, 891, 891)

################ Exploring the distance_matrix function: End ################

def knn_impute(target, attributes, k_neighbors, aggregation_method="mean", numeric_distance="euclidean",
               categorical_distance="jaccard", missing_neighbors_threshold = 0.5):
    """ Replace the missing values within the target variable based on its k nearest neighbors identified with the
        attributes variables. If more than 50% of its neighbors are also missing values, the value is not modified and
        remains missing. If there is a problem in the parameters provided, returns None.
        If too many neighbors also have missing values, leave the missing value of interest unchanged.
        @params:
            - target                        = a vector of n values with missing values that you want to impute. The length has
                                              to be at least n = 3.
            - attributes                    = a data frame of attributes with n rows to match the target variable
            - k_neighbors                   = the number of neighbors to look at to impute the missing values. It has to be a
                                              value between 1 and n.
            - aggregation_method            = how to aggregate the values from the nearest neighbors (mean, median, mode)
                                              Default = "mean"
            - numeric_distance              = the metric to apply to continuous attributes.
                                              "euclidean" and "cityblock" available.
                                              Default = "euclidean"
            - categorical_distance          = the metric to apply to binary attributes.
                                              "jaccard", "hamming", "weighted-hamming" and "euclidean"
                                              available. Default = "jaccard"
            - missing_neighbors_threshold   = minimum of neighbors among the k ones that are not also missing to infer
                                              the correct value. Default = 0.5
        @returns:
            target_completed        = the vector of target values with missing value replaced. If there is a problem
                                      in the parameters, return None
    """

    # Get useful variables
    possible_aggregation_method = ["mean", "median", "mode"]
    number_observations = len(target)
    is_target_numeric = all(isinstance(n, numbers.Number) for n in target)

    # Check for possible errors
    if number_observations < 3:
        print( "Not enough observations.")
        return None
    if attributes.shape[0] != number_observations:
        print("The number of observations in the attributes variable is not matching the target variable length.")
        return None
    if k_neighbors > number_observations or k_neighbors < 1:
        print("The range of the number of neighbors is incorrect.")
        return None
    if aggregation_method not in possible_aggregation_method:
        print("The aggregation method is incorrect.")
        return None
    if not is_target_numeric and aggregation_method != "mode":
        print("The only method allowed for categorical target variable is the mode.")
        return None

    # Make sure the data are in the right format
    target = pd.DataFrame(target)
    attributes = pd.DataFrame(attributes)

    # Get the distance matrix and check whether no error was triggered when computing it
    distances = distance_matrix(attributes, numeric_distance, categorical_distance) ## distances are computed without the target column.
    if distances is None:
        return None

    # Get the closest points and compute the correct aggregation method
    for i, value in enumerate(target.iloc[:, 0]):
        if pd.isnull(value):
            order = distances.iloc[i,:].values.argsort()[:k_neighbors] ## argsort() returns the indices that sort the values in ascending order, i.e. the indices of the closest rows.
            closest_to_target = target.iloc[order, :]
#             print("closest_to_target \n",closest_to_target)
#             print(type(closest_to_target),closest_to_target.shape)
            missing_neighbors = [x for x  in closest_to_target.isnull().iloc[:, 0]]
            # Compute the right aggregation method if at least more than 50% of the closest neighbors are not missing
            if sum(missing_neighbors) >= missing_neighbors_threshold * k_neighbors:
                continue
            elif aggregation_method == "mean":
                target.iloc[i] = np.ma.mean(np.ma.masked_array(closest_to_target,np.isnan(closest_to_target)))
            elif aggregation_method == "median":
                target.iloc[i] = np.ma.median(np.ma.masked_array(closest_to_target,np.isnan(closest_to_target)))
            else: ## the aggregation_method == "mode" case
#                 target.iloc[i] = stats.mode(closest_to_target, nan_policy='omit')[0][0] ## stats.mode accepts a DataFrame, but it handles NaN poorly.
                ## closest_to_target is a DataFrame anyway and the goal here is just to drop NaN, so it is replaced as below.
                ## Note: this takes the value of the nearest non-missing neighbor, not a majority vote.
                closest_to_target = closest_to_target.dropna()
                target.iloc[i] = closest_to_target.iloc[0, 0]

    return target
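The neighbor selection inside knn_impute relies on two details: the diagonal of the distance matrix is NaN, and argsort places NaN last, so a row never picks itself. The sketch below walks through that on a made-up 4 x 4 distance matrix and also contrasts the nearest-neighbor value with a majority vote. The names `dist`, `majority`, and `nearest` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric distance matrix for four rows
dist = np.array([
    [0.0, 0.2, 0.9, 0.4],
    [0.2, 0.0, 0.5, 0.7],
    [0.9, 0.5, 0.0, 0.1],
    [0.4, 0.7, 0.1, 0.0],
])
np.fill_diagonal(dist, np.nan)   # a row should not be its own neighbor

target = pd.Series(["S", np.nan, "C", "C"])

i, k = 1, 3                      # impute row 1 from its 3 nearest neighbors
order = np.argsort(dist[i])[:k]  # NaN sorts last, so row 1 itself is skipped
neighbors = target.iloc[order].dropna()

nearest = neighbors.iloc[0]          # value of the closest non-missing neighbor
majority = neighbors.mode().iloc[0]  # most frequent value among the neighbors

print(order, nearest, majority)
```

Here the nearest neighbor says S while the majority of the three neighbors says C, so the two strategies can disagree. Which one is appropriate depends on the data; the function above uses the nearest non-missing value.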

Using knn_impute in practice

Filling Age nulls - numeric column

print("before\n",train.loc[5:10,'Age'])
before
 5      NaN
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
Name: Age, dtype: float64
new_train = knn_impute(target=train['Age'], attributes=train.drop(['Age', 'PassengerId'], axis=1),\
                       aggregation_method="median", k_neighbors=10, numeric_distance='euclidean',\
                       categorical_distance='hamming', missing_neighbors_threshold=0.8)
print(new_train.shape)
(891, 1)
print("after\n",new_train.loc[5:10,'Age'])
after
 5     48.5
6     54.0
7      2.0
8     27.0
9     14.0
10     4.0
Name: Age, dtype: float64

Filling Embarked nulls - categorical column

print("before\n",train.loc[60:65,'Embarked'])
before
 60      C
61    NaN
62      S
63      S
64      C
65      C
Name: Embarked, dtype: object
## data_categorical = ['Cabin','Embarked']
## ["euclidean", "jaccard", "hamming", "weighted-hamming"]
new_train_Embarked = knn_impute(target=train['Embarked'], attributes=train.drop(['Embarked', 'PassengerId'], axis=1),\
                       aggregation_method="mode", k_neighbors=10, numeric_distance='euclidean',\
                       categorical_distance='jaccard', missing_neighbors_threshold=0.8)
print("after\n",new_train_Embarked.loc[60:65,'Embarked'])
after
 60    C
61    S
62    S
63    S
64    C
65    C
Name: Embarked, dtype: object

Row 61 has been filled with S, the value of its nearest neighbor.
