What is the difference between the principle of PCA and the principle of SVD?
Principal component analysis
References
- PCA concepts (1): https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/
- PCA concepts (2): https://darkpgmr.tistory.com/104
- Code: based on the StatQuest YouTube video "PCA in Python"
- Poisson distribution blog: https://blog.naver.com/PostView.nhn?blogId=mykepzzang&logNo=220840724901
Loading Libraries
import pandas as pd
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt # NOTE: This was tested with matplotlib v. 2.1.0
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 0.22.1.
Data Loading Code
## load iris
from sklearn.datasets import load_iris
data = load_iris()
print(data.keys())
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
iris = data.data
y_target = data.target
y_target_name = data.target_names
print(data.filename)
C:\ProgramData\Anaconda3\envs\gpu_test\lib\site-packages\sklearn\datasets\data\iris.csv
# features = ['sepal length', 'sepal width', 'petal length', 'petal width']
from sklearn.preprocessing import StandardScaler
# Separating out the features
x = iris
# Separating out the target
y = y_target
# Standardizing the features
x = StandardScaler().fit_transform(x)
get_covariance() in scikit-learn PCA
pca01 = PCA(n_components=2,svd_solver='full')
decompose_x_01 = pca01.fit_transform(x)
print(pca01.components_,'\n',"pca01.components_ shape :{}".format(pca01.components_.shape))
print('\n',pca01.explained_variance_)
print('\n',pca01.singular_values_)
[[ 0.52106591 -0.26934744 0.5804131 0.56485654]
[ 0.37741762 0.92329566 0.02449161 0.06694199]]
pca01.components_ shape :(2, 4)
[2.93808505 0.9201649 ]
[20.92306556 11.7091661 ]
Compute data covariance with the generative model.
cov = components_.T * S^2 * components_ + sigma2 * eye(n_features)
where S^2 contains the explained variances, and sigma2 contains the noise variances.
The confusing part of this description is that S itself is not the explained variances: S^2 equals the explained variances.
eye: in numpy, np.eye(n_features) is the identity matrix, i.e. a diagonal matrix with ones on the diagonal.
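To make the S versus S^2 distinction concrete, here is a small check (my addition, reusing the pca01 object and the standardized x from above): the explained variances are the squared singular values divided by n_samples - 1.
n_samples = x.shape[0]   # 150 rows of standardized iris data
## S^2 in the docstring formula corresponds to explained_variance_
print(np.allclose(pca01.singular_values_ ** 2 / (n_samples - 1),
                  pca01.explained_variance_))   # expected: True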
pca01.noise_variance_
: Equal to the average of (min(n_features, n_samples) - n_components) smallest eigenvalues of the covariance matrix of X.
pca01.noise_variance_
0.08429784161070522
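As a quick sanity check of that definition (my addition, not part of the original notebook), noise_variance_ should equal the mean of the (4 - 2) smallest eigenvalues of the covariance matrix of the standardized data:
## eigenvalues of the 4x4 covariance matrix, sorted in descending order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(x.T)))[::-1]
print(eigvals[2:].mean())        # mean of the two dropped eigenvalues, ~0.0843
print(pca01.noise_variance_)     # 0.08429784161070522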
pca01.components_.T.shape
(4, 2)
np.diag(pca01.explained_variance_).shape
(2, 2)
pca01.components_.shape
(2, 4)
## the docstring formula by hand: components_.T @ diag(explained_variance_) @ components_ + noise_variance_ * I
k = np.matmul(pca01.components_.T,np.diag(pca01.explained_variance_))
k2 = np.matmul(k,pca01.components_)
k3 = pca01.noise_variance_ * np.eye(x.shape[1])
k2
array([[ 0.92879058, -0.09170562, 0.89708089, 0.88800724],
[-0.09170562, 0.99756979, -0.43851133, -0.39013523],
[ 0.89708089, -0.43851133, 0.99033217, 0.96476019],
[ 0.88800724, -0.39013523, 0.96476019, 0.94155742]])
k3
array([[0.08429784, 0. , 0. , 0. ],
[0. , 0.08429784, 0. , 0. ],
[0. , 0. , 0.08429784, 0. ],
[0. , 0. , 0. , 0.08429784]])
np.round(k2+k3,6)
array([[ 1.013088, -0.091706, 0.897081, 0.888007],
[-0.091706, 1.081868, -0.438511, -0.390135],
[ 0.897081, -0.438511, 1.07463 , 0.96476 ],
[ 0.888007, -0.390135, 0.96476 , 1.025855]])
np.round(pca01.get_covariance(),6)
array([[ 0.978193, -0.10925 , 0.870807, 0.861066],
[-0.10925 , 1.00389 , -0.427239, -0.38252 ],
[ 0.870807, -0.427239, 1.046181, 0.936985],
[ 0.861066, -0.38252 , 0.936985, 0.998581]])
Decomposing with the full set of components in this way reproduces get_covariance() exactly.
When the number of components is reduced, the two matrices are close but not identical: get_covariance() subtracts noise_variance_ from the explained variances before reconstructing, so the simple formula above only matches exactly when all 4 components are kept (noise_variance_ is then 0).
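A minimal sketch of that full-component case (my addition; pca_full is just an illustrative variable name): with n_components=4 the noise term vanishes, and the manual reconstruction matches both get_covariance() and the sample covariance of the standardized data.
pca_full = PCA(n_components=4, svd_solver='full').fit(x)
## noise_variance_ is 0 when all components are kept, so no noise term is needed
cov_manual = pca_full.components_.T @ np.diag(pca_full.explained_variance_) @ pca_full.components_
print(np.allclose(cov_manual, pca_full.get_covariance()))   # expected: True
print(np.allclose(cov_manual, np.cov(x.T)))                 # expected: True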
Why the PCA docs mention SVD, and a comparison in practice
When computing PCA, scikit-learn uses the singular value decomposition (SVD) of the data matrix rather than the eigenvalue decomposition of its covariance matrix.
Source: https://excelsior-cjh.tistory.com/167 [EXCELSIOR]
The reason, as discussed on Stack Overflow, is that eigenvalue decomposition requires the covariance matrix to be held in memory, whereas SVD operates on the data matrix directly and never needs to store the covariance matrix, which makes it more efficient.
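As a rough illustration of what "no covariance matrix in memory" means (my own sketch, not scikit-learn's actual implementation), PCA can be computed by applying SVD directly to the centered data matrix:
## PCA via SVD of the centered data; the covariance matrix is never formed
x_centered = x - x.mean(axis=0)                 # x is already standardized, so this is a no-op here
u_, s_, vh_ = np.linalg.svd(x_centered, full_matrices=False)
components = vh_                                # principal axes, as rows
explained_var = s_ ** 2 / (x.shape[0] - 1)      # same as explained_variance_
scores = x_centered @ vh_.T                     # same as fit_transform(x), up to sign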
print(x.shape)
(150, 4)
PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
The above are the default values; accordingly, the svd_solver used below is 'auto'.
But why is the parameter named svd_solver in the first place?
svd_solver : {‘auto’, ‘full’, ‘arpack’, ‘randomized’}
auto (default); a sketch of this selection rule follows this list
- The solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
full
- run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
arpack
- run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
randomized
- run randomized SVD by the method of Halko et al.
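Here is a rough sketch of the 'auto' selection rule described above (a simplification written from the documentation, not scikit-learn's actual internal code):
def choose_solver(n_samples, n_features, n_components):
    ## 'randomized' only when the data is larger than 500x500 and n_components is
    ## below 80% of the smallest dimension; otherwise the exact full SVD
    if max(n_samples, n_features) > 500 and n_components < 0.8 * min(n_samples, n_features):
        return 'randomized'
    return 'full'

print(choose_solver(150, 4, 2))   # 'full', because iris is far smaller than 500x500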
from sklearn.decomposition import PCA
pca00 = PCA(n_components=2,svd_solver='auto') ## condition 1 above (input larger than 500x500) is not met, so the exact full SVD is computed
principalComponents = pca00.fit_transform(x)
print(pca00.components_,'\n',"pca00.components_ shape :{}".format(pca00.components_.shape))
print('\n',pca00.explained_variance_)
print('\n',pca00.singular_values_)
[[ 0.52106591 -0.26934744 0.5804131 0.56485654]
[ 0.37741762 0.92329566 0.02449161 0.06694199]]
pca00.components_ shape :(2, 4)
[2.93808505 0.9201649 ]
[20.92306556 11.7091661 ]
pca01 = PCA(n_components=2,svd_solver='full')
decompose_x_01 = pca01.fit_transform(x)
print(pca01.components_,'\n',"pca01.components_ shape :{}".format(pca01.components_.shape))
print('\n',pca01.explained_variance_)
print('\n',pca01.singular_values_)
[[ 0.52106591 -0.26934744 0.5804131 0.56485654]
[ 0.37741762 0.92329566 0.02449161 0.06694199]]
pca01.components_ shape :(2, 4)
[2.93808505 0.9201649 ]
[20.92306556 11.7091661 ]
pca02 = PCA(n_components=2,svd_solver='arpack')
decompose_x_02 = pca02.fit_transform(x)
print(pca02.components_,'\n',"pca02.components_ shape :{}".format(pca02.components_.shape))
print('\n',pca02.explained_variance_)
print('\n',pca02.singular_values_)
[[ 0.52106591 -0.26934744 0.5804131 0.56485654]
[ 0.37741762 0.92329566 0.02449161 0.06694199]]
pca02.components_ shape :(2, 4)
[2.93808505 0.9201649 ]
[20.92306556 11.7091661 ]
pca03 = PCA(n_components=2,svd_solver='randomized')
decompose_x_03 = pca03.fit_transform(x)
print(pca03.components_,'\n',"pca03.components_ shape :{}".format(pca03.components_.shape))
print('\n',pca03.explained_variance_)
print('\n',pca03.singular_values_)
[[ 0.52106591 -0.26934744 0.5804131 0.56485654]
[ 0.37741762 0.92329566 0.02449161 0.06694199]]
pca03.components_ shape :(2, 4)
[2.93808505 0.9201649 ]
[20.92306556 11.7091661 ]
Running every svd_solver option shows that the results are identical.
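A quick programmatic version of that check (my addition, reusing the four fitted objects from above):
## all four solvers should agree with the 'auto' result
for p in (pca01, pca02, pca03):
    print(np.allclose(pca00.components_, p.components_),
          np.allclose(pca00.explained_variance_, p.explained_variance_))   # expected: True True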
Now let's check whether we get the same results when using the numpy.linalg library!
Using np.linalg.eig
pca04 = PCA(n_components=4,svd_solver='full')
decompose_x_04 = pca04.fit_transform(x)
eigenvalue, eigenvector = np.linalg.eig(np.cov(x.transpose()))
explained_variance_ratio_sum_ = np.cumsum(pca04.explained_variance_ratio_)
## np.linalg.eig does not sort its output, so order the eigenpairs by descending eigenvalue
index = eigenvalue.argsort()[::-1]
eigenvalue = eigenvalue[index]
eigenvector = eigenvector[:, index]
eigenvalue
array([2.93808505, 0.9201649 , 0.14774182, 0.02085386])
eigenvector
array([[ 0.52106591, -0.37741762, -0.71956635, 0.26128628],
[-0.26934744, -0.92329566, 0.24438178, -0.12350962],
[ 0.5804131 , -0.02449161, 0.14212637, -0.80144925],
[ 0.56485654, -0.06694199, 0.63427274, 0.52359713]])
pca04.explained_variance_
array([2.93808505, 0.9201649 , 0.14774182, 0.02085386])
pca04.components_
array([[ 0.52106591, -0.26934744, 0.5804131 , 0.56485654],
[ 0.37741762, 0.92329566, 0.02449161, 0.06694199],
[-0.71956635, 0.24438178, 0.14212637, 0.63427274],
[-0.26128628, 0.12350962, 0.80144925, -0.52359713]])
The eigenvalues are identical, and the eigenvectors match the rows of components_ up to sign (the sign of an eigenvector is arbitrary).
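A small check of that equality, written to be insensitive to the arbitrary signs (my addition):
print(np.allclose(eigenvalue, pca04.explained_variance_))                 # expected: True
print(np.allclose(np.abs(eigenvector.T), np.abs(pca04.components_)))      # expected: True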
Using np.linalg.svd
## scipy.sparse.linalg and numpy.linalg are different modules; keep them distinct.
u, s, vh = np.linalg.svd(x,full_matrices=True)
## u : left singular vectors
## s : The singular values.
## vh : right singular vectors as rows
print(u.shape,s.shape,vh.shape)
(150, 150) (4,) (4, 4)
print('singular value :', s)
print('right singular vector :\n', vh.T)
singular value : [20.92306556 11.7091661 4.69185798 1.76273239]
right singular vector :
[[ 0.52106591 -0.37741762 0.71956635 0.26128628]
[-0.26934744 -0.92329566 -0.24438178 -0.12350962]
[ 0.5804131 -0.02449161 -0.14212637 -0.80144925]
[ 0.56485654 -0.06694199 -0.63427274 0.52359713]]
print(pca04.explained_variance_) ## the eigenvalues from the eigendecomposition
print(pca04.singular_values_) ## the singular values are computed automatically, and they also match
[2.93808505 0.9201649 0.14774182 0.02085386]
[20.92306556 11.7091661 4.69185798 1.76273239]
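To tie the two decompositions together (my own check): the eigenvalues are the squared singular values divided by n - 1, and components_ equals vh up to sign.
print(np.allclose(s ** 2 / (x.shape[0] - 1), pca04.explained_variance_))   # expected: True
print(np.allclose(np.abs(vh), np.abs(pca04.components_)))                  # expected: True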
Reconstructing the original x
## Reconstruct x as u @ smat @ vh. np.linalg.svd returns the singular values s as a 1-D array,
## so s must first be expanded into a (150, 4) diagonal matrix before the multiplication.
smat = np.zeros((150, 4), dtype=complex)   ## dtype=complex is not needed for real data; it only makes the output print +0.j
smat[:4, :4] = np.diag(s)
print(u.shape,s.shape,smat.shape,vh.shape)
(150, 150) (4,) (150, 4) (4, 4)
back_x = np.matmul(u,smat)
back_x = np.matmul(back_x,vh)
print(back_x.shape)
np.round(back_x[0:2],4)
(150, 4)
array([[-0.9007+0.j, 1.019 +0.j, -1.3402+0.j, -1.3154+0.j],
[-1.143 +0.j, -0.132 +0.j, -1.3402+0.j, -1.3154+0.j]])
x[0:2]
array([[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],
[-1.14301691, -0.13197948, -1.34022653, -1.3154443 ]])
We can see that the reconstruction gives the same result as the original data.
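As a side note (my own sketch), passing full_matrices=False returns the reduced SVD, which avoids building the padded smat entirely:
u2, s2, vh2 = np.linalg.svd(x, full_matrices=False)   # u2 is (150, 4), so s2 can be applied directly
back_x2 = (u2 * s2) @ vh2                             # equivalent to u2 @ np.diag(s2) @ vh2
print(np.allclose(back_x2, x))                        # expected: True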