20 News Group Basic 02

11 minute read

Reference

  • 파이썬 머신러닝 완벽가이드 (Python Machine Learning Perfect Guide) - 권철민 (Cheol-min Kwon)
  • NCIA shkim.hi@gmail.com
  • Dirichlet distribution

20 Newsgroups Topic Modeling

  • Load 8 of the 20 newsgroup categories and apply count-based feature vectorization
  • LDA works only with count-based vectorizers

Topic Modeling

Machine learning-based topic modeling extracts the central words that concisely represent the hidden topics in a corpus.

Machine learning-based topic modeling algorithms

LSA (Latent Semantic Analysis), pLSA
LDA (Latent Dirichlet Allocation) - the subject of this post
NMF (Non-Negative Matrix Factorization)

Matrix factorization-based topic modeling: LSA, NMF
Probability-based topic modeling: LDA, pLSA

Whichever algorithm is used, topic modeling rests on the following two assumptions (formalized in the sketch below):
Each document is a mixture of multiple topics.
Each topic is composed of multiple words.
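LDA makes these two assumptions concrete as a generative process. A minimal sketch in standard notation, where \(\alpha\) and \(\beta\) are the Dirichlet priors over the document-topic and topic-word distributions:

$$\phi_k \sim \mathrm{Dir}(\beta) \quad \text{(word distribution of topic } k\text{)}$$
$$\theta_d \sim \mathrm{Dir}(\alpha) \quad \text{(topic mixture of document } d\text{)}$$
$$z_{d,n} \sim \mathrm{Cat}(\theta_d), \qquad w_{d,n} \sim \mathrm{Cat}(\phi_{z_{d,n}}) \quad \text{(topic assignment and observed word at position } n\text{)}$$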

from sklearn.datasets import fetch_20newsgroups

# Extract 8 categories: motorcycles, baseball, graphics, windows, mideast, christian, electronics, and med.
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', 'comp.windows.x',
        'talk.politics.mideast', 'soc.religion.christian', 'sci.electronics', 'sci.med'  ]
# Extract only the categories listed in cats above by passing them to the categories parameter of fetch_20newsgroups().
news_df= fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'), categories=cats, random_state=0)
print(type(news_df))
print(news_df.keys())
print(type(news_df.data), type(news_df.target))
print(news_df.target.shape)
print(news_df.target[0])
print(news_df.data[0])
<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
<class 'list'> <class 'numpy.ndarray'>
(7862,)
6
I appreciate if anyone can point out some good books about the dead sea
scrolls of Qumran. Thanks in advance.
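As a quick sanity check (an extra line, not in the original notebook): target values index into target_names, which fetch_20newsgroups keeps in sorted category order, so the 6 above should resolve to soc.religion.christian, matching the Dead Sea Scrolls question.

print(news_df.target_names[news_df.target[0]])  # expected: soc.religion.christian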
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# LDA works only with count-based vectorizers.
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2, stop_words='english', ngram_range=(1,2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape:', feat_vect.shape)
CountVectorizer Shape: (7862, 1000)

Setting max_features=1000 limits the vocabulary to the 1,000 most frequent terms.
Pulling one of the 7,862 documents out shows it encoded against this vocabulary. Note that feat_vect is a CSR sparse matrix, so keep the data type in mind when converting it.

print(feat_vect[0]) ## printed in (row, col) count form because feat_vect is a csr_matrix
  (0, 93)	1
  (0, 669)	1
  (0, 390)	1
  (0, 148)	1
  (0, 251)	1
  (0, 876)	1
  (0, 70)	1
  (0, 877)	1
count_vect.inverse_transform(feat_vect[0])
[array(['appreciate', 'point', 'good', 'books', 'dead', 'thanks',
        'advance', 'thanks advance'], dtype='<U14')]
[key for key,val in count_vect.vocabulary_.items() if val in feat_vect[0].indices]
['appreciate',
 'point',
 'good',
 'books',
 'dead',
 'thanks',
 'advance',
 'thanks advance']
print(feat_vect[0].toarray())
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
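Printing the dense row is unwieldy; the same information can be read straight off the CSR row's indices and data attributes (a small sketch, not in the original post):

# list only the nonzero entries of document 0: vocabulary word and its count
row = feat_vect[0]
vocab = count_vect.get_feature_names()
for col, cnt in zip(row.indices, row.data):
    print(vocab[col], cnt)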

Create an LDA object and run LDA on the count-vectorized features

Dirichlet distribution
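To get a feel for the Dirichlet distribution that gives LDA its name, you can draw samples with NumPy (an illustrative aside): each draw is a point on the probability simplex, i.e., nonnegative weights summing to 1, which is exactly the form of a document's topic mixture.

import numpy as np

# one draw from an 8-dimensional symmetric Dirichlet; small alpha favors sparse mixtures
sample = np.random.dirichlet(alpha=[0.1] * 8)
print(sample, sample.sum())  # 8 nonnegative weights, summing to 1.0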

lda = LatentDirichletAllocation(n_components=8, random_state=0) ## the number of topics is set to 8 by hand
## fitting feat_vect initializes the topic distributions according to the n_components setting
lda.fit(feat_vect) ## feat_vect: bag-of-words counts vectorized by CountVectorizer
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=8, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
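Once fitted, the model can also be scored on the document-term matrix. scikit-learn's LatentDirichletAllocation provides score() (approximate log-likelihood) and perplexity(), which come in handy when comparing different n_components settings (an optional check, not part of the original walkthrough):

print('approx. log-likelihood:', lda.score(feat_vect))
print('perplexity:', lda.perplexity(feat_vect))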

Checking each topic's word relevances

  • The components_ attribute of the lda object holds a normalized relevance score of each word for each topic
  • Its shape is (number of topics) x (number of feature words)
  • The values in components_ are normalized counts of how often each word appeared in each topic
  • The larger the value, the more weight that word carries within the topic
print(lda.components_.shape)
lda.components_
(8, 1000)

array([[3.60992018e+01, 1.35626798e+02, 2.15751867e+01, ...,
        3.02911688e+01, 8.66830093e+01, 6.79285199e+01],
       [1.25199920e-01, 1.44401815e+01, 1.25045596e-01, ...,
        1.81506995e+02, 1.25097844e-01, 9.39593286e+01],
       [3.34762663e+02, 1.25176265e-01, 1.46743299e+02, ...,
        1.25105772e-01, 3.63689741e+01, 1.25025218e-01],
       ...,
       [3.60204965e+01, 2.08640688e+01, 4.29606813e+00, ...,
        1.45056650e+01, 8.33854413e+00, 1.55690009e+01],
       [1.25128711e-01, 1.25247756e-01, 1.25005143e-01, ...,
        9.17278769e+01, 1.25177668e-01, 3.74575887e+01],
       [5.49258690e+01, 4.47009532e+00, 9.88524814e+00, ...,
        4.87048440e+01, 1.25034678e-01, 1.25074632e-01]])

The result has shape (8, 1000): one row of word scores for each of the 8 topics.

# for test by SOO
lda.components_.argsort() ## argsort within each row: earlier positions hold the indices of the smallest values
array([[959, 484, 990, ..., 374,   7, 994],
       [296, 433, 690, ..., 517, 485, 291],
       [123, 124, 106, ..., 484, 353, 451],
       ...,
       [ 86,  69, 940, ..., 119, 205, 312],
       [296, 988, 295, ..., 479, 655, 386],
       [478, 496,  69, ..., 876, 295, 921]], dtype=int64)
lda.components_.argsort().shape
(8, 1000)
# for test by JJH
print(lda.components_[0][297], lda.components_[0][485], lda.components_[0][769], lda.components_[0][994])
print(lda.components_[7][297], lda.components_[7][691], lda.components_[7][486], lda.components_[7][518])
110.41185581607033 59.891798905153955 87.4359751747397 703.2389928959205
19.163149265944565 47.2505161189943 239.93631363605678 16.22916033549396

The returned values measure how strongly each word is associated with each topic. They are not correlation coefficients; in scikit-learn they can be read as pseudo-counts of how many times each word was assigned to each topic.
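Since the rows are pseudo-counts, dividing each row by its sum yields proper per-topic word probability distributions (a small sketch; the name topic_word_dist is mine):

import numpy as np

# normalize each of the 8 rows of components_ so it sums to 1
topic_word_dist = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word_dist.sum(axis=1))  # each row now sums to 1.0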

# for test by JJH
print(lda.components_[0].argsort()[::][:5], lda.components_[0].argsort()[::-1][:5])  # 5 least vs. 5 most relevant word indices for topic 0
print(lda.components_[7].argsort()[::][:5], lda.components_[7].argsort()[::-1][:5])  # same for topic 7
[959 484 990 960 361] [994   7 374 563 420]
[478 496  69 906 433] [921 295 876 966 928]
# for test by JJH
' + '.join([str('aaa')+'*'+str(10.33) for i in range(5)])  ## prototype of the 'word*score' string built in display_topic_words below
'aaa*10.33 + aaa*10.33 + aaa*10.33 + aaa*10.33 + aaa*10.33'

Checking the central words of each topic

print(len(count_vect.get_feature_names()))
count_vect.get_feature_names()[210:220]
1000

['comments',
 'commercial',
 'common',
 'community',
 'comp',
 'company',
 'complete',
 'completely',
 'computer',
 'conference']
## no_top_words: how many top words to display
## feature_names: the word names of the fitted vocabulary
def display_topic_words(lda_model, feature_names, no_top_words):
    for topic_index, topic in enumerate(lda_model.components_): ## components_ is 8 x 1000, so take one topic row at a time
        print('\nTopic #',topic_index)

        # Sort the components_ row by value in descending order and return the corresponding array indices.
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes=topic_word_indexes[:no_top_words]
        
        # For each index in top_indexes, look up the word in feature_names, then concatenate with join
        feature_concat = ' + '.join([str(feature_names[i])+'*'+str(round(topic[i],1)) for i in top_indexes])
        print(feature_concat)
# Extract the names of all words in the CountVectorizer object via get_feature_names()
feature_names = count_vect.get_feature_names()
# Show only the 15 most relevant words per topic
display_topic_words(lda, feature_names, 15)

# Reminder: the 8 extracted categories are motorcycles, baseball, graphics, windows, mideast, christian, electronics, and med.
Topic # 0
year*703.2 + 10*563.6 + game*476.3 + medical*413.2 + health*377.4 + team*346.8 + 12*343.9 + 20*340.9 + disease*332.1 + cancer*319.9 + 1993*318.3 + games*317.0 + years*306.5 + patients*299.8 + good*286.3

Topic # 1
don*1454.3 + just*1392.8 + like*1190.8 + know*1178.1 + people*836.9 + said*802.5 + think*799.7 + time*754.2 + ve*676.3 + didn*675.9 + right*636.3 + going*625.4 + say*620.7 + ll*583.9 + way*570.3

Topic # 2
image*1047.7 + file*999.1 + jpeg*799.1 + program*495.6 + gif*466.0 + images*443.7 + output*442.3 + format*442.3 + files*438.5 + color*406.3 + entry*387.6 + 00*334.8 + use*308.5 + bit*308.4 + 03*258.7

Topic # 3
like*620.7 + know*591.7 + don*543.7 + think*528.4 + use*514.3 + does*510.2 + just*509.1 + good*425.8 + time*417.4 + book*410.7 + read*402.9 + information*395.2 + people*393.5 + used*388.2 + post*368.4

Topic # 4
armenian*960.6 + israel*815.9 + armenians*699.7 + jews*690.9 + turkish*686.1 + people*653.0 + israeli*476.1 + jewish*467.0 + government*464.4 + war*417.8 + dos dos*401.1 + turkey*393.5 + arab*386.1 + armenia*346.3 + 000*345.2

Topic # 5
edu*1613.5 + com*841.4 + available*761.5 + graphics*708.0 + ftp*668.1 + data*517.9 + pub*508.2 + motif*460.4 + mail*453.3 + widget*447.4 + software*427.6 + mit*421.5 + information*417.3 + version*413.7 + sun*402.4

Topic # 6
god*2013.0 + people*721.0 + jesus*688.7 + church*663.0 + believe*563.0 + christ*553.1 + does*500.1 + christian*474.8 + say*468.6 + think*446.0 + christians*443.5 + bible*422.9 + faith*420.1 + sin*396.5 + life*371.2

Topic # 7
use*685.8 + dos*635.0 + thanks*596.0 + windows*548.7 + using*486.5 + window*483.1 + does*456.2 + display*389.1 + help*385.2 + like*382.8 + problem*375.7 + server*370.2 + need*366.3 + know*355.5 + run*315.3

Checking per-document topic distributions

  • Calling transform() on the lda object returns each document's topic distribution
doc_topics = lda.transform(feat_vect)
print(doc_topics.shape)
print(doc_topics[:3])
(7862, 8)
[[0.01389701 0.01394362 0.01389104 0.48221844 0.01397882 0.01389205
  0.01393501 0.43424401]
 [0.27750436 0.18151826 0.0021208  0.53037189 0.00212129 0.00212102
  0.00212113 0.00212125]
 [0.00544459 0.22166575 0.00544539 0.00544528 0.00544039 0.00544168
  0.00544182 0.74567512]]
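Each row sums to 1, so a document's dominant topic is just the argmax of its row (a quick illustrative check):

print(doc_topics[:3].argmax(axis=1))  # dominant topic index of the first 3 documents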

Printing per-document topic distributions

  • Print the document names of the 20-newsgroup documents
  • The filenames attribute of the data returned by fetch_20newsgroups() holds every document's file name
  • Since filenames stores absolute paths, split on '\' and join the last two path components to form the document name
# for test by SOO
print(news_df.filenames)
['C:\\Users\\정진환\\scikit_learn_data\\20news_home\\20news-bydate-train\\soc.religion.christian\\20630'
 'C:\\Users\\정진환\\scikit_learn_data\\20news_home\\20news-bydate-test\\sci.med\\59422'
 'C:\\Users\\정진환\\scikit_learn_data\\20news_home\\20news-bydate-test\\comp.graphics\\38765'
 ...
 'C:\\Users\\정진환\\scikit_learn_data\\20news_home\\20news-bydate-train\\rec.sport.baseball\\102656'
 'C:\\Users\\정진환\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.electronics\\53606'
 'C:\\Users\\정진환\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.politics.mideast\\76505']
def get_filename_list(newsdata):
    filename_list=[]

    for file in newsdata.filenames:
        filename_temp = file.split('\\')[-2:]  # last two path components: category and document id
        filename = '.'.join(filename_temp)
        filename_list.append(filename)

    return filename_list

filename_list = get_filename_list(news_df)
print("number of filenames:",len(filename_list), "first 10 filenames:",filename_list[:10])
number of filenames: 7862 first 10 filenames: ['soc.religion.christian.20630', 'sci.med.59422', 'comp.graphics.38765', 'comp.graphics.38810', 'sci.med.59449', 'comp.graphics.38461', 'comp.windows.x.66959', 'rec.motorcycles.104487', 'sci.electronics.53875', 'sci.electronics.53617']

Create a DataFrame to check per-document topic distributions

import pandas as pd 

topic_names = ['Topic #'+ str(i) for i in range(0, 8)]
doc_topic_df = pd.DataFrame(data=doc_topics, columns=topic_names, index=filename_list)
doc_topic_df.head(20)
# the 8 categories: motorcycles, baseball, graphics, windows, mideast, christian, electronics, med
Topic #0 Topic #1 Topic #2 Topic #3 Topic #4 Topic #5 Topic #6 Topic #7
soc.religion.christian.20630 0.013897 0.013944 0.013891 0.482218 0.013979 0.013892 0.013935 0.434244
sci.med.59422 0.277504 0.181518 0.002121 0.530372 0.002121 0.002121 0.002121 0.002121
comp.graphics.38765 0.005445 0.221666 0.005445 0.005445 0.005440 0.005442 0.005442 0.745675
comp.graphics.38810 0.005439 0.005441 0.005449 0.578959 0.005440 0.388387 0.005442 0.005442
sci.med.59449 0.006584 0.552000 0.006587 0.408485 0.006585 0.006585 0.006588 0.006585
comp.graphics.38461 0.008342 0.008352 0.182622 0.767314 0.008335 0.008341 0.008343 0.008351
comp.windows.x.66959 0.372861 0.041667 0.377020 0.041668 0.041703 0.041703 0.041667 0.041711
rec.motorcycles.104487 0.225351 0.674669 0.004814 0.075920 0.004812 0.004812 0.004812 0.004810
sci.electronics.53875 0.008944 0.836686 0.008932 0.008941 0.008935 0.109691 0.008932 0.008938
sci.electronics.53617 0.041733 0.041720 0.708081 0.041742 0.041671 0.041669 0.041699 0.041686
sci.electronics.54089 0.001647 0.512634 0.001647 0.152375 0.001645 0.001649 0.001647 0.326757
rec.sport.baseball.102713 0.982653 0.000649 0.013455 0.000649 0.000648 0.000648 0.000649 0.000649
rec.sport.baseball.104711 0.288554 0.007358 0.007364 0.596561 0.078082 0.007363 0.007360 0.007358
comp.graphics.38232 0.044939 0.138461 0.375098 0.003914 0.003909 0.003911 0.003912 0.425856
sci.electronics.52732 0.017944 0.874782 0.017869 0.017904 0.017867 0.017866 0.017884 0.017885
talk.politics.mideast.76440 0.003381 0.003385 0.003381 0.843991 0.135716 0.003380 0.003384 0.003382
sci.med.59243 0.491684 0.486865 0.003574 0.003577 0.003578 0.003574 0.003574 0.003574
talk.politics.mideast.75888 0.015639 0.499140 0.015641 0.015683 0.015640 0.406977 0.015644 0.015636
soc.religion.christian.21526 0.002455 0.164735 0.002455 0.002456 0.208655 0.002454 0.614333 0.002458
comp.windows.x.66408 0.000080 0.000080 0.809449 0.163054 0.000080 0.027097 0.000080 0.000080
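With the DataFrame built, each document's dominant topic falls out of idxmax along the columns, which makes it easy to eyeball how well the learned topics line up with the true newsgroup names in the index (illustrative; the dominant_topic column is mine):

doc_topic_df['dominant_topic'] = doc_topic_df.idxmax(axis=1)
print(doc_topic_df['dominant_topic'].head(10))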
