

Naver Movie Review Sentiment Analysis

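Search Google for konlpy and open its site > Getting Started > Usage >

Installation > Windows > open cmd, run java, and confirm that it prints its usage output >

Click the JPype1 (>=0.5.7) download link > install JPype1‑0.7.1‑cp37‑cp37m‑win_amd64.whl (the cp37 build targets 64-bit Python 3.7) >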
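The prerequisites above can also be checked from Python before going further. This is a minimal sketch, not part of the original steps: the helper name `check_konlpy_prerequisites` is made up for illustration, and it only tests whether `java` is on the PATH and whether the `jpype` and `konlpy` packages can be found.

```python
import importlib.util
import shutil


def check_konlpy_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    return {
        "java on PATH": shutil.which("java") is not None,
        "jpype importable": importlib.util.find_spec("jpype") is not None,
        "konlpy importable": importlib.util.find_spec("konlpy") is not None,
    }


for name, ok in check_konlpy_prerequisites().items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```

If any line prints "missing", the corresponding installation step is worth repeating before creating a `Kkma()` object.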

Load the test and train data in Jupyter Notebook

 

Search Google for the Naver movie review dataset > download it from GitHub > put the folder where Jupyter Notebook can read it

 

import pandas as pd
# Show full column contents; None replaces the deprecated -1 (pandas >= 1.0)
pd.set_option('display.max_colwidth', None)

df_train = pd.read_csv('data/ratings_train.txt', delimiter='\t')

df_train.head(30)

df_train.info()

df_train.dropna(inplace=True)

df_train.info()

df_test = pd.read_csv('data/ratings_test.txt', delimiter='\t')

df_test.info()

df_test.dropna(inplace=True)
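The `dropna` calls matter because a review with an empty `document` field is read as NaN, and the vectorizer used later expects strings. The sketch below reproduces this on a few made-up rows; the `id` / `document` / `label` column layout is an assumption about the files, and the rows themselves are invented.

```python
import io

import pandas as pd

# Made-up rows in a tab-separated id / document / label layout.
# The second review is empty, so pandas reads it as NaN.
sample = "id\tdocument\tlabel\n1\tgood movie\t1\n2\t\t0\n3\tboring\t0\n"
df = pd.read_csv(io.StringIO(sample), delimiter="\t")

missing = int(df["document"].isna().sum())
print(missing)  # number of reviews with no text

# Drop the incomplete rows and renumber the index so positions and labels agree.
df = df.dropna().reset_index(drop=True)
print(len(df))
```

Resetting the index is optional here, but it keeps lookups such as `text_train[0]` pointing at the first remaining row.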

 

Separate the review text and labels for train and test

text_train = df_train['document']
y_train = df_train['label']
text_test = df_test['document']
y_test = df_test['label']
print(text_train.shape)
print(y_train.shape)
print(text_test.shape)
print(y_test.shape)

from sklearn.feature_extraction.text import TfidfVectorizer

tmp_tf_idf_vect = TfidfVectorizer()
tmp_tf_idf_vect.fit(text_train[:3])

tmp_tf_idf_vect.vocabulary_
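The vocabulary above comes from `TfidfVectorizer`'s default tokenization, which lowercases the text and keeps runs of two or more word characters. The sketch below applies that same default pattern with `re` to two made-up sentences, to show why it is a poor fit for Korean: a noun with a particle attached becomes its own token.

```python
import re

# TfidfVectorizer's default token_pattern: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Mimic the default analyzer: lowercase, then apply the token pattern."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


first = default_tokens("영화 정말 재미있다")
second = default_tokens("이 영화가 재미없다")
print(first)   # '영화' appears as a bare noun
print(second)  # '영화가' (noun + particle) is a different token; single-character '이' is dropped
```

Because '영화' and '영화가' end up in separate columns, a morphological analyzer is used as the tokenizer in the next step.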

 

Using Kkma from konlpy

from konlpy.tag import Kkma

kkma = Kkma()

# .iloc selects by position, which stays valid even if dropna removed row label 0
kkma.nouns(text_train.iloc[0])

def myTokenizer(text):
    return kkma.nouns(text)
    
tmp_tf_idf_vect = TfidfVectorizer(tokenizer=myTokenizer)
tmp_tf_idf_vect.fit(text_train[:3])
tmp_tf_idf_vect.vocabulary_
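The `tokenizer=` hook accepts any function that turns a string into a list of tokens, so the mechanism can be tried without Java or Kkma. In the sketch below, `toy_tokenizer` is a deliberately naive stand-in (it strips a few hard-coded particles) and is not a substitute for real morphological analysis; passing `token_pattern=None` is an addition here that signals the default pattern is unused.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

PARTICLES = ("가", "는", "을", "를")  # tiny illustrative list, far from complete


def toy_tokenizer(text):
    """Split on whitespace and strip one trailing particle from each word."""
    tokens = []
    for word in text.split():
        if len(word) > 1 and word.endswith(PARTICLES):
            word = word[:-1]
        tokens.append(word)
    return tokens


vect = TfidfVectorizer(tokenizer=toy_tokenizer, token_pattern=None)
vect.fit(["영화가 좋다", "영화는 별로다"])
print(sorted(vect.vocabulary_))
```

Both '영화가' and '영화는' now map to the single term '영화', which is the effect `kkma.nouns` achieves properly on real reviews.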

 

Using POS tagging

data = '먹는다 먹다 먹었다 이쁘다 아름답다 사진 모자'
kkma.morphs(data)

kkma.pos(data)

kkma.tagset

d = pd.DataFrame(kkma.pos(data), columns=['morph','tag'])
d.set_index('tag', inplace=True)
d.loc[['VV','VA','NNG']]

def myTokenizer2(text):
    # Keep only verbs (VV), adjectives (VA), and common nouns (NNG).
    # Filtering the (morph, tag) pairs directly avoids the KeyError that
    # d.loc[[...]] raises in recent pandas when one of the tags is absent.
    keep = {'VV', 'VA', 'NNG'}
    return [morph for morph, tag in kkma.pos(text) if tag in keep]
        
tmp_tf_idf_vect = TfidfVectorizer(tokenizer=myTokenizer2)
tmp_tf_idf_vect.fit(text_train[:3])
tmp_tf_idf_vect.vocabulary_

final_tf_idf_vect = TfidfVectorizer(tokenizer=myTokenizer2)
final_tf_idf_vect.fit(text_train[:10000])

len(final_tf_idf_vect.vocabulary_)

X_train = final_tf_idf_vect.transform(text_train[:10000])
X_test = final_tf_idf_vect.transform(text_test[:10000])
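Note that the test reviews are only passed through `transform`, so they are encoded with the vocabulary learned from the training reviews. A small sketch with made-up English sentences shows the consequence: a word that was never seen during `fit` has no column and is silently ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
vect.fit(["good movie"])           # vocabulary learned from the "train" text only
X = vect.transform(["bad movie"])  # "bad" was never seen during fit

print(sorted(vect.vocabulary_))    # terms that have a column
print(X.shape)                     # one row, one column per known term
print(X.toarray())                 # only the "movie" column is non-zero
```

This is why `X_train` and `X_test` end up with the same number of columns, which the classifier requires.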

from sklearn.linear_model import LogisticRegression

logi = LogisticRegression()
logi.fit(X_train,y_train[:10000])
logi.score(X_test,y_test[:10000])
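`logi.score` returns the mean accuracy on the given test rows. As an alternative way to organize the same steps, the vectorizer and classifier can be bundled with `make_pipeline`, so that fitting and scoring each take raw text. This is a sketch on made-up English sentences, not the setup used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset: 1 = positive, 0 = negative.
train_texts = ["great fun", "really great", "boring mess", "really boring"]
train_labels = [1, 1, 0, 0]
test_texts = ["great", "boring"]
test_labels = [1, 0]

# The pipeline fits the vectorizer on the training text only,
# then reuses that vocabulary when scoring or predicting.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

accuracy = model.score(test_texts, test_labels)
predictions = model.predict(test_texts).tolist()
print(accuracy, predictions)
```

With the Kkma tokenizer, the first pipeline step would be `TfidfVectorizer(tokenizer=myTokenizer2)` instead.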

# Map each column index back to its term (vocabulary_ is a term -> index dict)
voc = pd.DataFrame(list(final_tf_idf_vect.vocabulary_.keys()),
                  index=list(final_tf_idf_vect.vocabulary_.values()))
voc.sort_index()[0]

learning_result = pd.DataFrame(logi.coef_.T , index=voc.sort_index()[0],
                              columns=['w'])
learning_result.sort_values(by='w')
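The table above pairs each term with its learned weight: strongly negative weights push a review toward label 0, and strongly positive weights push it toward label 1. The sketch below shows the same pairing in plain Python. The `vocabulary` dict and `coefficients` list are invented stand-ins for `final_tf_idf_vect.vocabulary_` and `logi.coef_[0]`, not real results.

```python
# Hypothetical stand-ins for final_tf_idf_vect.vocabulary_ and logi.coef_[0].
vocabulary = {"재미": 2, "최고": 3, "지루": 1, "별로": 0}  # term -> column index
coefficients = [-1.2, -0.8, 0.9, 1.5]                      # one weight per column

# Order the terms by column index so they line up with the coefficients,
# then sort the (term, weight) pairs from most negative to most positive.
terms_by_column = sorted(vocabulary, key=vocabulary.get)
weights = sorted(zip(terms_by_column, coefficients), key=lambda pair: pair[1])

most_negative = weights[:2]
most_positive = weights[-2:]
print(most_negative)
print(most_positive)
```

On the real model, `learning_result.sort_values(by='w').head()` and `.tail()` give the same two views.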