https://ai.stanford.edu/~amaas/data/sentiment/
Download the Large Movie Review Dataset v1.0
Goals
- Run positive/negative sentiment analysis on movie review data.
- Learn how to handle text data.
from sklearn.datasets import load_files
import numpy as np
data_url = 'data/aclImdb/train/'
reviews_train = load_files(data_url, shuffle=True)  # one class per subfolder
data_url = 'data/aclImdb/test/'
reviews_test = load_files(data_url, shuffle=True)
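Note: the original aclImdb archive also ships a train/unsup folder with 50,000 unlabeled reviews. Delete or move it before calling load_files, otherwise those reviews get loaded as a third class.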
reviews_train.keys()
text_train, y_train = reviews_train.data, reviews_train.target
text_test, y_test = reviews_test.data, reviews_test.target
print('train size :', len(text_train))
print('test size :', len(text_test))
np.bincount(y_train)  # class distribution: 12500 negative, 12500 positive
Text data preprocessing
text_train[:10]
Cleaning up <br /> tags
text_train = [doc.replace(b'<br />', b' ') for doc in text_train]  # load_files returns bytes, hence the b'' literals
text_train[:10]
text_test = [doc.replace(b'<br />', b' ') for doc in text_test]
text_test[:10]
Tokenization & feature extraction (vectorization) - BOW (bag of words)
from sklearn.feature_extraction.text import CountVectorizer
test_words = ['I love you', 'Hello my name is haedo', 'good choice', '안녕 나는 네모야']
vect = CountVectorizer()
vect.fit(test_words)
print(vect.vocabulary_)
vect.transform(test_words).toarray()
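Each row of the resulting array is one sentence and each column counts occurrences of the corresponding vocabulary entry. Note that CountVectorizer's default token pattern only keeps tokens of two or more word characters, which is why 'I' does not appear in the vocabulary.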
Applying BOW to the train and test sets
movie_count_vec = CountVectorizer()
movie_count_vec.fit(text_train)  # build the vocabulary from the training data only
len(movie_count_vec.vocabulary_)
X_train = movie_count_vec.transform(text_train)
X_train
X_test = movie_count_vec.transform(text_test)
X_test
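transform returns a SciPy sparse matrix (almost all counts are zero), which is why displaying X_train and X_test shows the shape and the number of stored elements rather than the values themselves.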
Training
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
logi = LogisticRegression()  # increase max_iter if convergence warnings appear
svc = LinearSVC()
logi_result = cross_val_score(logi, X_train, y_train, cv=3)
logi_result.mean()
svc_result = cross_val_score(svc, X_train, y_train, cv=3)
svc_result.mean()
Evaluation
logi.fit(X_train, y_train)
logi.score(X_test, y_test)
svc.fit(X_train, y_train)
svc.score(X_test, y_test)
Using the model
data = ["This was a horrible movie. It's a waste of time and money. It was like watching Desperately Seeking Susan meets Boo from Monsters Inc."]
transform_data = movie_count_vec.transform(data)
logi.predict(transform_data)
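predict returns the encoded class index. load_files stores the original folder names in target_names, so the prediction can be decoded back to a readable label (assuming only the neg and pos folders were loaded):
label = logi.predict(transform_data)[0]
print(reviews_train.target_names[label])  # 'neg' or 'pos'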
Pipeline
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression())  # vectorize and classify in one estimator
pipe.fit(text_train, y_train)
pipe.predict(data)
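Bundling the vectorizer and the classifier into one estimator has a second benefit: in cross-validation and grid search the vocabulary is rebuilt on each training fold, so no information from the validation fold leaks into the features.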
Hyperparameter tuning
CountVectorizer hyperparameters
- min_df : minimum number of documents a token must appear in to be kept
- max_df : maximum number of documents a token may appear in before it is dropped
- ngram_range : the range of n-gram sizes used as tokens (unigrams, bigrams, trigrams); see the sketch after this list
LogisticRegression hyperparameters
- C (regularization) : inverse regularization strength for the linear model's weights; smaller values regularize more
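A minimal sketch (toy sentences, illustrative values) of how the CountVectorizer options reshape the vocabulary:
demo_vect = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
demo_vect.fit(['I love you', 'you love movies'])
print(demo_vect.vocabulary_)  # now includes bigrams such as 'love you'
demo_vect = CountVectorizer(min_df=2)  # keep tokens appearing in at least 2 documents
demo_vect.fit(['I love you', 'you love movies'])
print(demo_vect.vocabulary_)  # only 'love' and 'you' survive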
from sklearn.model_selection import GridSearchCV
grid_params = {
'countvectorizer__min_df' : [3,5,10],
'countvectorizer__max_df' : [20000,22000,24000],
'countvectorizer__ngram_range' : [(1,2),(1,3),(2,2)],
'logisticregression__C' : [0.001,0.01,0.1,10,100]
}
grid = GridSearchCV(pipe, grid_params, cv=3, n_jobs=-1)
grid.fit(text_train, y_train)
print(grid.best_score_)
print(grid.best_params_)
best_pipe = grid.best_estimator_
best_pipe.score(text_test, y_test)
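With refit=True (the default), GridSearchCV retrains the best parameter combination on the full training set, which is why best_estimator_ can be scored on the test set directly.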
tf-idf
- df : how many documents in the corpus a term appears in
- tf : how many times a term appears within a single document
- tf-idf combines the two, weighting tf by the inverse of df, so terms concentrated in a few documents score higher than terms common everywhere
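For reference, with scikit-learn's defaults the weight of term t in document d is tf(t, d) * idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1, n is the number of documents, and each document vector is then L2-normalized.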
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(text_train[:3])  # fit on just three reviews to keep the demo small
len(tfidf_vect.vocabulary_)
tfidf_vect.transform(text_train[:3]).toarray()
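As a sketch (untuned, reusing the imports above), TfidfVectorizer can be dropped into the earlier pipeline in place of CountVectorizer:
tfidf_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
tfidf_pipe.fit(text_train, y_train)
tfidf_pipe.score(text_test, y_test)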