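Goals
- Build a model that classifies handwritten digits (0-9).
- Examine the prediction uncertainty of linear classification models.
- Understand the shape of image data.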
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
digit_data = pd.read_csv('data/digit_train.csv')
digit_data.head()
digit_data.shape
EDA
img0 = digit_data.iloc[0, 1:]
print(img0.max())
print(img0.min())
plt.hist(img0, bins = 255)
plt.show()
plt.imshow(img0.values.reshape(28, 28))
# plt.imshow(img0.values.reshape(28, 28), cmap='gray')
plt.show()
img0 = digit_data.iloc[100, 1:]
plt.imshow(img0.values.reshape(28, 28))
plt.show()
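To make the image layout concrete, here is a minimal sketch of what `reshape(28, 28)` does to one row of pixels. The array `flat` is a hypothetical stand-in for the 784 pixel columns of a single CSV row, not the actual data.

```python
import numpy as np

# Hypothetical stand-in for one CSV row: 784 pixel values in row-major order.
flat = np.arange(28 * 28)

# reshape(28, 28) fills the grid row by row, so pixel k lands at (k // 28, k % 28).
grid = flat.reshape(28, 28)

print(grid.shape)   # (28, 28)
print(grid[1, 0])   # 28: the first pixel of the second image row
```

So each row of the table is just a 28x28 image flattened into one long vector, with the label stored in the first column.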
Extracting 5,000 images
X = digit_data.iloc[:5000, 1:]
y = digit_data.iloc[:5000, 0]
print(X.shape)
print(y.shape)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
Modeling (cross-validation)
- KNN
- Decision tree
- Logistic regression
- Linear SVM
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier()
tree = DecisionTreeClassifier()
logi = LogisticRegression()
svm = LinearSVC()
knn_result = cross_val_score(knn, X_train, y_train, cv=5)
tree_result = cross_val_score(tree, X_train, y_train, cv=5)
logi_result = cross_val_score(logi, X_train, y_train, cv=5)
svm_result = cross_val_score(svm, X_train, y_train, cv=5)
print('knn : ', knn_result.mean())
print('tree : ', tree_result.mean())
print('logi : ', logi_result.mean())
print('svm : ', svm_result.mean())
tree2 = DecisionTreeClassifier(max_depth=5)
tree_result2 = cross_val_score(tree2, X_train, y_train, cv=5)
print('tree(max_depth:5)', tree_result2.mean())
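As a rough picture of what `cv=5` means, the sketch below splits the sample indices into five folds, holds each one out in turn, and averages the scores. The helpers `k_fold_indices` and `cross_val_mean` are hypothetical names written for this illustration. They use plain contiguous folds, whereas `cross_val_score` uses stratified folds for classifiers, so this is only an approximation of the idea.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mean(score_fn, n_samples, k=5):
    """Average score_fn(train_idx, val_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fn(train_idx, val_idx))
    return sum(scores) / k


# 3,500 training rows (70% of 5,000) split into 5 folds.
folds = k_fold_indices(3500, 5)
print([len(f) for f in folds])  # [700, 700, 700, 700, 700]
```

Each model is therefore trained five times on 2,800 rows and validated on the remaining 700, and the printed value is the mean of those five accuracies.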
Scaling
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(X_train)
X_train_scaled = min_max_scaler.transform(X_train)
X_test_scaled = min_max_scaler.transform(X_test)
plt.hist(X_train_scaled[0])
plt.show()
svm_result2 = cross_val_score(svm,X_train_scaled, y_train,cv=5)
print(svm_result2.mean())
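The min-max transform itself is simple enough to write by hand. The sketch below is a hand-rolled approximation, not scikit-learn's implementation: `fit_min_max` and `apply_min_max` are hypothetical helpers, and the tiny arrays are made-up values. It shows why the scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np


def fit_min_max(X):
    """Learn the per-column minimum and range from the training data only."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption of this sketch: constant columns get range 1 so they map to 0.
    col_range = np.where(col_range == 0, 1, col_range)
    return col_min, col_range


def apply_min_max(X, col_min, col_range):
    """Scale X with statistics learned elsewhere (here: from the training set)."""
    return (X - col_min) / col_range


train = np.array([[0.0, 10.0, 0.0], [255.0, 20.0, 0.0]])
test = np.array([[51.0, 25.0, 0.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(train_scaled)
print(test_scaled)  # second column is 1.5: test values can fall outside [0, 1]
```

Pixel columns that are always 0 (image borders, for example) have zero range, which is why the constant-column case needs handling at all.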
Checking on the test data
knn.fit(X_train, y_train)
tree.fit(X_train, y_train)
logi.fit(X_train, y_train)
svm.fit(X_train_scaled, y_train)
print('knn : ', knn.score(X_test, y_test))
print('tree : ', tree.score(X_test, y_test))
print('logi : ', logi.score(X_test, y_test))
# svm was fitted on scaled data, so it must be scored on the scaled test set
print('svm : ', svm.score(X_test_scaled, y_test))
Prediction uncertainty
knn.predict_proba(X_test[50:80])
tree.predict_proba(X_test[100:200])
tree3 = DecisionTreeClassifier(max_depth=5)
tree3.fit(X_train, y_train)
tree3.predict_proba(X_test[100:200])
logi.predict_proba(X_test[50:80])
img2 = X_test.iloc[52]
plt.imshow(img2.values.reshape(28, 28), cmap="gray")
plt.show()
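The shape of these probability tables follows from how each model produces them. With the default settings used here (5 neighbors, uniform weights), KNN's `predict_proba` is the fraction of neighbors voting for each class, so every value is a multiple of 0.2. A fully grown tree usually ends in pure leaves and therefore outputs mostly 0s and 1s, while the `max_depth=5` tree returns the class fractions inside each leaf. The sketch below illustrates only the KNN case. `vote_fractions` is a hypothetical helper and the neighbor labels are made up.

```python
from collections import Counter


def vote_fractions(neighbor_labels, classes):
    """Fraction of neighbors voting for each class (uniform weights)."""
    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    return [counts[c] / k for c in classes]


classes = list(range(10))

# Hypothetical labels of the 5 nearest neighbors of one test image.
probs = vote_fractions([3, 3, 5, 3, 8], classes)
print(probs)
# [0.0, 0.0, 0.0, 0.6, 0.0, 0.2, 0.0, 0.0, 0.2, 0.0]
```

A row like this says the model leans toward 3 but two of the five neighbors disagree, which is the kind of uncertain sample worth plotting.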
Classification metrics
from sklearn.metrics import classification_report
pre = knn.predict(X_test)
# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, pre))
pre2 = tree.predict(X_test)
print(classification_report(y_test, pre2))
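To see what the precision and recall columns in the report mean, and why `classification_report` needs the true labels as its first argument, here is a small hand computation for a single class. `precision_recall` is a hypothetical helper and the labels are made-up values.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # how often the model said `label`
    actual = sum(t == label for t in y_true)     # how often `label` was correct
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


y_true = [3, 3, 3, 3, 8, 8]
y_pred = [3, 3, 8, 8, 8, 8]

print(precision_recall(y_true, y_pred, 3))  # (1.0, 0.5)
print(precision_recall(y_pred, y_true, 3))  # (0.5, 1.0): swapped arguments swap the metrics
```

Accuracy is unaffected by the argument order, but the per-class precision and recall trade places, so a swapped call gives a misleading report.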